test collections
Recently Published Documents

Total documents: 101 (last five years: 25)
H-index: 13 (last five years: 2)

2022 ◽  
Vol 40 (4) ◽  
pp. 1-35
Author(s):  
Tetsuya Sakai ◽  
Sijie Tao ◽  
Zhaohao Zeng

In the context of depth-k pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 dataset, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in terms of assessment efficiency. Our main findings are: (a) The presentation order has no substantial impact on assessment efficiency; (b) While the presentation order substantially affects which documents are judged (highly) relevant, the difference between the inter-assessor agreement under the PRI condition and that under the RND condition is of no practical significance; (c) Different system rankings under the PRI condition are substantially more similar to one another than those under the RND condition; and (d) PRI-based relevance assessment files (qrels) are substantially and statistically significantly more robust to new systems than RND-based ones. Finding (d) suggests that PRI helps the assessors identify relevant documents that affect the evaluation of many existing systems, including those that did not contribute to the pools.
Hence, if researchers need to evaluate their current IR systems using legacy IR test collections, we recommend the use of those constructed using the PRI approach unless they have a good reason to believe that their systems retrieve relevant documents that are vastly different from the pooled documents. While this robustness of PRI may also mean that the PRI-based pools are biased against future systems that retrieve highly novel relevant documents, one should note that there is no evidence that RND is any better in this respect.
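The contrast between the two orderings can be sketched in a few lines. This is a simplified illustration with made-up run data: the real NTCIR prioritisation formula is more involved than the priority key used here.

```python
import random

def depth_k_pool(runs, k):
    """Union of the top-k documents from each run's ranked list."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

def pri_order(runs, k):
    """Prioritised ordering, heavily simplified: documents retrieved
    near the top by many runs are judged first. This sketch sorts by
    (number of runs pooling the document, best rank among them); the
    actual NTCIR PRI formula is more involved."""
    pool = depth_k_pool(runs, k)
    def priority(doc):
        ranks = [r.index(doc) for r in runs if doc in r[:k]]
        return (-len(ranks), min(ranks))
    return sorted(pool, key=priority)

def rnd_order(runs, k, seed=0):
    """Randomised ordering: the same pool, shuffled."""
    docs = sorted(depth_k_pool(runs, k))
    random.Random(seed).shuffle(docs)
    return docs

runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d5", "d1"]]
print(pri_order(runs, k=2))  # ['d2', 'd1', 'd5']
```

Both orderings contain exactly the same documents; only the sequence presented to assessors differs, which is why the study above can compare them label-for-label on the same pools.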


2021 ◽  
Vol 39 (4) ◽  
pp. 1-42
Author(s):  
Zeyang Liu ◽  
Ke Zhou ◽  
Max L. Wilson

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that any natural language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics only achieve weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
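The fidelity perspective boils down to pairwise agreement between a metric and users. A minimal sketch of such a concordance computation, with hypothetical response ids and data structures:

```python
def concordance(metric_scores, user_prefs):
    """Fraction of response pairs on which a metric's preference agrees
    with the user's. `metric_scores` maps a pair of response ids to
    their two metric scores; `user_prefs` maps the same pair to the id
    the user preferred. Both structures are illustrative, not the
    paper's actual format."""
    agree = 0
    for (a, b), (score_a, score_b) in metric_scores.items():
        metric_pick = a if score_a > score_b else b
        agree += metric_pick == user_prefs[(a, b)]
    return agree / len(metric_scores)

scores = {("r1", "r2"): (0.7, 0.4), ("r3", "r4"): (0.2, 0.5)}
prefs = {("r1", "r2"): "r1", ("r3", "r4"): "r3"}
print(concordance(scores, prefs))  # 0.5
```

A "weak correlation with user preference" in this setting means the fraction stays close to the 0.5 one would get by picking at random.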


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 402
Author(s):  
Stefano Marchesin ◽  
Giorgio Maria Di Nunzio ◽  
Maristella Agosti

In Information Retrieval (IR), the semantic gap represents the mismatch between users’ queries and how retrieval models answer these queries. In this paper, we explore how to use external knowledge resources to enhance bag-of-words representations and reduce the effect of the semantic gap between queries and documents. In this regard, we propose several simple but effective knowledge-based query expansion and reduction techniques, and we evaluate them for the medical domain. The proposed query reformulations are used to increase the probability of retrieving relevant documents through the addition to, or the removal from, the original query of highly specific terms. The experimental analyses on different test collections for Precision Medicine IR show the effectiveness of the developed techniques. In particular, a specific subset of query reformulations allows retrieval models to achieve top performing results in all the considered test collections.
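The expansion/reduction idea can be sketched generically. The knowledge map and IDF-based specificity below are stand-ins, not the paper's actual resources or scoring:

```python
def expand_query(query_terms, knowledge, max_new=3):
    """Knowledge-based expansion: append highly specific related terms
    drawn from an external resource. `knowledge` is a stand-in for a
    real resource (e.g. a medical ontology) that maps a term to more
    specific related terms."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in knowledge.get(term, [])[:max_new]:
            if related not in expanded:
                expanded.append(related)
    return expanded

def reduce_query(query_terms, idf, threshold=1.0):
    """Reduction sketch: keep only terms whose specificity (here
    approximated by inverse document frequency) exceeds a threshold."""
    return [t for t in query_terms if idf.get(t, 0.0) > threshold]

knowledge = {"melanoma": ["BRAF", "cutaneous melanoma"]}
print(expand_query(["melanoma", "treatment"], knowledge))
# ['melanoma', 'treatment', 'BRAF', 'cutaneous melanoma']
```

Either reformulation is applied before retrieval, so any bag-of-words model can consume the result unchanged.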


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Stefano Marchesin

In this thesis we tackle the semantic gap, a long-standing problem in Information Retrieval (IR). The semantic gap can be described as the mismatch between users' queries and the way retrieval models answer such queries. Two main lines of work have emerged over the years to bridge the semantic gap: (i) the use of external knowledge resources to enhance the bag-of-words representations used by lexical models, and (ii) the use of semantic models to perform matching between the latent representations of queries and documents. To deal with this issue, we first perform an in-depth evaluation of lexical and semantic models through different analyses [Marchesin et al., 2019]. The objective of this evaluation is to understand what features lexical and semantic models share, if their signals are complementary, and how they can be combined to effectively address the semantic gap. In particular, the evaluation focuses on (semantic) neural models and their critical aspects. Each analysis brings a different perspective to the understanding of semantic models and their relation with lexical models. The outcomes of this evaluation highlight the differences between lexical and semantic signals, and the need to combine them at the early stages of the IR pipeline to effectively address the semantic gap. Then, we build on the insights of this evaluation to develop lexical and semantic models addressing the semantic gap. Specifically, we develop unsupervised models that integrate knowledge from external resources, and we evaluate them for the medical domain - a domain with a high social value, where the semantic gap is prominent, and the large presence of authoritative knowledge resources allows us to explore effective ways to address it. For lexical models, we investigate how - and to what extent - concepts and relations stored within knowledge resources can be integrated in query representations to improve the effectiveness of lexical models.
Thus, we propose and evaluate several knowledge-based query expansion and reduction techniques [Agosti et al., 2018, 2019; Di Nunzio et al., 2019]. These query reformulations are used to increase the probability of retrieving relevant documents by adding to or removing from the original query highly specific terms. The experimental analyses on different test collections for Precision Medicine - a particular use case of Clinical Decision Support (CDS) - show the effectiveness of the proposed query reformulations. In particular, a specific subset of query reformulations allows lexical models to achieve top performing results in all the considered collections. Regarding semantic models, we first analyze the limitations of the knowledge-enhanced neural models presented in the literature. Then, to overcome these limitations, we propose SAFIR [Agosti et al., 2020], an unsupervised knowledge-enhanced neural framework for IR. SAFIR integrates external knowledge in the learning process of neural IR models and it does not require labeled data for training. Thus, the representations learned within this framework are optimized for IR and encode linguistic features that are relevant to address the semantic gap. The evaluation on different test collections for CDS demonstrates the effectiveness of SAFIR when used to perform retrieval over the entire document collection or to retrieve documents for Pseudo Relevance Feedback (PRF) methods - that is, when it is used at the early stages of the IR pipeline. In particular, the quantitative and qualitative analyses highlight the ability of SAFIR to retrieve relevant documents affected by the semantic gap, as well as the effectiveness of combining lexical and semantic models at the early stages of the IR pipeline - where the complementary signals they provide can be used to obtain better answers to semantically hard queries.
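The "combine lexical and semantic signals at the early stages" idea is, at its simplest, score fusion over the candidate set. A generic sketch of such a combination (not the SAFIR model itself, and the documents and scores are made up):

```python
def fuse(lexical, semantic, alpha=0.5):
    """Linear fusion of lexical (e.g. BM25) and semantic (e.g. neural
    embedding) retrieval scores after min-max normalisation, so the
    two score distributions are comparable before mixing."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against a constant score list
        return {d: (s - lo) / span for d, s in scores.items()}
    ln, sn = norm(lexical), norm(semantic)
    return {d: alpha * ln[d] + (1 - alpha) * sn.get(d, 0.0) for d in ln}

lexical = {"d1": 2.0, "d2": 0.0}
semantic = {"d1": 0.0, "d2": 1.0}
print(fuse(lexical, semantic, alpha=0.75))  # {'d1': 0.75, 'd2': 0.25}
```

The weight alpha governs how much the complementary semantic signal can rescue documents the lexical model scores poorly, which is exactly where semantically hard queries benefit.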


2021 ◽  
Vol 7 ◽  
Author(s):  
Felicitas Löffler ◽  
Andreas Schuldt ◽  
Birgitta König-Ries ◽  
Helge Bruelheide ◽  
Friederike Klan

Searching for scientific datasets is a prominent task in scholars' daily research practice. A variety of data publishers, archives and data portals offer search applications that allow the discovery of datasets. The evaluation of such dataset retrieval systems requires proper test collections, including questions that reflect real-world information needs of scholars, a set of datasets and human judgements assessing the relevance of the datasets to the questions in the benchmark corpus. Unfortunately, only very few test collections exist for dataset search. In this paper, we introduce the BEF-China test collection, the very first test collection for dataset retrieval in biodiversity research, a research field with an increasing demand for data discovery services. The test collection consists of 14 questions, a corpus of 372 datasets from the BEF-China project and binary relevance judgements provided by a biodiversity expert.
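With binary judgements, evaluating a dataset retrieval run against such a collection is straightforward. A minimal sketch with a hypothetical miniature collection in the same shape (question ids, ranked dataset ids, and a qrels set per question):

```python
def precision_at_k(ranking, relevant, k):
    """Precision@k for one question, given the set of dataset ids a
    domain expert judged relevant (binary judgements)."""
    return sum(doc in relevant for doc in ranking[:k]) / k

# Hypothetical miniature collection: question id -> judged-relevant
# dataset ids, and a system's ranked list per question.
qrels = {"q1": {"ds_12", "ds_40"}}
run = {"q1": ["ds_12", "ds_7", "ds_40", "ds_3"]}
print(precision_at_k(run["q1"], qrels["q1"], k=4))  # 0.5
```

With only 14 questions, per-question scores matter more than a single average, so reporting the full score distribution is the safer reading of results on a collection this size.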


2020 ◽  
Vol 54 (2) ◽  
pp. 1-2
Author(s):  
Dan Li

The availability of test collections in the Cranfield paradigm has significantly benefited the development of models, methods and tools in information retrieval. Such test collections typically consist of a set of topics, a document collection and a set of relevance assessments. Constructing these test collections requires effort on several fronts, such as topic selection, document selection, relevance assessment, and relevance label aggregation. The work in the thesis provides a fundamental way of constructing and utilizing test collections in information retrieval in an effective, efficient and reliable manner. To that end, we have focused on four aspects. We first study the document selection issue when building test collections. We devise an active sampling method for efficient large-scale evaluation [Li and Kanoulas, 2017]. Different from past sampling-based approaches, we account for the fact that some systems are of higher quality than others, and we design the sampling distribution to over-sample documents from these systems. At the same time, the estimated evaluation measures are unbiased, and assessments can be used to evaluate new, novel systems without introducing any systematic error. Then a natural further step is determining when to stop the document selection and assessment procedure. This is an important but understudied problem in the construction of test collections. We consider both the gain of identifying relevant documents and the cost of assessing documents as the optimization goals. We handle the problem under the continuous active learning framework by jointly training a ranking model to rank documents, and estimating the total number of relevant documents in the collection using a "greedy" sampling method [Li and Kanoulas, 2020]. The next stage of constructing a test collection is assessing relevance.
We study how to denoise relevance assessments by aggregating from multiple crowd annotation sources to obtain high-quality relevance assessments. This helps to boost the quality of relevance assessments acquired in a crowdsourcing manner. We assume a Gaussian process prior on query-document pairs to model their correlation. The proposed model, CrowdGP, shows good performance in terms of inferring true relevance labels. Besides, it allows predicting relevance labels for new tasks that have no crowd annotations, which is a new functionality of CrowdGP. Ablation studies demonstrate that the effectiveness is attributed to the modelling of task correlation based on the auxiliary information of tasks and the prior relevance information of documents to queries. After a test collection is constructed, it can be used to either evaluate retrieval systems or train a ranking model. We propose to use it to optimize the configuration of retrieval systems. We use a Bayesian optimization approach to model the effect of a δ-step in the configuration space on the effectiveness of the retrieval system, suggest the use of different similarity functions (covariance functions) for continuous and categorical values, and examine their ability to effectively and efficiently guide the search in the configuration space [Li and Kanoulas, 2018]. Beyond the algorithmic and empirical contributions, work done as part of this thesis also contributed to the research community as the CLEF Technology Assisted Reviews in Empirical Medicine Tracks in 2017, 2018, and 2019 [Kanoulas et al., 2017, 2018, 2019]. Awarded by: University of Amsterdam, Amsterdam, The Netherlands. Supervised by: Evangelos Kanoulas. Available at: https://dare.uva.nl/search?identifier=3438a2b6-9271-4f2c-add5-3c811cc48d42.
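The unbiasedness claim for non-uniform document sampling rests on inverse-probability weighting. A toy sketch of the idea (a Hansen-Hurwitz style estimator under with-replacement sampling; the names, the judging interface, and the weights are illustrative, not the thesis's actual method):

```python
import random

def estimate_relevant(docs, weights, judge, n_samples, seed=0):
    """Draw documents with replacement from a non-uniform distribution
    (e.g. over-sampling documents retrieved by strong systems) and
    estimate the total number of relevant documents. Dividing each
    judgement by its document's sampling probability keeps the
    estimate unbiased despite the skewed sampling."""
    rng = random.Random(seed)
    total_weight = sum(weights)
    probs = {d: w / total_weight for d, w in zip(docs, weights)}
    sampled = rng.choices(docs, weights=weights, k=n_samples)
    return sum(judge(d) / probs[d] for d in sampled) / n_samples

# Sanity check: with uniform weights and every document relevant, the
# estimate equals the collection size whatever the sample contains.
print(estimate_relevant(["a", "b"], [1, 1], lambda d: 1, 4))  # 2.0
```

Skewing the weights toward strong systems reduces the variance of the estimate when relevant documents concentrate in those systems' pools, while the reweighting preserves unbiasedness for evaluating new systems.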


2020 ◽  
Vol 54 (2) ◽  
pp. 1-10
Author(s):  
Avi Arampatzis ◽  
Linda Cappellato ◽  
Carsten Eickhoff ◽  
Nicola Ferro ◽  
Hideo Joho ◽  
...  

This is a report on the tenth edition of the Conference and Labs of the Evaluation Forum (CLEF 2020), (virtually) held from September 22--25, 2020, in Thessaloniki, Greece. CLEF was a four day event combining a Conference and an Evaluation Forum. The Conference featured keynotes by Ellen Voorhees and Yiannis Kompatsiaris, and presentation of peer-reviewed research papers covering a wide range of topics in addition to many posters. The Evaluation Forum consisted of twelve Labs: ARQMath, BioASQ, CheckThat!, ChEMU, CLEF eHealth, eRisk, HIPE, ImageCLEF, LifeCLEF, LiLAS, PAN, and Touché, addressing a wide range of tasks, media, languages, and ways to go beyond standard test collections.


2020 ◽  
Vol 54 (2) ◽  
pp. 1-2
Author(s):  
Kevin Roitero

To evaluate Information Retrieval (IR) effectiveness, a possible approach is to use test collections, which are composed of a collection of documents, a set of descriptions of information needs (called topics), and a set of documents judged relevant to each topic. Test collections are modelled in a competition scenario: for example, in the well-known TREC initiative, participants run their own retrieval systems over a set of topics and they provide a ranked list of retrieved documents; some of the retrieved documents (usually the first ranked) constitute the so-called pool, and their relevance is evaluated by human assessors; the document list is then used to compute effectiveness metrics and rank the participant systems. Private Web Search companies also run their in-house evaluation exercises; although the details are mostly unknown, and the aims are somehow different, the overall approach shares several issues with the test collection approach. The aim of this work is to: (i) develop and improve some state-of-the-art work on the evaluation of IR effectiveness while saving resources, and (ii) propose a novel, more principled and engineered, overall approach to test collection based effectiveness evaluation. In this thesis we focus on three main directions: the first part investigates the use of few topics (i.e., information needs) in retrieval evaluation, presenting an extensive study of the effect of using fewer topics in terms of number of topics, topic subsets, and statistical power. The second part of this thesis discusses the evaluation without relevance judgements, reproducing, extending, and generalizing state-of-the-art methods and investigating their combinations by means of data fusion techniques and machine learning.
Finally, the third part uses crowdsourcing to gather relevance labels, and in particular shows the effect of using fine-grained judgement scales; furthermore, it explores methods to transform judgements between different relevance scales. Awarded by: University of Udine, Udine, Italy on 19 March 2020. Supervised by: Professor Stefano Mizzaro. Available at: https://kevinroitero.com/resources/kr-phd-thesis.pdf.
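Transforming between relevance scales is, in its simplest form, a thresholding problem. A minimal sketch mapping a fine-grained 0-100 judgement down to binary and 4-level graded scales; the cut points here are illustrative, not the ones studied in the thesis:

```python
def to_binary(score, threshold=60):
    """Collapse a fine-grained (0-100) judgement into a binary label.
    The threshold is an illustrative choice."""
    return int(score >= threshold)

def to_graded(score, cuts=(30, 60, 85)):
    """Map a 0-100 judgement onto a 4-level graded scale (0-3) by
    counting how many cut points the score reaches."""
    return sum(score >= c for c in cuts)

print(to_binary(70), to_graded(70))  # 1 2
```

The interesting question the thesis raises is where such cut points should sit so that the coarse labels preserve the system ranking induced by the fine-grained ones.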

