document collections
Recently Published Documents

TOTAL DOCUMENTS: 369 (last five years: 59)
H-INDEX: 27 (last five years: 2)

2022 ◽  
Vol 40 (3) ◽  
pp. 1-24
Author(s):  
Jiaul H. Paik ◽  
Yash Agrawal ◽  
Sahil Rishi ◽  
Vaishal Shah

Existing probabilistic retrieval models do not restrict the domain of the random variables that they deal with. In this article, we show that the upper bound of the normalized term frequency (tf) from the relevant documents is much smaller than the upper bound of the normalized tf from the whole collection. As a result, the existing models suffer from two major problems: (i) the domain mismatch causes a data modeling error, and (ii) since the outliers have very large magnitudes and the retrieval models follow the tf hypothesis, the combination of these two factors tends to overestimate the relevance score. To address these problems, we propose novel weighted probabilistic models based on truncated distributions. We evaluate our models on a set of large document collections, demonstrating significant performance improvements over six existing probabilistic models.
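
To make the domain-mismatch point concrete, here is a toy sketch (not the paper's actual model): it estimates an upper bound on normalized tf from the relevant documents and scores under an exponential distribution truncated to that domain, so that collection-level outliers cannot inflate the score. The distribution and the `lam` parameter are stand-ins.

```python
import math

def normalized_tf(tf, doc_len):
    # Length-normalized term frequency in [0, 1].
    return tf / doc_len

def truncated_exp_logpdf(x, lam, upper):
    # Log-density of an Exponential(lam) truncated to [0, upper].
    # Renormalizing by F(upper) removes the mass beyond the bound,
    # so extreme values cannot be overweighted.
    x = min(x, upper)  # clamp collection-level outliers into the domain
    log_norm = math.log1p(-math.exp(-lam * upper))  # log F(upper)
    return math.log(lam) - lam * x - log_norm

# Toy data: the max normalized tf among relevant documents is far below
# the collection-wide max, which is the gap the abstract describes.
relevant = [(3, 120), (5, 200), (2, 90)]      # (tf, doc_len) pairs
collection = relevant + [(400, 500)]          # one extreme outlier
upper = max(normalized_tf(tf, dl) for tf, dl in relevant)
print(upper, max(normalized_tf(tf, dl) for tf, dl in collection))
print(truncated_exp_logpdf(0.8, lam=4.0, upper=upper))
```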


2022 ◽  
Vol 40 (1) ◽  
pp. 1-32
Author(s):  
Joel Mackenzie ◽  
Matthias Petri ◽  
Alistair Moffat

Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage of document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-k disjunctions but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency, which is essential in operational scenarios where a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited to comply with SLA requirements.
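
A rough illustration of the anytime idea (not the authors' implementation): visit document-ordered segments in decreasing order of an estimated score contribution, and stop when the latency budget expires, returning the best results seen so far. All names here are hypothetical.

```python
import heapq
import time

def anytime_query(segments, score_segment, estimate, budget_ms, k=10):
    """Process index segments in decreasing order of their estimated
    contribution; stop once the latency budget is exhausted.
    `segments`, `score_segment`, and `estimate` are placeholders."""
    deadline = time.monotonic() + budget_ms / 1000.0
    top_k = []  # min-heap of (score, doc_id); smallest is evicted first
    for seg in sorted(segments, key=estimate, reverse=True):
        if time.monotonic() >= deadline:
            break  # anytime: terminate early, keep what we have
        for doc_id, score in score_segment(seg):
            heapq.heappush(top_k, (score, doc_id))
            if len(top_k) > k:
                heapq.heappop(top_k)
    return sorted(top_k, reverse=True)
```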


Author(s):  
Meng Yuan ◽  
Justin Zobel ◽  
Pauline Lin

Clustering of the contents of a document corpus is used to create sub-corpora that are intended to consist of documents related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, for measuring clustering effectiveness. Results with our new extrinsic techniques, based on relevance judgements or retrieved documents, demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
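
As one example of how retrieval-based information can assess cluster quality, here is a sketch of a simple extrinsic measure assuming relevance judgements are available: for each query, it asks what fraction of the query's relevant documents land in the query's single best cluster. It is illustrative only; the paper's exact measures are not reproduced here.

```python
from collections import Counter

def relevance_concentration(qrels, cluster_of):
    """qrels: query -> set of relevant doc ids;
    cluster_of: doc id -> cluster id.
    Returns the mean, over queries, of the fraction of relevant
    documents falling in that query's most-populated cluster."""
    scores = []
    for rel_docs in qrels.values():
        counts = Counter(cluster_of[d] for d in rel_docs if d in cluster_of)
        if counts:
            scores.append(max(counts.values()) / sum(counts.values()))
    return sum(scores) / len(scores) if scores else 0.0
```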


2022 ◽  
pp. 214-234
Author(s):  
Tugce Aldemir ◽  
Amine Hatun Ataş ◽  
Berkan Celik

This formative research study is an attempt to develop a design model for gamified learning experiences situated in real-life educational contexts. This chapter reports on the overall gamification model, with emphasis on the contexts and their interactions. With this focus, the chapter aims to offer an alternative to existing gamification design praxis in education, which mainly focuses on separate game elements, by arguing that designing a gamified learning experience requires a systematic approach that considers the interrelated dimensions and their interplay. The study was conducted throughout the 2014-15 academic year, and the data were collected from two separate groups of pre-service teachers through observations and document collections (n=118) and four sets of interviews (n=42). The results showed that gamification design has intertwined components that form a fuzzy design model: GELD. The findings also support the complex and dynamic nature of gamified learning design, and the need for a more systematic approach to the design and development of such experiences.


2021 ◽  
Author(s):  
Anna Friedlander

The sheer volume of data to be produced by the next generation of radio telescopes—exabytes of data on hundreds of millions of objects—makes automated methods for the detection of astronomical objects ("sources") essential. Of particular importance are low surface brightness objects, which are not well found by current automated methods. This thesis explores Bayesian methods for source detection that use Dirichlet or multinomial models for pixel intensity distributions in discretised radio astronomy images. A novel image discretisation method that incorporates uncertainty about how the image should be discretised is developed. Latent Dirichlet allocation — a method originally developed for inferring latent topics in document collections — is used to estimate source and background distributions in radio astronomy images. A new Dirichlet-multinomial ratio, indicating how well a region conforms to a well-specified model of background versus a loosely-specified model of foreground, is derived. Finally, latent Dirichlet allocation and the Dirichlet-multinomial ratio are combined for source detection in astronomical images. The methods developed in this thesis perform source detection well in comparison to two widely-used source detection packages and, importantly, find dim sources not well found by other algorithms.
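
A sketch of a Dirichlet-multinomial log-marginal-likelihood ratio of the kind described, comparing a sharply specified background prior against a vague foreground prior over discretised pixel-intensity bins. The priors, data, and sign convention are assumptions for illustration, not the thesis's exact formulation; the multinomial coefficient cancels in the ratio and is omitted.

```python
import numpy as np
from scipy.special import gammaln

def dm_log_marginal(counts, alpha):
    # Log marginal likelihood of bin counts under a Dirichlet-multinomial
    # with concentration vector alpha (multinomial coefficient omitted).
    n, a0 = counts.sum(), alpha.sum()
    return (gammaln(a0) - gammaln(n + a0)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def dm_ratio(counts, bg_alpha, fg_alpha):
    # Positive values favour the loosely specified foreground model,
    # negative values the well-specified background model.
    return dm_log_marginal(counts, fg_alpha) - dm_log_marginal(counts, bg_alpha)

counts = np.array([40.0, 5.0, 3.0, 2.0])   # intensity histogram of a region
bg = np.array([80.0, 10.0, 6.0, 4.0])      # sharp prior fit to background
fg = np.ones(4)                            # flat, loosely specified prior
print(dm_ratio(counts, bg, fg))
```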


2021 ◽  
Vol 8 (5) ◽  
pp. 1029
Author(s):  
Aisyatul Maulidah ◽  
Fitra A. Bachtiar

Google Review, one of the features of Google Maps, can serve as a medium for measuring visitor satisfaction at Jawa Timur Park 3 (Jatim Park 3). However, the thousands of reviews and the lack of a tool for managing review data make it difficult for the park's management to explore and analyze visitor feedback in detail. Association Rule Mining (ARM) is a text mining technique that supports knowledge discovery in large document collections: it links keywords across comments to find words that frequently appear together, and it is among the most popular techniques for uncovering hidden relationships between variables. The Apriori algorithm was chosen for the implementation because it is considered the most efficient. The study uses 1067 visitor reviews written in Indonesian between January and April 2019. Based on interviews, the data were classified into eight aspects using predetermined keywords: road access, cost, cleanliness, satisfaction, crowding, service, security, and technology. Tests were conducted to determine how minimum support and minimum confidence affect the rules produced. Every aspect yielded word associations under the Apriori algorithm, and the resulting rules achieved an average lift ratio above 1, where a rule with a lift ratio above 1 is distinctive among the rules formed from the association. The resulting rules are visualized to show the relationships between keywords and aspects in the Jatim Park 3 review data, mapping which services attract the attention of visitors.
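
The abstract does not name the tooling, but the pipeline it describes maps directly onto a standard Apriori workflow; here is a minimal sketch with the mlxtend library, using invented reviews and thresholds: one-hot encode each review's matched keywords, mine frequent itemsets under a minimum support, then keep association rules whose lift exceeds 1.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each review reduced to its matched aspect keywords (toy data).
reviews = [["clean", "service", "cost"],
           ["crowded", "cost"],
           ["clean", "service"],
           ["technology", "service", "clean"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(reviews).transform(reviews), columns=te.columns_)

# Frequent keyword sets above the minimum support threshold.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Keep rules with lift above 1, i.e. associations stronger than chance,
# mirroring the filtering described in the abstract.
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```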


2021 ◽  
pp. 1-20 ◽  
Author(s):  
Luwei Ying ◽  
Jacob M. Montgomery ◽  
Brandon M. Stewart

Topic models, as developed in computer science, are effective tools for exploring and summarizing large document collections. When applied in social science research, however, they are commonly used for measurement, a task that requires careful validation to ensure that the model outputs actually capture the desired concept of interest. In this paper, we review current practices for topic validation in the field and show that extensive model validation is increasingly rare, or at least not systematically reported in papers and appendices. To supplement current practices, we refine an existing crowd-sourcing method by Chang and coauthors for validating topic quality and go on to create new procedures for validating conceptual labels provided by the researcher. We illustrate our method with an analysis of Facebook posts by U.S. Senators and provide software and guidance for researchers wishing to validate their own topic models. While tailored, case-specific validation exercises will always be best, we aim to improve standard practices by providing a general-purpose tool to validate topics as measures.
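
The crowd-sourcing validation builds on the word-intrusion task of Chang and coauthors; as a hedged sketch (not the authors' released software), one such item can be generated from a fitted topic model by taking a topic's top words and planting a high-probability word from another topic:

```python
import random

def word_intrusion_task(topics, topic_id, n_top=5, seed=0):
    """topics: topic id -> words ranked by probability.
    Returns the shuffled word list to show annotators plus the intruder.
    Illustrative scaffolding only."""
    rng = random.Random(seed)
    top_words = list(topics[topic_id][:n_top])
    other = rng.choice([t for t in topics if t != topic_id])
    # The intruder is probable in another topic but not in this one.
    intruder = next(w for w in topics[other][:n_top]
                    if w not in topics[topic_id][:20])
    shown = top_words + [intruder]
    rng.shuffle(shown)
    return shown, intruder

topics = {0: ["tax", "budget", "spending", "deficit", "fiscal", "revenue"],
          1: ["troops", "military", "veterans", "defense", "war", "army"]}
print(word_intrusion_task(topics, 0))
```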


2021 ◽  
Author(s):  
Ashwini Sapkal ◽  
Chhavi ◽  
Shashank Sharma ◽  
Pradeep Kumar ◽  
Sachin Yadav

Information ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 348 ◽  
Author(s):  
Marten Düring ◽  
Roman Kalyakin ◽  
Estelle Bunout ◽  
Daniele Guido

The automated enrichment of mass-digitised document collections using techniques such as text mining is becoming increasingly popular. Enriched collections offer new opportunities for interface design to allow data-driven and visualisation-based search, exploration and interpretation. Most such interfaces integrate close and distant reading and represent semantic, spatial, social or temporal relations, but often lack contrastive views. Inspect and Compare (I&C) contributes to the current state of the art in interface design for historical newspapers with highly versatile side-by-side comparisons of query results and curated article sets based on metadata and semantic enrichments. I&C takes search queries and pre-curated article sets as inputs and allows comparisons based on the distributions of newspaper titles, publication dates and automatically generated enrichments, such as language, article types, topics and named entities. Contrastive views of such data reveal patterns, help humanities scholars improve search strategies, and facilitate a critical assessment of the overall data quality. I&C is part of the impresso interface for the exploration of digitised and semantically enriched historical newspapers.
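
In the spirit of I&C's contrastive views, a small sketch (column names and data are assumptions, not the impresso API) that places the normalised distributions of one metadata facet for two article sets side by side:

```python
import pandas as pd

def contrast(set_a, set_b, facet):
    # Normalised value distributions of one facet (e.g. newspaper title,
    # year, language) for a query result set and a curated article set.
    dist_a = set_a[facet].value_counts(normalize=True).rename("query results")
    dist_b = set_b[facet].value_counts(normalize=True).rename("curated set")
    return pd.concat([dist_a, dist_b], axis=1).fillna(0.0)

a = pd.DataFrame({"newspaper": ["Wort", "Wort", "Tageblatt"]})
b = pd.DataFrame({"newspaper": ["Tageblatt", "Tageblatt", "Wort"]})
print(contrast(a, b, "newspaper"))
```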

