document indexing
Recently Published Documents


Total documents: 105 (five years: 13)
H-index: 11 (five years: 1)

2020, pp. 016555152097743
Author(s): Ahmad Aghaebrahimian, Andy Stauder, Michael Ustaszewski

The Wikipedia category system was designed to enable browsing and navigation of Wikipedia. It is also a useful resource for knowledge organisation and document indexing, especially using automatic approaches. However, it has received little attention as a resource for manual indexing. In this article, a hierarchical taxonomy of three-level depth is extracted from the Wikipedia category system. The resulting taxonomy is explored as a lightweight alternative to expert-created knowledge organisation systems (e.g. library classification systems) for the manual labelling of open-domain text corpora. Combining quantitative and qualitative data from a crowd-based text labelling study, the validity of the taxonomy is tested and the results quantified in terms of interrater agreement. While the usefulness of the Wikipedia category system for automatic document indexing is documented in the pertinent literature, our results suggest that at least the taxonomy we derived from it is not a valid instrument for manual subject matter labelling of open-domain text corpora.
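The abstract does not say how the three-level taxonomy was extracted. One plausible reading is a depth-limited breadth-first traversal of the category graph, sketched below under stated assumptions: the toy graph `SUBCATEGORIES` and the function `extract_taxonomy` are hypothetical and do not come from the article. Because a Wikipedia category can have several parents, the sketch keeps only the first parent it encounters so that the result is a tree.

```python
from collections import deque

# Hypothetical toy category graph (parent -> subcategories); not real Wikipedia data.
SUBCATEGORIES = {
    "Main topics": ["Science", "Culture"],
    "Science": ["Physics", "Biology"],
    "Physics": ["Optics"],
    "Optics": ["Lasers"],
    "Culture": ["Science"],  # a category may have more than one parent
}


def extract_taxonomy(root, graph, max_depth=3):
    """Return {category: (parent, level)} for categories at most max_depth below root."""
    taxonomy = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the deepest allowed level
        for child in graph.get(node, []):
            if child in seen:
                continue  # keep the first parent found, which makes the result a tree
            seen.add(child)
            taxonomy[child] = (node, depth + 1)
            queue.append((child, depth + 1))
    return taxonomy


if __name__ == "__main__":
    for category, (parent, level) in extract_taxonomy("Main topics", SUBCATEGORIES).items():
        print(level, parent, "->", category)
```

In this toy graph, "Lasers" sits at level four and is therefore left out, while "Science" is attached to the root rather than to "Culture".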


2020, Vol 25 (2), pp. 42-52
Author(s): Hrytsiuk V.V.

The article defines an algorithm and details the sequential tasks for building an effective model for the automated classification of events in the information space. On the eve of and during the armed aggression of the Russian Federation against Ukraine, the consequences of negative external information influence were noticeable, so organizing and implementing countermeasures against such influence is urgent. An important component of this activity is the classification (clustering) of information events in order to analyze them further and form proposals for decisions on counteracting the negative information impact. Because counteracting such influence requires constantly processing a significant amount of information from the global information space and, in particular, from the information space of the state, the efficiency of this process is improved by automating its components. The automated classification algorithm consists of the following consecutive tasks: data retrieval; preselection of messages ("rough" classification); saving the preselected messages in the database; determining a set of indicators for the automated classification of information events; preprocessing each document (indexing); distributing messages into categories by criteria ("accurate" classification); presenting the information in a convenient form (visualization); and saving the classification results in the database. The article describes the content of each of these tasks. The proposed algorithm automatically divides information events (messages) of different kinds into categories (classes) in order to make the assessment of the level of negative information impact on target audiences more efficient and to enable a timely (proactive) response to its manifestations.
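The sequence of tasks listed in the abstract can be sketched as a small pipeline. This is only an illustration under assumptions: the keyword sets in `INDICATORS`, the scoring rule, and every function name are hypothetical, and an in-memory dictionary stands in for the database. The article's actual indicators and classifier are not given in the abstract.

```python
import re
from collections import Counter, defaultdict

# Hypothetical indicator sets (category -> keywords); not taken from the article.
INDICATORS = {
    "military": {"troops", "shelling", "army"},
    "economy": {"prices", "currency", "sanctions"},
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def preselect(messages, indicators):
    """'Rough' classification: keep messages that contain at least one indicator term."""
    vocabulary = set().union(*indicators.values())
    return [m for m in messages if vocabulary & set(tokenize(m))]


def index_document(message):
    """Preprocessing (indexing): map each term to its frequency in the message."""
    return Counter(tokenize(message))


def classify(term_counts, indicators):
    """'Accurate' classification: the category with the most indicator hits, or None."""
    scores = {
        category: sum(term_counts[term] for term in terms)
        for category, terms in indicators.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def run_pipeline(messages, indicators=INDICATORS):
    database = defaultdict(list)  # in-memory stand-in for the database
    for message in preselect(messages, indicators):
        category = classify(index_document(message), indicators)
        if category is not None:
            database[category].append(message)
    return dict(database)


def summarize(results):
    """Presentation step: number of messages per category."""
    return {category: len(items) for category, items in results.items()}
```

Calling `run_pipeline` on a list of message strings returns the messages grouped by category, and `summarize` reduces that result to per-category counts suitable for a simple chart or table.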


Author(s): Rida Khalloufi, Rachid El Ayachi, Mohamed Biniz, Mohamed Fakir, Muhammad Sarfraz

Document indexing is an active research area that attracts many researchers. It is generally used in information retrieval systems. Document indexing encompasses a set of approaches that can be applied to index a document using a corpus. Indexing has several advantages, such as accelerating the search process, finding the pertinent content related to a query, and reducing storage space. Using the entire document in the indexing process affects several parameters, such as indexing time, search time, and storage space. The focus of this chapter is to improve all of these parameters by proposing a new indexing approach. The goal of the proposed approach is to use summarization to reduce the size of documents without affecting their meaning.
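The idea of indexing a summary instead of the full document can be sketched as follows. The chapter's summarization method is not described in the abstract, so a simple frequency-based extractive summarizer is swapped in here purely for illustration; the function names `summarize` and `build_index` are assumptions. The scoring rule sums word frequencies, so it favors longer sentences, which is acceptable for a sketch but not a recommendation.

```python
import re
from collections import Counter, defaultdict


def sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole document."""
    freq = Counter(tokens(text))
    sents = sentences(text)
    ranked = sorted(sents, key=lambda s: sum(freq[t] for t in tokens(s)), reverse=True)
    keep = set(ranked[:max_sentences])
    return " ".join(s for s in sents if s in keep)  # preserve original order


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokens(text):
            index[term].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    docs = {
        "d1": "Indexing speeds up search. Indexing reduces search time and storage. "
              "The weather was nice.",
        "d2": "Cats sleep. Cats sleep a lot during the day.",
    }
    full_index = build_index(docs)
    summary_index = build_index({k: summarize(v) for k, v in docs.items()})
    print(len(full_index), "terms in the full index")
    print(len(summary_index), "terms in the summary index")
```

The summary index holds fewer terms, which reduces indexing time and storage, at the cost of missing queries that match only the discarded sentences.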


This paper describes CLAD (Cross-Language Analog Detector), a cross-language plagiarism detection method that compares a test document against indexed documents. The main difference between this method and existing ones is that it detects plagiarism across multiple languages, not only two. Terms are translated with a dictionary-based machine-translation method. CLAD's working process consists of two phases: document indexing and detection. The paper describes both phases.
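The two phases can be illustrated with a minimal sketch. It is not CLAD itself: the tiny dictionaries in `DICTIONARIES`, the choice of English as a pivot language, the Jaccard similarity measure, and the 0.5 threshold are all assumptions made for the example, since the abstract gives none of these details.

```python
import re

# Hypothetical bilingual dictionaries mapping terms to an English pivot.
DICTIONARIES = {
    "de": {"hund": "dog", "katze": "cat", "haus": "house"},
    "fr": {"chien": "dog", "chat": "cat", "maison": "house"},
    "en": {},
}


def to_pivot_terms(text, lang):
    """Dictionary-based term translation; unknown words are kept unchanged."""
    table = DICTIONARIES[lang]
    return {table.get(word, word) for word in re.findall(r"\w+", text.lower())}


def index_documents(docs):
    """Indexing phase: docs maps id -> (language, text); returns id -> pivot term set."""
    return {doc_id: to_pivot_terms(text, lang) for doc_id, (lang, text) in docs.items()}


def detect(test_text, test_lang, index, threshold=0.5):
    """Detection phase: indexed documents whose Jaccard similarity reaches the threshold."""
    query = to_pivot_terms(test_text, test_lang)
    hits = []
    for doc_id, terms in index.items():
        union = query | terms
        score = len(query & terms) / len(union) if union else 0.0
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    index = index_documents({
        "a": ("de", "Hund Katze Haus"),
        "b": ("fr", "chien chat"),
        "c": ("en", "stock market report"),
    })
    print(detect("dog cat house", "en", index))
```

Mapping every language to one pivot vocabulary is what lets a single index serve more than two languages; a real system would also need stemming and word-sense handling, which this sketch omits.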

