Max stable set problem to found the initial centroids in clustering problem

Awatif Karim; Chakir Loqman; Youssef Hami; Jaouad Boumhidi

doi:10.11591/ijeecs.v25.i1.pp569-579

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v25.i1.pp569-579 ◽

2022 ◽

Vol 25 (1) ◽

pp. 569

Author(s):

Awatif Karim ◽

Chakir Loqman ◽

Youssef Hami ◽

Jaouad Boumhidi

Keyword(s):

Document Clustering ◽

Large Data ◽

Hopfield Network ◽

Large Data Sets ◽

Stable Set ◽

Data Sets ◽

Clustering Problem ◽

Text Document ◽

Stable Set Problem

In this paper, we propose a new approach to document clustering using the K-Means algorithm. K-Means is sensitive to the random selection of the k cluster centroids in the initialization phase. To evaluate the quality of K-Means clustering, we propose to model the text document clustering problem as the max stable set problem (MSSP) and to use a continuous Hopfield network to solve the MSSP and obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists of finding the largest set of completely disconnected nodes in a graph, and in clustering, all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient on large data sets, is well optimized in terms of running time, and provides better clustering quality than other methods.
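The seeding idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract solves the MSSP with a continuous Hopfield network, whereas the sketch below swaps in a simple greedy heuristic. The graph construction (an edge joins two points closer than a hypothetical `threshold`) and the names `greedy_stable_set` and `k_means` are assumptions made for illustration only.

```python
import math

def greedy_stable_set(points, threshold):
    """Greedy maximal stable set in the graph linking points closer than threshold."""
    n = len(points)
    # Assumption: an edge joins two points that are "similar" (distance below threshold).
    neighbours = [
        {j for j in range(n) if j != i and math.dist(points[i], points[j]) < threshold}
        for i in range(n)
    ]
    chosen, blocked = [], set()
    # Visit low-degree nodes first, a common heuristic for finding large stable sets.
    for i in sorted(range(n), key=lambda i: len(neighbours[i])):
        if i not in blocked:
            chosen.append(i)
            blocked |= neighbours[i] | {i}
    return chosen

def k_means(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
seeds = [points[i] for i in greedy_stable_set(points, threshold=1.0)]
centroids = k_means(points, seeds)
```

In this sketch the number of seeds, and therefore k, is whatever the stable set yields. Because no two seeds are neighbours in the similarity graph, the initial centroids are mutually dissimilar, which is the property the abstract relies on.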

The applying of machine learning methods to improve the quality of well casing

Oil and Gas Studies ◽

10.31660/0445-0108-2020-5-81-93 ◽

2020 ◽

pp. 81-93

Author(s):

D. V. Shalyapin ◽

D. L. Bakirov ◽

M. M. Fattakhov ◽

A. D. Shalyapina ◽

A. V. Melekhov ◽

...

Keyword(s):

Oil And Gas ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

New Approach ◽

Geological Conditions ◽

Expert Assessments ◽

The Impact ◽

The Relationship

The article is devoted to the quality of well casing at the Pyakyakhinskoye oil and gas condensate field. The issue of improving the quality of well casing is associated with many problems, for example, a large amount of work on finding the relationship between laboratory studies and actual data from the field; the difficulty of finding logically determined relationships between the parameters and the final quality of well casing. The text gives valuable information on a new approach to assessing the impact of various parameters, based on a mathematical apparatus that excludes subjective expert assessments, which in the future will allow applying this method to deposits with different rock and geological conditions. We propose using the principles of mathematical processing of large data sets applying neural networks trained to predict the characteristics of the quality of well casing (continuity of contact of cement with the rock and with the casing). Taking into account the previously identified factors, we developed solutions to improve the tightness of the well casing and the adhesion of cement to the limiting surfaces.

Comparison of marker selection methods for high throughput scRNA-seq data

10.1101/679761 ◽

2019 ◽

Author(s):

Anna C. Gilbert ◽

Alexander Vargo

Keyword(s):

Performance Measures ◽

Synthetic Data ◽

Large Data ◽

Ground Truth ◽

Selection Method ◽

Large Data Sets ◽

Data Sets ◽

Selection Methods ◽

Marker Selection

Here, we evaluate the performance of a variety of marker selection methods on scRNA-seq UMI counts data. We test on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. In addition, we propose several performance measures for evaluating the quality of a set of markers when there is no known ground truth. According to these metrics, most existing marker selection methods show similar performance on experimental scRNA-seq data; thus, the speed of the algorithm is the most important consideration for large data sets. With this in mind, we introduce RANKCORR, a fast marker selection method with strong mathematical underpinnings that takes a step towards sensible multi-class marker selection.

Location-Based Collaborative Filtering for Web Service Recommendation

Recent Patents on Computer Science ◽

10.2174/2213275911666181025130059 ◽

2019 ◽

Vol 12 (1) ◽

pp. 34-40

Author(s):

Mareeswari Venkatachalaappaswamy ◽

Vijayan Ramaraj ◽

Saranya Ravichandran

Keyword(s):

Collaborative Filtering ◽

Web Service ◽

Information Filtering ◽

Service Selection ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Service Recommendation ◽

Location Aware

Background: Many modern applications now use information filtering to expose users to a collection of data. Such systems provide users with a list of recommended items they might prefer, or predict the rating users would give the items, so that users can select their preferred items from that list. Objective: In web service recommendation based on Quality of Service (QoS), predicting QoS values greatly helps people select the appropriate web service and discover new services. Methods: An effective technique for this is Collaborative Filtering (CF). CF greatly helps in service selection and web service recommendation. In the more general sense, it is a way of filtering information in large data sets. In the narrower sense, it is the method of making predictions about a user’s interest by collecting taste information from many users. Results: It is easy to build and also much more effective for recommendations, since it predicts missing QoS values for the users. It also addresses the scalability problem, since the recommendations are based on like-minded users found using PCC, or on clusters found using KNN, rather than on large data sources. Conclusion: In this paper, location-aware collaborative filtering is used to recommend the services. The proposed system compares the prediction outcomes and execution time with existing algorithms.
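As a rough illustration of predicting a missing QoS value from like-minded users, the sketch below implements a standard user-based scheme. It assumes PCC stands for the Pearson correlation coefficient, as is usual in this literature. The abstract gives no formulas, so the mean-offset weighting, the function names `pearson` and `predict`, and the sample response times are all hypothetical, and the location-aware step is omitted because the abstract does not describe it.

```python
import math

def pearson(a, b):
    """Pearson correlation over the services that both users have invoked."""
    common = [s for s in a if s in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[s] for s in common) / len(common)
    mean_b = sum(b[s] for s in common) / len(common)
    num = sum((a[s] - mean_a) * (b[s] - mean_b) for s in common)
    den = math.sqrt(sum((a[s] - mean_a) ** 2 for s in common)) * math.sqrt(
        sum((b[s] - mean_b) ** 2 for s in common)
    )
    return num / den if den else 0.0

def predict(qos, user, service):
    """Predict a missing QoS value from positively correlated users."""
    target = qos[user]
    target_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other, values in qos.items():
        if other == user or service not in values:
            continue
        weight = pearson(target, values)
        if weight > 0:  # keep only like-minded users
            other_mean = sum(values.values()) / len(values)
            num += weight * (values[service] - other_mean)
            den += weight
    # Fall back to the user's own mean when no similar user has used the service.
    return target_mean + num / den if den else target_mean

# Hypothetical response times in seconds, keyed by user and then by service.
qos = {
    "u1": {"s1": 1.0, "s2": 2.0},
    "u2": {"s1": 1.0, "s2": 2.0, "s3": 3.0},
    "u3": {"s1": 2.0, "s2": 1.0, "s3": 0.5},
}
predicted = predict(qos, "u1", "s3")
```

Here only `u2` correlates positively with `u1`, so the prediction follows `u2`'s deviation from its own mean; `u3` is negatively correlated and is ignored.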

Clustering techniques and their applications in engineering

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽

10.1243/09544062jmes508 ◽

2007 ◽

Vol 221 (11) ◽

pp. 1445-1459 ◽

Cited By ~ 19

Author(s):

D T Pham ◽

A A Afify

Keyword(s):

Data Mining ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Monitoring And Control ◽

Design Quality ◽

Clustering Problem ◽

Manufacturing System Design ◽

And Control

Clustering is an important data exploration technique with many applications in different areas of engineering, including engineering design, manufacturing system design, quality assurance, production planning and process planning, modelling, monitoring, and control. The clustering problem has been addressed by researchers from many disciplines. However, efforts to perform effective and efficient clustering on large data sets only started in recent years with the emergence of data mining. The current paper presents an overview of clustering algorithms from a data mining perspective. Attention is paid to techniques of scaling up these algorithms to handle large data sets. The paper also describes a number of engineering applications to illustrate the potential of clustering algorithms as a tool for handling complex real-world problems.

Inherent Limitations of Hospital Death Rates to Assess Quality

International Journal of Technology Assessment in Health Care ◽

10.1017/s026646230000074x ◽

1990 ◽

Vol 6 (2) ◽

pp. 220-228 ◽

Cited By ~ 9

Author(s):

Robert W. Dubois

Keyword(s):

Death Rate ◽

Large Data ◽

Potential Method ◽

Large Data Sets ◽

Hospital Death ◽

Data Sets ◽

Death Rates ◽

Rate Study ◽

Assess Quality

Modeling death rates has been suggested as a potential method to screen hospitals and identify superior and substandard providers. This article begins with a review of one hospital death rate study and focuses upon its findings and limitations. It also explores the inherent limitations in the use of large data sets to assess quality of care.

Current Uses of Large Data Sets to Assess the Quality of Providers: Construction of Risk-Adjusted Indexes of Hospital Performance

International Journal of Technology Assessment in Health Care ◽

10.1017/s0266462300000751 ◽

1990 ◽

Vol 6 (2) ◽

pp. 229-238 ◽

Cited By ~ 16

Author(s):

Susan Desharnais

Keyword(s):

Health Policy ◽

Large Data ◽

Patient Discharge ◽

Hospital Performance ◽

Large Data Sets ◽

Data Sets ◽

Policy Changes

This article examines how large data sets can be used for evaluating the effects of health policy changes and for flagging providers with potential quality problems. An example is presented, illustrating how three risk-adjusted measures of hospital performance were developed using patient discharge abstracts. Advantages and disadvantages of this approach are discussed.

DATA MINING KLASTERISASI PENJUALAN ALAT-ALAT BANGUNAN MENGGUNAKAN METODE K-MEANS (STUDI KASUS DI TOKO ADI BANGUNAN)

JURNAL TEKNOLOGI DAN OPEN SOURCE ◽

10.36378/jtos.v1i2.24 ◽

2018 ◽

Vol 1 (2) ◽

pp. 83-91

Author(s):

M. Hasyim Siregar

Keyword(s):

Data Mining ◽

Cost Reduction ◽

Building Materials ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Operational Cost ◽

Use Of Data

In today's competitive business world, companies must continually develop their business in order to survive. Ways to achieve this include improving product quality, adding product types, and reducing operational costs by analysing company data. Data mining is a technology that automates the process of finding interesting and sensitive patterns in large data sets. This supports human understanding of pattern discovery and of scalable techniques. The Adi Bangunan store sells building materials and household goods and operates like a supermarket, in that buyers pick up the goods they want to purchase themselves. Data on sales, purchases, and unexpected reimbursements are not well organized, so the data serve only as an archive for the store and cannot be used to develop a marketing strategy. In this research, data mining is applied using the K-Means process model, which provides a standard process for using data mining in various areas; it is used for classification because the results of this method can be easily understood and interpreted.

The Turker Blues: Hidden Factors Behind Increased Depression Rates Among Amazon’s Mechanical Turkers

Clinical Psychological Science ◽

10.1177/2167702619865973 ◽

2019 ◽

Vol 8 (1) ◽

pp. 65-83 ◽

Cited By ~ 7

Author(s):

Yaakov Ophir ◽

Itay Sisso ◽

Christa S. C. Asterhan ◽

Refael Tikochinski ◽

Roi Reichart

Keyword(s):

Major Depression ◽

General Population ◽

Large Data ◽

Large Data Sets ◽

Population Estimates ◽

Data Sets ◽

Online Platforms ◽

Amazon's Mechanical Turk ◽

Quality Tools

Data collection from online platforms, such as Amazon’s Mechanical Turk (MTurk), has become popular in clinical research. However, there are also concerns about the representativeness and the quality of these data for clinical studies. The present work explores these issues in the specific case of major depression. Analyses of two large data sets gathered from MTurk (Sample 1: N = 2,692; Sample 2: N = 2,354) revealed two major findings: First, failing to screen for inattentive and fake respondents inflates the rates of major depression artificially and significantly (by 18.5%–27.5%). Second, after cleaning the data sets, depression in MTurk is still 1.6 to 3.6 times higher than general population estimates. Approximately half of this difference can be attributed to differences in the composition of MTurk samples and the general population (i.e., sociodemographics, health, and physical activity lifestyle). Several explanations for the other half are proposed, and practical data-quality tools are provided.

Features of the application of mathematical models of tests in the conditions of remote control

Mathematical machines and systems ◽

10.34121/1028-9763-2020-2-105-116 ◽

2020 ◽

Vol 2 ◽

pp. 105-116

Author(s):

N.V. Kruglova ◽

O.O. Dykhovychnyi ◽

I.V. Alekseeva ◽

...

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Test Results ◽

Test Quality ◽

Test Analysis ◽

Objective Control ◽

Object Of Study ◽

Kyiv Polytechnic Institute

The paper explores one of the current issues of distance education: the quality of computer tests in terms of ensuring objective control of knowledge. This issue is especially important in today's pandemic and temporary quarantine. The main attention is paid to a statistical analysis of test quality based on test results using CTT and IRT methods. Using modern statistical methods, the authors analyzed the results of testing prepared and conducted during the quarantine period. As an object of study, a test on “Integration of functions of one variable” was chosen, a topic that students mastered entirely remotely. The tests were created on the MOODLE platform at Igor Sikorsky Kyiv Polytechnic Institute by professors of the Department of Mathematical Analysis and Probability Theory. Data processing is carried out using a system for test analysis, created by the authors in the R programming environment. The system makes it possible to process tests that differ in structure and come from different areas (pedagogy, psychology, sociology, etc.); to use both the CTT and IRT apparatus; to work with large data sets; to analyze not only test questions but also respondents; and to differentiate respondents more accurately. Based on the study, the possibility of conducting electronic testing remotely was confirmed. The technology used in the study can be used to create and analyze external evaluation tasks and to conduct session control during quarantine. The use of the methods studied in the work for the analysis of test tasks will increase the competence of high school teachers in conducting electronic remote testing.
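To give a feel for the CTT side of such an analysis, the sketch below computes two classical item statistics: difficulty (the proportion of correct answers) and discrimination (the correlation between an item and the rest of the test). It is an illustration only and not the authors' system, which the abstract says was written in R; the function names and the 0/1 response matrix are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns None when either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def item_statistics(responses):
    """responses: one list of 0/1 item scores per respondent."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        # Rest score: total score with the item itself removed.
        rest = [total - score for total, score in zip(totals, item)]
        difficulty = sum(item) / len(item)
        discrimination = pearson(item, rest)
        stats.append((difficulty, discrimination))
    return stats

# Hypothetical answers of four respondents to three test items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
stats = item_statistics(responses)
```

An item with difficulty close to 0 or 1 tells little about the respondents, and a low or negative discrimination suggests the item does not separate stronger from weaker respondents.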

An example of spectrum imaging used for comparison of EELS quantitative analysis techniques on Al-Li

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s042482010008794x ◽

1991 ◽

Vol 49 ◽

pp. 726-727

Author(s):

John A. Hunt

Keyword(s):

Quantitative Analysis ◽

Large Data ◽

Difference Spectrum ◽

Large Data Sets ◽

Foil Thickness ◽

Data Sets ◽

Analysis Techniques ◽

Spectrum Imaging ◽

Normal Spectrum ◽

Electron Energy Loss

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets which are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10at%Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing, the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
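The two processing routes described here can be sketched on plain Python lists. Several details are assumptions rather than facts from the abstract: the direction of the energy offset, the idea that the artifact is an additive per-channel pattern, and the 2 bytes per channel used in the storage check. The function names are hypothetical.

```python
CHANNELS = 1024          # channels per spectrum (from the abstract)
CHANNELS_PER_EV = 20     # dispersion (from the abstract)
OFFSET_EV = 1            # energy offset between the two spectra (from the abstract)
SHIFT = CHANNELS_PER_EV * OFFSET_EV  # the offset expressed in channels

def difference_spectrum(first, second):
    """Channel-by-channel subtraction of the two offset acquisitions."""
    return [a - b for a, b in zip(first, second)]

def normal_spectrum(first, second, shift=SHIFT):
    """Undo the energy offset numerically, then add the overlapping channels."""
    # Assumption: channel i of `second` sees the same energy as channel i + shift of `first`.
    return [first[i + shift] + second[i] for i in range(len(first) - shift)]

def signal(channel):
    """Synthetic stand-in for the true spectrum, indexed by absolute channel."""
    return (channel * channel) % 97

# Assumption: a fixed additive detector pattern tied to the channel number.
pattern = [i % 5 for i in range(CHANNELS)]
first = [signal(i) + pattern[i] for i in range(CHANNELS)]
second = [signal(i + SHIFT) + pattern[i] for i in range(CHANNELS)]

diff = difference_spectrum(first, second)
total = normal_spectrum(first, second)

# Storage check, assuming 2 bytes per channel: 80 x 80 pixels, 2 spectra each.
size_bytes = 80 * 80 * 2 * CHANNELS * 2
```

Under these assumptions the fixed pattern cancels exactly in `diff`, while `total` doubles the signal on the channels where the two spectra overlap. The storage figure comes to 25 * 2**20 bytes, which agrees with the 25 Mbytes quoted if each channel is stored in 16 bits.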
