Hierarchical Clustering Using Evolutionary Algorithms

Author(s):  
Monica Chis

Clustering is an important technique for discovering inherent structure in data. The purpose of cluster analysis is to partition a given data set into a number of groups such that data points in a particular cluster are more similar to each other than to points in different clusters. Hierarchical clustering refers to the formation of a recursive clustering of the data points: a partition into many clusters, each of which is itself hierarchically clustered. Hierarchical structures solve many problems across a wide range of application areas. In this paper a new evolutionary algorithm for detecting the hierarchical structure of an input data set is proposed. Solving this problem can be useful in economics, market segmentation, management, biological taxonomy, and other domains. A new linear representation of the cluster structure within the data set is proposed, and an evolutionary algorithm evolves a population of clustering hierarchies, using mutation and crossover as (search) variation operators. The final goal is a data clustering representation that allows a hierarchical clustering structure to be found quickly.
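
The abstract leaves the encoding and operators unspecified, so the following is only a minimal sketch of the general approach, not the author's method: it encodes a two-level hierarchy in a single linear chromosome (a hypothetical stand-in for the proposed linear representation) and evolves a population with elitist selection, one-point crossover, and point mutation, scoring candidates by within-cluster scatter at both levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 40 points in 4 blobs (stand-in for a real data set).
X = np.vstack([rng.normal(c, 0.3, size=(10, 2))
               for c in [(0, 0), (0, 3), (3, 0), (3, 3)]])
N, K_FINE, K_COARSE = len(X), 4, 2

def fitness(chrom):
    """Lower is better: within-cluster scatter at the fine level plus
    scatter of the fine-cluster centroids at the coarse level (a
    two-level hierarchy encoded as one linear chromosome)."""
    fine, coarse = chrom[:N], chrom[N:]
    cost, cents = 0.0, np.zeros((K_FINE, X.shape[1]))
    for k in range(K_FINE):
        pts = X[fine == k]
        if len(pts) == 0:
            return np.inf                     # penalise empty clusters
        cents[k] = pts.mean(axis=0)
        cost += ((pts - cents[k]) ** 2).sum()
    for k in range(K_COARSE):
        grp = cents[coarse == k]
        if len(grp):
            cost += ((grp - grp.mean(axis=0)) ** 2).sum()
    return cost

def random_chrom():
    return np.concatenate([rng.integers(0, K_FINE, N),
                           rng.integers(0, K_COARSE, K_FINE)])

def mutate(c):
    c = c.copy()
    i = rng.integers(len(c))                  # point mutation: relabel one gene
    c[i] = rng.integers(0, K_FINE if i < N else K_COARSE)
    return c

def crossover(a, b):
    cut = rng.integers(1, len(a))             # one-point crossover
    return np.concatenate([a[:cut], b[cut:]])

pop = [random_chrom() for _ in range(60)]
for gen in range(200):
    pop.sort(key=fitness)                     # elitist selection
    elite = pop[:20]
    pop = elite + [mutate(crossover(elite[rng.integers(20)],
                                    elite[rng.integers(20)]))
                   for _ in range(40)]
best = min(pop, key=fitness)
print("best fitness:", round(fitness(best), 2))
```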

2013 ◽  
Vol 12 (3-4) ◽  
pp. 291-307 ◽  
Author(s):  
Ilir Jusufi ◽  
Andreas Kerren ◽  
Falk Schreiber

Ontologies and hierarchical clustering are both important tools in biology and medicine to study high-throughput data such as transcriptomics and metabolomics data. Enrichment of ontology terms in the data is used to identify statistically overrepresented ontology terms, giving insight into relevant biological processes or functional modules. Hierarchical clustering is a standard method to analyze and visualize data to find relatively homogeneous clusters of experimental data points. Both methods support the analysis of the same data set but are usually considered independently. However, often a combined view is desired: visualizing a large data set in the context of an ontology under consideration of a clustering of the data. This article proposes new visualization methods for this task. They allow for interactive selection and navigation to explore the data under consideration as well as visual analysis of mappings between ontology- and cluster-based space-filling representations. In this context, we discuss our approach together with specific properties of the biological input data and identify features that make our approach easily usable for domain experts.
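
The enrichment step mentioned here is typically computed with a one-sided hypergeometric test; the short sketch below shows that standard calculation. The numbers are hypothetical and the test choice is an assumption, since the article does not specify one.

```python
from scipy.stats import hypergeom

# Hypothetical numbers: a universe of 12000 annotated genes, 300 of which
# carry a given ontology term; a cluster of 150 genes contains 18 of them.
M, n, N, k = 12000, 300, 150, 18

# P(X >= k): probability of seeing at least k term members in the cluster
# by chance -- the usual one-sided overrepresentation p-value.
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p_value:.3g}")
```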


2015 ◽  
Vol 14 ◽  
pp. CIN.S22080 ◽  
Author(s):  
Askar Obulkasim ◽  
Mark A. Van De Wiel

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that haunt high-dimensional genomics data. Since the clustering process is guided by the background information, the clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters by piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic and can be combined with other algorithms that operate on detected clusters.
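
For readers unfamiliar with tree cutting, the sketch below contrasts the conventional fixed-height cut with a crude, outcome-guided choice of cluster number on toy data. It only illustrates the problem HCsnip addresses; the package's own snipping and semi-supervision are more principled than this stand-in.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Toy stand-in for a high-dimensional genomics matrix (samples x features).
X = np.vstack([rng.normal(mu, 1.0, size=(20, 50)) for mu in (0.0, 2.0, 4.0)])

Z = linkage(X, method="average")              # build the HC tree

# The conventional fixed-height cut criticised in the abstract: every
# contiguous branch below the chosen level becomes one cluster.
labels_fixed = fcluster(Z, t=3, criterion="maxclust")

# A crude stand-in for guided snipping: compare cuts yielding different
# cluster counts and keep the one that best separates simulated
# follow-up information (the guidance signal).
outcome = np.repeat([1.0, 2.0, 3.0], 20) + rng.normal(0, 0.5, 60)

def score(labels):
    # between-group variance of the outcome, as a toy guidance criterion
    return np.var([outcome[labels == c].mean() for c in np.unique(labels)])

best_k = max(range(2, 7),
             key=lambda k: score(fcluster(Z, t=k, criterion="maxclust")))
print("guided choice of cluster number:", best_k)
```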


Author(s):  
Zalán Bodó ◽  
Lehel Csató

Recently, semi-supervised methods have gained increasing attention, and many novel semi-supervised learning algorithms have been proposed. These methods exploit the information contained in the usually large unlabeled data set in order to improve classification or generalization performance. Using data-dependent kernels for kernel machines, one can build semi-supervised classifiers by constructing the kernel in such a way that feature-space dot products incorporate the structure of the data set. In this paper we propose two such methods: one using a specific hierarchical clustering, and another that reweights an arbitrary base kernel to take the cluster structure of the data into account.
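
The paper's exact constructions are not given in the abstract; the sketch below shows one common way to reweight a base kernel by cluster structure (boosting within-cluster dot products), and should be read as an illustration of the idea rather than the authors' formulation. The result stays positive semidefinite because the same-cluster indicator is itself a valid kernel, and kernels are closed under Hadamard products and sums.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
# Labeled and unlabeled points together (two blobs as toy data).
X = np.vstack([rng.normal(c, 0.4, size=(25, 2)) for c in ((0, 0), (3, 3))])

# Base kernel on all available data.
K = rbf_kernel(X, gamma=0.5)

# Cluster structure from hierarchical clustering of the whole data set.
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

# Reweight: inflate dot products between points in the same cluster.
# gamma_c is a tuning parameter, chosen here arbitrarily.
gamma_c = 1.0
same = (labels[:, None] == labels[None, :]).astype(float)
K_semi = K * (1.0 + gamma_c * same)
print(K_semi.shape)
```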


2019 ◽  
Vol 14 (2) ◽  
pp. 148-156
Author(s):  
Nighat Noureen ◽  
Sahar Fazal ◽  
Muhammad Abdul Qadir ◽  
Muhammad Tanvir Afzal

Background: Specific combinations of Histone Modifications (HMs), contributing to the histone code hypothesis, lead to various biological functions. HM combinations have been utilized by various studies to divide the genome into different regions, which have been classified as chromatin states. Mostly, Hidden Markov Model (HMM)-based techniques have been utilized for this purpose. In chromatin studies, data from Next Generation Sequencing (NGS) platforms is used. Chromatin states based on histone modification combinatorics are annotated by mapping them to functional regions of the genome. To date, the number of states predicted by the HMM tools has been justified only biologically. Objective: The present study aimed at providing a computational scheme to identify the underlying hidden states in the data under consideration. Methods: We proposed a computational scheme, HCVS, based on hierarchical clustering and a visualization strategy in order to achieve the objective of the study. Results: We tested our proposed scheme on a real data set of nine cell types comprising nine chromatin marks. The approach successfully identified the state numbers for various possibilities. The results were also compared with one of the existing models and showed quite good correlation. Conclusion: The HCVS model not only helps in deciding the optimal state number for a particular data set but also justifies the results biologically, thereby correlating the computational and biological aspects.
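
HCVS itself is not described in detail in the abstract; the sketch below only illustrates the general idea of using hierarchical clustering to probe candidate state numbers on binary histone-mark vectors. All data are simulated, and the elbow heuristic is an assumption standing in for the paper's visualization strategy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Toy binary matrix: genomic bins x 9 histone marks (presence/absence),
# generated from 5 hidden "states" plus label noise.
patterns = rng.integers(0, 2, size=(5, 9))
bins = patterns[rng.integers(0, 5, 600)] ^ (rng.random((600, 9)) < 0.05)

Z = linkage(bins, method="ward")
# Watch how within-cluster dispersion drops as the candidate state
# number grows; an elbow suggests the underlying number of states.
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion="maxclust")
    wss = sum(((bins[labels == c] - bins[labels == c].mean(0)) ** 2).sum()
              for c in np.unique(labels))
    print(k, round(float(wss), 1))
```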


Author(s):  
Simona Babiceanu ◽  
Sanhita Lahiri ◽  
Mena Lockwood

This study uses a suite of performance measures that was developed by taking into consideration various aspects of congestion and reliability, to assess impacts of safety projects on congestion. Safety projects are necessary to help move Virginia’s roadways toward safer operation, but can contribute to congestion and unreliability during execution, and can affect operations after execution. However, safety projects are assessed primarily for safety improvements, not for congestion. This study identifies an appropriate suite of measures, and quantifies and compares the congestion and reliability impacts of safety projects on roadways for the periods before, during, and after project execution. The paper presents the performance measures, examines their sensitivity based on operating conditions, defines thresholds for congestion and reliability, and demonstrates the measures using a set of Virginia safety projects. The data set consists of 10 projects totalling 92 mi and more than 1M data points. The study found that, overall, safety projects tended to have a positive impact on congestion and reliability after completion, and the congestion variability measures were sensitive to the threshold of reliability. The study concludes with practical recommendations for primary measures that may be used to measure overall impacts of safety projects: percent vehicle miles traveled (VMT) reliable with a customized threshold for Virginia; percent VMT delayed; and time to travel 10 mi. However, caution should be used when applying the results directly to other situations, because of the limited number of projects used in the study.
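
As a concrete reading of two of the recommended measures, the sketch below computes percent VMT reliable and percent VMT delayed from segment-level observations. The travel-time-index thresholds are illustrative placeholders, not the customized Virginia values, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 5-minute segment observations for one corridor:
# VMT and travel time index (observed travel time / free-flow time).
vmt = rng.uniform(50, 500, 1000)
tti = rng.lognormal(mean=0.1, sigma=0.2, size=1000)

TTI_RELIABLE = 1.5   # illustrative reliability threshold
TTI_DELAYED = 1.3    # illustrative congestion threshold

pct_vmt_reliable = vmt[tti <= TTI_RELIABLE].sum() / vmt.sum() * 100
pct_vmt_delayed = vmt[tti > TTI_DELAYED].sum() / vmt.sum() * 100
print(f"% VMT reliable: {pct_vmt_reliable:.1f}, "
      f"% VMT delayed: {pct_vmt_delayed:.1f}")
```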


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 37
Author(s):  
Shixun Wang ◽  
Qiang Chen

Boosting of ensemble learning models has made great progress, but most methods boost a single modality. For this reason, a simple multiclass Boosting framework that uses local similarity as a weak learner is extended here to multimodal multiclass Boosting. First, with local similarity as the weak learner, the loss function is used to compute a baseline loss, and the logarithmic data points are binarized. Then, the optimal local similarity is found together with its corresponding loss; whichever loss is smaller than the baseline is the best so far. Second, the local similarity of two points is calculated, and the loss is then computed from that local similarity. Finally, text and images are retrieved from each other, and the retrieval accuracy for text and for images is obtained. The experimental results, evaluated on standard data sets and compared against other state-of-the-art methods, demonstrate the effectiveness of the multimodal multiclass Boosting framework with local similarity as the weak learner.
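
The abstract's description is loose, so the following is a speculative sketch of one multimodal boosting loop with a similarity-based weak learner per modality, keeping whichever modality's learner has the smaller weighted error in each round. Every modeling choice here (prototype similarity, AdaBoost-style weights, synthetic paired features) is an assumption for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy paired modalities (e.g., text and image features) sharing labels.
n = 200
y = rng.integers(0, 2, n) * 2 - 1                 # labels in {-1, +1}
text = y[:, None] * 0.8 + rng.normal(0, 1, (n, 5))
image = y[:, None] * 0.6 + rng.normal(0, 1, (n, 5))

def weak_learner(Xmod, weights):
    """Similarity stump: classify by similarity to weighted class
    prototypes of one modality (a loose reading of 'local similarity
    as a weak learner')."""
    proto = {c: (weights[y == c][:, None] * Xmod[y == c]).sum(0)
                / weights[y == c].sum() for c in (-1, 1)}
    return lambda X: np.where(X @ proto[1] > X @ proto[-1], 1, -1)

w = np.ones(n) / n
F = np.zeros(n)
for t in range(10):
    # One candidate learner per modality; keep the smaller weighted
    # error -- the 'smaller one is the best so far' step.
    cands = [weak_learner(m, w) for m in (text, image)]
    errs = [np.clip(w[h(m) != y].sum(), 1e-12, 1 - 1e-12)
            for h, m in zip(cands, (text, image))]
    i = int(np.argmin(errs))
    h, Xmod, err = cands[i], (text, image)[i], errs[i]
    alpha = 0.5 * np.log((1 - err) / err)         # AdaBoost-style weight
    F += alpha * h(Xmod)
    w *= np.exp(-alpha * y * h(Xmod))
    w /= w.sum()
print("training accuracy:", (np.sign(F) == y).mean())
```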


Author(s):  
Manfred Ehresmann ◽  
Georg Herdrich ◽  
Stefanos Fasoulas

Abstract: In this paper, a generic full-system estimation software tool is introduced and applied to a data set of actual flight missions to derive a heuristic for system composition for mass and power ratios of the considered sub-systems. The capability of evolutionary algorithms to analyse and effectively design spacecraft (sub-)systems is shown. After deriving top-level estimates for each spacecraft sub-system based on heuristic heritage data, a detailed component-based system analysis follows. Various degrees of freedom exist for a hardware-based sub-system design; these are resolved via an evolutionary algorithm to determine an optimal system configuration. A propulsion system implementation for a small satellite test case serves as a reference example of the implemented algorithm's application. The propulsion system includes thruster, power processing unit, tank, propellant, and general power supply system masses and power consumptions. Relevant performance parameters such as desired thrust, effective exhaust velocity, utilised propellant, and the propulsion type are considered as degrees of freedom. An evolutionary algorithm is applied to the propulsion system scaling model to demonstrate that such algorithms are capable of bypassing complex multidimensional design optimisation problems. An evolutionary algorithm uses a heuristic to change input parameters and a defined selection criterion (e.g., the mass fraction of the system) on an optimisation function to refine solutions successively. With sufficient generations and, thereby, iterations of design points, local optima are determined. Using mitigation methods and a sufficient number of seed points, a globally optimal system configuration can be found.
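
The definition in the abstract maps directly onto a basic optimisation loop. The sketch below evolves two degrees of freedom (thrust and effective exhaust velocity) against a deliberately toy propulsion-mass model with total mass as the selection criterion; the scaling laws and constants are invented for illustration and do not reproduce the paper's model.

```python
import numpy as np

rng = np.random.default_rng(6)

M_DRY, DELTA_V = 50.0, 200.0      # kg dry mass, m/s delta-v (toy case)

def system_mass(x):
    """Toy propulsion mass model (not the paper's): propellant from the
    rocket equation plus a crude power-system scaling law."""
    thrust, v_e = x                                    # N, m/s
    m_prop = M_DRY * (np.exp(DELTA_V / v_e) - 1.0)     # Tsiolkovsky
    m_power = 2e-4 * (thrust * v_e / 2.0)              # ~0.2 kg per kW jet power
    m_tank = 0.1 * m_prop                              # tankage fraction
    return m_prop + m_power + m_tank

BOUNDS = np.array([[0.001, 1.0], [500.0, 20000.0]])    # thrust [N], v_e [m/s]

def clip(x):
    return np.clip(x, BOUNDS[:, 0], BOUNDS[:, 1])

pop = [clip(rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1])) for _ in range(30)]
for gen in range(150):
    pop.sort(key=system_mass)                  # selection criterion: mass
    parents = pop[:10]
    children = []
    for _ in range(20):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        child = (a + b) / 2.0                  # crossover: blend parents
        child = clip(child * rng.normal(1.0, 0.1, 2))  # mutation: jitter
        children.append(child)
    pop = parents + children
best = min(pop, key=system_mass)
print(f"thrust {best[0]:.3f} N, v_e {best[1]:.0f} m/s, "
      f"system mass {system_mass(best):.1f} kg")
```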


1998 ◽  
Vol 16 (1) ◽  
pp. 59-70 ◽  
Author(s):  
Tsion Avital ◽  
Gerald C. Cupchik

A series of four experiments were conducted to examine viewer perceptions of three sets of five nonrepresentational paintings. Increased complexity was embedded in the hierarchical structure of each set by carefully selecting colors and ordering them in each successive painting according to certain rules of transformation which created hierarchies. Experiment 1 supported the hypothesis that subjects would discern the hierarchical complexity underlying the sets of paintings. In Experiment 2 viewers rated the paintings on collative (complexity, disorder) and affective (pleasing, interesting, tension, and power) scales, and a factor analysis revealed that affective ratings were tied to complexity (Factor 1) but not to disorder (Factor 2). In Experiment 3, a measure of exploratory activity (free looking time) was correlated with complexity (Factor 1) but not with disorder (Factor 2). Multidimensional scaling was used in Experiment 4 to examine perceptions of the paintings seen in pairs. Dimension 1 contrasted Soft with Hard-Edged paintings, while Dimension 2 reflected the relative separation of figure from ground in these paintings. Together these results show that untrained viewers can discern hierarchical complexity in paintings and that this quality stimulates affective responses and exploratory activity.


2021 ◽  
Author(s):  
Ahmed Al-Sabaa ◽  
Hany Gamal ◽  
Salaheldin Elkatatny

Abstract: The formation porosity of drilled rock is an important parameter that determines the formation storage capacity. The common industrial technique for rock porosity acquisition is the downhole logging tool. Usually, logging while drilling or wireline porosity logging provides a complete porosity log for the section of interest; however, operational constraints for the logging tool might preclude the logging job, in addition to the job cost. The objective of this study is to provide an intelligent prediction model to predict porosity from the drilling parameters. An artificial neural network (ANN), a tool of artificial intelligence (AI), was employed in this study to build the porosity prediction model based on drilling parameters such as the weight on bit (WOB), drill string rotating speed (RS), drilling torque (T), standpipe pressure (SPP), and mud pumping rate (Q). The novel contribution of this study is to provide a rock porosity model for complex lithology formations using drilling parameters in real time. The model was built using 2,700 data points from well (A) with a 74:26 training-to-testing ratio. Many sensitivity analyses were performed to optimize the ANN model. The model was validated using an unseen data set (1,000 data points) from well (B), which is located in the same field and drilled across the same complex lithology. The results showed high performance for the model in the training, testing, and validation processes. The overall accuracy of the model was determined in terms of the correlation coefficient (R) and average absolute percentage error (AAPE). Overall, R was higher than 0.91 and AAPE was less than 6.1% for model building and validation. Predicting rock porosity while drilling in real time will save logging costs and, in addition, will provide a guide for the formation storage capacity and interpretation analysis.
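
To make the workflow concrete, here is a minimal sketch of an ANN regression with the same five inputs, the 74:26 split, and the two reported metrics. The data are synthetic and the network size is arbitrary, so this reproduces the procedure, not the paper's model or results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic stand-in for the field data: WOB, RS, T, SPP, Q -> porosity.
n = 2700
Xd = rng.uniform([5, 50, 2, 1000, 200], [35, 180, 15, 4000, 900], (n, 5))
poro = (0.3 * Xd[:, 0] - 0.05 * Xd[:, 1] + 1.5 * Xd[:, 2]
        + 0.002 * Xd[:, 3] + 0.01 * Xd[:, 4])
poro = 5 + 20 * (poro - poro.min()) / np.ptp(poro)   # scale to 5-25 %
poro += rng.normal(0, 0.5, n)                        # measurement noise

# 74:26 split, mirroring the paper's training-to-testing ratio.
X_tr, X_te, y_tr, y_te = train_test_split(Xd, poro, test_size=0.26,
                                          random_state=0)

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(20, 20),
                                   max_iter=2000, random_state=0))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

R = np.corrcoef(y_te, pred)[0, 1]                    # correlation coefficient
AAPE = np.mean(np.abs((y_te - pred) / y_te)) * 100   # avg absolute % error
print(f"R = {R:.3f}, AAPE = {AAPE:.2f} %")
```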


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data values of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object having the largest support is chosen as the first initial center, and the remaining centers are found among the objects at the greatest distance from the previously selected centers. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
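
Read literally, the selection procedure can be sketched in a few lines. The Hamming distance used below for "greatest distance" and the toy table are assumptions for illustration; the paper may define distance differently.

```python
import numpy as np

# Toy categorical data set (rows = objects, columns = attributes).
X = np.array([
    ["red",   "small", "yes"],
    ["red",   "large", "yes"],
    ["blue",  "small", "no"],
    ["red",   "small", "yes"],
    ["green", "large", "no"],
    ["blue",  "small", "yes"],
])
k = 2

# Support of a value = its relative frequency within its attribute;
# the support of a row is the sum of its values' supports.
n, m = X.shape
row_support = np.zeros(n)
for j in range(m):
    vals, counts = np.unique(X[:, j], return_counts=True)
    freq = dict(zip(vals, counts / n))
    row_support += np.array([freq[v] for v in X[:, j]])

centers = [int(np.argmax(row_support))]      # largest-support row first
while len(centers) < k:
    # Hamming distance to the nearest chosen center; take the farthest row.
    d = np.array([min((X[i] != X[c]).sum() for c in centers)
                  for i in range(n)])
    d[centers] = -1                           # never re-pick a center
    centers.append(int(np.argmax(d)))
print("initial centers (row indices):", centers)
```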

