Whole-Proteome Tree of Insects: An Information-theory-based “alignment-free” phylogeny and grouping of “proteome books”

2020
Author(s):
JaeJin Choi
Byung-Ju Kim
Sung-Hou Kim

Abstract Background: An “organism tree” of insects, the largest and most species-diverse group of all living animals, can be considered as a metaphorical and conceptual tree to capture a simplified narrative of the complex and unpredictable evolutionary courses of the extant insects. Currently, the most common approach has been to construct a “gene tree”, as a surrogate for the organism tree, by selecting a group of highly alignable regions of each of the select genes/proteins to represent each organism. However, such selected regions account for a small fraction of all genes/proteins and an even smaller fraction of the whole genome of an organism. During the last decades, whole-genome sequences of many extant insects became available, providing an opportunity to construct a “whole-genome or whole-proteome tree” of insects using Information Theory without sequence alignment (alignment-free method). Results: A whole-proteome tree of the insects shows that (a) the demographic grouping-pattern is similar to those in the gene trees, but there are notable differences in the branching orders of the groups and, thus, in the sisterhood relationships between pairs of the groups; and (b) all the founders of the major groups have emerged in an “explosive burst” near the root of the tree. Conclusion: Since the whole-proteome sequence of an organism can be considered as a “book” of amino-acid alphabets, a tree of the books can be constructed, without alignment of sequences, using a text analysis method of Information Theory. Such a tree provides an alternative viewpoint for constructing a narrative of evolution and kinship among the extant insects.

2020
Author(s):
Sung-Hou Kim
JaeJin Choi
Byung-Ju Kim

Abstract Background: An “organism tree” of a group of extant organisms can be considered as a conceptual tree to capture a simplified narrative of the evolutionary course among the organisms. Due to the difficulties of whole-genome sequencing for many organisms, the most common approach has been to construct a “gene tree” by selecting a group of genes common among the organisms, aligning each gene family, and estimating evolutionary distances. Despite broad acceptance of the gene trees as the surrogates for the organism trees, there are important limitations and confounding issues with the approach. During the last decades, whole-genome sequences of many extant arthropods became available, providing an opportunity to construct a “whole-proteome tree” of the arthropods, the largest and most species-diverse group of all living animals. Results: An “alignment-free” whole-proteome tree of the arthropods shows that (a) the demographic grouping-pattern is similar to those in the gene trees, but there are notable differences in the branching orders of the groups and the sisterhood relationships between pairs of the groups; and (b) almost all the “founders” of the groups have emerged in an “explosive burst” near the root of the tree. Conclusion: Since the whole-proteome sequence of an organism can be considered as a “book” of amino-acid alphabets, a tree of the books can be constructed, without alignment of sequences, using a text analysis method of Information Theory, which allows comparing the information content of whole proteomes. Such a tree provides another viewpoint to consider in telling the narrative of kinship among the arthropods.


2020
Vol 117 (7)
pp. 3678-3686
Author(s):
JaeJin Choi
Sung-Hou Kim

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL—an explosive birth of life’s diversity.


Author(s):  
Hsin-Hsiung Huang
Senthil Balaji Girimurugan

Abstract In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
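The pairing of alignment-free distances with hierarchical clustering mentioned in this abstract can be illustrated with a short sketch. The abstract does not say which clustering method is meant; UPGMA (average linkage) is assumed here purely as one common choice, and the function name `upgma` and the toy distance matrix are hypothetical.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering of a symmetric distance matrix.

    Assumption: UPGMA stands in for "hierarchical clustering"; the source
    abstract does not name a specific method. Returns the tree topology as
    nested tuples of leaf names (branch lengths are not tracked).
    """
    # Each cluster id maps to (subtree, number of leaves in that subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)              # closest pair of clusters
        d.pop((a, b))
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted mean keeps the distance an average over all leaves.
            d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy example: A and B are close to each other, C is distant from both.
toy_tree = upgma(["A", "B", "C"], [[0, 1, 4], [1, 0, 4], [4, 4, 0]])
```

In this toy case A and B are joined first and C attaches last. In practice the matrix would hold pairwise alignment-free distances between genomes.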


2017
Author(s):
Matthew Parks
Teofil Nakov
Elizabeth Ruck
Norman J. Wickett
Andrew J. Alverson

Abstract Premise of the study: Diatoms are one of the most species-rich lineages of microbial eukaryotes. Similarities in clade age, species richness, and contributions to primary production motivate comparisons to flowering plants, whose genomes have been inordinately shaped by whole genome duplication (WGD). These events have been linked to speciation and increased rates of lineage diversification, identifying WGDs as a principal driver of angiosperm evolution. We synthesized a relatively large but scattered body of evidence that, taken together, suggests that polyploidy may be common in diatoms. Methods: We used data from gene counts, gene trees, and patterns of synonymous divergence to carry out the first large-scale phylogenomic analysis of genome-scale duplication histories for a phylogenetically diverse set of 37 diatom taxa. Key results: Several methods identified WGD events of varying age across diatoms, though determining the exact number and placement of events and, more broadly, inferences of WGD at all, were greatly impacted by gene-tree uncertainty. Gene-tree reconciliations supported allopolyploidy as the predominant mode of polyploid formation, with particularly strong evidence for ancient allopolyploid events in the thalassiosiroid and pennate diatom clades. Conclusions: Whole genome duplication appears to have been an important driver of genome evolution in diatoms. Denser taxon sampling will better pinpoint the timing of WGDs and likely reveal many more of them. We outline potential challenges in reconstructing paleopolyploid events in diatoms that, together with these results, offer a framework for understanding the evolutionary roles of genome duplication in a group that likely harbors substantial genomic diversity.


2019
Vol 68 (6)
pp. 937-955
Author(s):
Alison Cloutier
Timothy B Sackton
Phil Grayson
Michele Clamp
Allan J Baker
...

Abstract Palaeognathae represent one of the two basal lineages in modern birds, and comprise the volant (flighted) tinamous and the flightless ratites. Resolving palaeognath phylogenetic relationships has historically proved difficult, and short internal branches separating major palaeognath lineages in previous molecular phylogenies suggest that extensive incomplete lineage sorting (ILS) might have accompanied a rapid ancient divergence. Here, we investigate palaeognath relationships using genome-wide data sets of three types of noncoding nuclear markers, together totaling 20,850 loci and over 41 million base pairs of aligned sequence data. We recover a fully resolved topology placing rheas as the sister to kiwi and emu + cassowary that is congruent across marker types for two species tree methods (MP-EST and ASTRAL-II). This topology is corroborated by patterns of insertions for 4274 CR1 retroelements identified from multispecies whole-genome screening, and is robustly supported by phylogenomic subsampling analyses, with MP-EST demonstrating particularly consistent performance across subsampling replicates as compared to ASTRAL. In contrast, analyses of concatenated data supermatrices recover rheas as the sister to all other nonostrich palaeognaths, an alternative that lacks retroelement support and shows inconsistent behavior under subsampling approaches. While statistically supporting the species tree topology, conflicting patterns of retroelement insertions also occur and imply high amounts of ILS across short successive internal branches, consistent with observed patterns of gene tree heterogeneity. 
Coalescent simulations and topology tests indicate that the majority of observed topological incongruence among gene trees is consistent with coalescent variation rather than arising from gene tree estimation error alone, and estimated branch lengths for short successive internodes in the inferred species tree fall within the theoretical range encompassing the anomaly zone. Distributions of empirical gene trees confirm that the most common gene tree topology for each marker type differs from the species tree, signifying the existence of an empirical anomaly zone in palaeognaths.


PeerJ
2018
Vol 6
pp. e5254
Author(s):
James M. Wainaina
Elijah Ateka
Timothy Makori
Monica A. Kehoe
Laura M. Boykin

Sweet potato is a major food security crop within sub-Saharan Africa, where 90% of Africa's production occurs. One of the major limitations of sweet potato production is viral infection. In this study, we used a combination of whole genome sequences from a field isolate obtained from Kenya and those available in GenBank. Sequences of four sweet potato viruses, Sweet potato feathery mottle virus (SPFMV), Sweet potato virus C (SPVC), Sweet potato chlorotic stunt virus (SPCSV), and Sweet potato chlorotic fleck virus (SPCFV), were obtained from the Kenyan sample. SPFMV sequences both from this study and from GenBank were found to be recombinant. Recombination breakpoints were found within the NIa-Pro, coat protein and P1 genes. The SPCSV, SPVC, and SPCFV viruses from this study were non-recombinant. Bayesian phylogenomic relationships across whole genome trees showed variation in the number of well-supported clades; within SPCSV (RNA1 and RNA2) and SPFMV two well-supported clades (I and II) were resolved. The SPCFV tree resolved three well-supported clades (I–III) while four well-supported clades were resolved in SPVC (I–IV). Similar clades were resolved within the coalescent species trees. However, there were disagreements between the clades resolved in the gene trees and those from the whole genome tree and coalescent species trees. The coat protein gene trees of SPCSV and SPCFV resolved clades similar to those of the genome and coalescent species trees, while this was not the case in SPFMV and SPVC. In addition, we report variation in selective pressure within sites of individual genes across all four viruses; overall, all viruses were under purifying selection. We report the first complete genomes of SPFMV, SPVC, SPCFV, and a partial SPCSV from Kenya as a mixed infection in one sample. Our findings provide a snapshot of the evolutionary relationships of sweet potato viruses (SPFMV, SPVC, SPCFV, and SPCSV) from Kenya and assess whether selection pressure has an effect on their evolution.


2017
Vol 114 (35)
pp. 9391-9396
Author(s):
JaeJin Choi
Sung-Hou Kim

Fungi belong to one of the largest and most diverse kingdoms of living organisms. The evolutionary kinship within a fungal population has so far been inferred mostly from the gene-information–based trees (“gene trees”), constructed commonly based on the degree of differences of proteins or DNA sequences of a small number of highly conserved genes common among the population by a multiple sequence alignment (MSA) method. Since each gene evolves under different evolutionary pressure and time scale, it has been known that one gene tree for a population may differ from other gene trees for the same population depending on the subjective selection of the genes. Within the last decade, a large number of whole-genome sequences of fungi have become publicly available, which represent, at present, the most fundamental and complete information about each fungal organism. This presents an opportunity to infer kinship among fungi using a whole-genome information-based tree (“genome tree”). The method we used allows comparison of whole-genome information without MSA, and is a variation of a computational algorithm developed to find semantic similarities or plagiarism in two books, where we represent whole-genomic information of an organism as a book of words without spaces. The genome tree reveals several significant and notable differences from the gene trees, and these differences invoke new discussions about alternative narratives for the evolution of some of the currently accepted fungal groups.
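The "book of words without spaces" comparison described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each "word" is an overlapping k-mer, that each "book" is summarized by its relative k-mer frequencies, and that two books are compared with the Jensen-Shannon divergence. The abstract itself does not state the word length or the distance measure, and the function names `ffp` and `jensen_shannon` and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def ffp(sequence: str, k: int) -> dict:
    """Relative frequencies of all overlapping k-mers ("words") in a sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits between two frequency profiles.

    Assumption: this divergence is used as the distance between two "books";
    it ranges from 0 (identical profiles) to 1 (no shared words).
    """
    def kl_to_midpoint(a, b):
        # Kullback-Leibler divergence of a from the midpoint (a + b) / 2.
        return sum(pa * log2(pa / (0.5 * (pa + b.get(word, 0.0))))
                   for word, pa in a.items())
    return 0.5 * kl_to_midpoint(p, q) + 0.5 * kl_to_midpoint(q, p)


# Toy "proteomes"; a real input would be all proteins of an organism.
book_a = ffp("MKVLAAGIVGLMKVLAAG", 3)
book_b = ffp("MKVLSAGIVGIMKVLSAG", 3)
distance_ab = jensen_shannon(book_a, book_b)
```

A matrix of such pairwise distances could then be passed to a distance-based tree-building method. The choice of k strongly affects the result and would need to be tuned for real data.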


2013
Vol 30 (5)
pp. 1032-1037
Author(s):
Jinkui Cheng
Fuliang Cao
Zhihua Liu

Abstract Alignment-based phylogenetic analysis faces major challenges when dealing with whole-genome sequences, for example, recombination, shuffling, and rearrangement of sequences. Thus, various alignment-free methods for phylogeny construction have been proposed. However, most of these methods have not been implemented as tools or web servers, so researchers cannot easily use them with their own data sets. To facilitate the usage of various alignment-free methods, we implemented most of the popular alignment-free methods and constructed a user-friendly web server for alignment-free genome phylogeny (AGP). AGP integrates phylogenetic tree construction, visualization, and comparison functions. Both AGP and all source code of the methods are available at http://www.herbbol.org:8000/agp (last accessed February 26, 2013). AGP will facilitate research in the field of whole-genome phylogeny and comparison.


2012
Vol 93 (11)
pp. 2326-2336
Author(s):
S. J. Lycett
G. Baillie
E. Coulter
S. Bhatt
P. Kellam
...

Swine have often been considered as a mixing vessel for different influenza strains. In order to assess their role in more detail, we undertook a retrospective sequencing study to detect and characterize the reassortants present in European swine and to estimate the rate of reassortment between H1N1, H1N2 and H3N2 subtypes with Eurasian (avian-like) internal protein-coding segments. We analysed 69 newly obtained whole genome sequences of subtypes H1N1–H3N2 from swine influenza viruses sampled between 1982 and 2008, using Illumina and 454 platforms. Analyses of these genomes, together with previously published genomes, revealed a large monophyletic clade of Eurasian swine-lineage polymerase segments containing H1N1, H1N2 and H3N2 subtypes. We subsequently examined reassortments between the haemagglutinin and neuraminidase segments and estimated the reassortment rates between lineages using a recently developed evolutionary analysis method. High rates of reassortment between H1N2 and H1N1 Eurasian swine lineages were detected in European strains, with an average of one reassortment every 2–3 years. This rapid reassortment results from co-circulating lineages in swine, and in consequence we should expect further reassortments between currently circulating swine strains and the recent swine-origin H1N1v pandemic strain.


2018
Author(s):
Alison Cloutier
Timothy B. Sackton
Phil Grayson
Michele Clamp
Allan J. Baker
...

Abstract Palaeognathae represent one of the two basal lineages in modern birds, and comprise the volant (flighted) tinamous and the flightless ratites. Resolving palaeognath phylogenetic relationships has historically proved difficult, and short internal branches separating major palaeognath lineages in previous molecular phylogenies suggest that extensive incomplete lineage sorting (ILS) might have accompanied a rapid ancient divergence. Here, we investigate palaeognath relationships using genome-wide data sets of three types of noncoding nuclear markers, together totalling 20,850 loci and over 41 million base pairs of aligned sequence data. We recover a fully resolved topology placing rheas as the sister to kiwi and emu + cassowary that is congruent across marker types for two species tree methods (MP-EST and ASTRAL-II). This topology is corroborated by patterns of insertions for 4,274 CR1 retroelements identified from multi-species whole genome screening, and is robustly supported by phylogenomic subsampling analyses, with MP-EST demonstrating particularly consistent performance across subsampling replicates as compared to ASTRAL. In contrast, analyses of concatenated data supermatrices recover rheas as the sister to all other non-ostrich palaeognaths, an alternative that lacks retroelement support and shows inconsistent behavior under subsampling approaches. While statistically supporting the species tree topology, conflicting patterns of retroelement insertions also occur and imply high amounts of ILS across short successive internal branches, consistent with observed patterns of gene tree heterogeneity. Coalescent simulations indicate that the majority of observed topological incongruence among gene trees is consistent with coalescent variation rather than arising from gene tree estimation error alone, and estimated branch lengths for short successive internodes in the inferred species tree fall within the theoretical range encompassing the anomaly zone.
Distributions of empirical gene trees confirm that the most common gene tree topology for each marker type differs from the species tree, signifying the existence of an empirical anomaly zone in palaeognaths.

