Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data

2021 ◽  
Author(s):  
Xing-Xing Shen ◽  
Jacob L Steenwyk ◽  
Antonis Rokas

Abstract Topological conflict or incongruence is widespread in phylogenomic data. Concatenation- and coalescent-based approaches often result in incongruent topologies, but the causes of this conflict can be difficult to characterize. We examined incongruence stemming from conflict between likelihood-based signal (quantified by the difference in gene-wise log likelihood score or ΔGLS) and quartet-based topological signal (quantified by the difference in gene-wise quartet score or ΔGQS) for every gene in three phylogenomic studies in animals, fungi, and plants, which were chosen because their concatenation-based IQ-TREE (T1) and quartet-based ASTRAL (T2) phylogenies are known to produce eight conflicting internal branches (bipartitions). By comparing the types of phylogenetic signal for all genes in these three data matrices, we found that 30%-36% of genes in each data matrix are inconsistent, that is, each of these genes has higher log likelihood score for T1 versus T2 (i.e., ΔGLS > 0) whereas its T1 topology has lower quartet score than its T2 topology (i.e., ΔGQS < 0) or vice versa. Comparison of inconsistent and consistent genes using a variety of metrics (e.g., evolutionary rate, gene tree topology, distribution of branch lengths, hidden paralogy, and gene tree discordance) showed that inconsistent genes are more likely to recover neither T1 nor T2 and have higher levels of gene tree discordance than consistent genes. Simulation analyses demonstrate that removal of inconsistent genes from datasets with low levels of incomplete lineage sorting (ILS) and low and medium levels of gene tree estimation error (GTEE) reduced incongruence and increased accuracy. In contrast, removal of inconsistent genes from datasets with medium and high ILS levels and high GTEE levels eliminated or extensively reduced incongruence, but the resulting congruent species phylogenies were not always topologically identical to the true species trees.
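
The consistent/inconsistent distinction described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `GeneSignal` record, the `classify` function, and the "uninformative" label for genes with a zero difference are assumptions introduced here, and the per-gene scores are presumed to have been computed elsewhere (for example by IQ-TREE and ASTRAL).

```python
from typing import NamedTuple


class GeneSignal(NamedTuple):
    """Per-gene scores for the two competing species trees (hypothetical record)."""
    name: str
    lnl_t1: float  # gene-wise log likelihood under T1 (concatenation tree)
    lnl_t2: float  # gene-wise log likelihood under T2 (quartet-based tree)
    qs_t1: float   # gene-wise quartet score for T1
    qs_t2: float   # gene-wise quartet score for T2


def classify(gene: GeneSignal) -> str:
    """Label a gene by whether its two signal types favour the same tree."""
    delta_gls = gene.lnl_t1 - gene.lnl_t2  # ΔGLS
    delta_gqs = gene.qs_t1 - gene.qs_t2    # ΔGQS
    if delta_gls == 0 or delta_gqs == 0:
        # Assumption: a zero difference favours neither tree.
        return "uninformative"
    if (delta_gls > 0) == (delta_gqs > 0):
        return "consistent"
    return "inconsistent"


genes = [
    GeneSignal("g1", -1000.0, -1005.0, 120.0, 110.0),  # both favour T1
    GeneSignal("g2", -2000.0, -2003.0, 90.0, 95.0),    # likelihood: T1, quartets: T2
    GeneSignal("g3", -1500.0, -1498.0, 70.0, 80.0),    # both favour T2
]
labels = {g.name: classify(g) for g in genes}
inconsistent_fraction = sum(v == "inconsistent" for v in labels.values()) / len(genes)
```

The toy scores above are invented for illustration only; they are not taken from the three data matrices in the study.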

2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Sarah Christensen ◽  
Erin K. Molloy ◽  
Pranjal Vachaspati ◽  
Ananya Yammanuru ◽  
Tandy Warnow

Abstract Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present. Results: Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or horizontal gene transfer (HGT). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) Problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson−Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
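
The Robinson−Foulds distance that RF-OTRC minimizes counts the bipartitions found in one tree but not the other. The sketch below computes only that distance for two trees on the same leaf set; it is not TRACTION and performs no refinement or completion. The nested-tuple tree encoding and the function names `leaves`, `bipartitions`, and `rf_distance` are assumptions made for this illustration.

```python
def leaves(tree):
    """Return the leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) of the unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalized RF distance: size of the symmetric difference of splits."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


gene_tree = ((("A", "B"), "C"), ("D", "E"))
species_tree = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(gene_tree, species_tree)
```

The two example trees share the split separating D and E from the rest and differ in how A, B, and C are grouped, so each contributes one unmatched split.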


2022 ◽  
Vol 12 ◽  
Author(s):  
Martha Kandziora ◽  
Petr Sklenář ◽  
Filip Kolář ◽  
Roswitha Schmickl

A major challenge in phylogenetics and -genomics is to resolve young rapidly radiating groups. The fast succession of species increases the probability of incomplete lineage sorting (ILS), and different topologies of the gene trees are expected, leading to gene tree discordance, i.e., not all gene trees represent the species tree. Phylogenetic discordance is common in phylogenomic datasets, and apart from ILS, additional sources include hybridization, whole-genome duplication, and methodological artifacts. Despite a high degree of gene tree discordance, species trees are often well supported and the sources of discordance are not further addressed in phylogenomic studies, which can eventually lead to incorrect phylogenetic hypotheses, especially in rapidly radiating groups. We chose the high-Andean Asteraceae genus Loricaria to shed light on the potential sources of phylogenetic discordance and generated a phylogenetic hypothesis. By accounting for paralogy during gene tree inference, we generated a species tree based on hundreds of nuclear loci, using Hyb-Seq, and a plastome phylogeny obtained from off-target reads during target enrichment. We observed a high degree of gene tree discordance, which we found implausible at first sight, because the genus did not show evidence of hybridization in previous studies. We used various phylogenomic analyses (trees and networks) as well as the D-statistics to test for ILS and hybridization, which we developed into a workflow on how to tackle phylogenetic discordance in recent radiations. We found strong evidence for ILS and hybridization within the genus Loricaria. Low genetic differentiation was evident between species located in different Andean cordilleras, which could be indicative of substantial introgression between populations, promoted during Pleistocene glaciations, when alpine habitats shifted creating opportunities for secondary contact and hybridization.
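
For readers unfamiliar with the D-statistic mentioned above, the following is a minimal sketch of the standard ABBA-BABA calculation, D = (ABBA − BABA) / (ABBA + BABA), for a four-taxon arrangement (((P1, P2), P3), outgroup). It is not the workflow used in the study: the function name `d_statistic`, the single-sequence-per-taxon input, and the toy alignment are assumptions, and assessing significance (usually done with a block jackknife) is left out.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns and return (D, abba, baba).

    The outgroup allele is treated as ancestral ("A"); a site is counted
    only when P3 carries a different, derived allele ("B").
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue  # skip gaps and ambiguous bases
        if c == o:
            continue  # P3 is not derived at this site
        if a == o and b == c:
            abba += 1  # P2 shares the derived allele with P3
        elif a == c and b == o:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / (abba + baba), abba, baba


# Toy alignment, invented for illustration only.
d, n_abba, n_baba = d_statistic(p1="AGAA", p2="GAGA", p3="GGGA", outgroup="AAAA")
```

Under incomplete lineage sorting alone the two patterns are expected in equal numbers (D near 0), whereas a sustained excess of one pattern is commonly read as a signal of introgression.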


Author(s):  
Diego F Morales-Briones ◽  
Gudrun Kadereit ◽  
Delphine T Tefarikis ◽  
Michael J Moore ◽  
Stephen A Smith ◽  
...  

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. 
[Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]


2019 ◽  
Author(s):  
Diego F. Morales-Briones ◽  
Gudrun Kadereit ◽  
Delphine T. Tefarikis ◽  
Michael J. Moore ◽  
Stephen A. Smith ◽  
...  

Abstract Gene tree discordance in large genomic datasets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The dataset included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations.


2017 ◽  
Author(s):  
Erin K. Molloy ◽  
Tandy Warnow

Abstract Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting.
Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.


2020 ◽  
Vol 70 (1) ◽  
pp. 49-66 ◽  
Author(s):  
Paul M Hime ◽  
Alan R Lemmon ◽  
Emily C Moriarty Lemmon ◽  
Elizabeth Prendini ◽  
Jeremy M Brown ◽  
...  

Abstract Molecular phylogenies have yielded strong support for many parts of the amphibian Tree of Life, but poor support for the resolution of deeper nodes, including relationships among families and orders. To clarify these relationships, we provide a phylogenomic perspective on amphibian relationships by developing a taxon-specific Anchored Hybrid Enrichment protocol targeting hundreds of conserved exons which are effective across the class. After obtaining data from 220 loci for 286 species (representing 94% of the families and 44% of the genera), we estimate a phylogeny for extant amphibians and identify gene tree–species tree conflict across the deepest branches of the amphibian phylogeny. We perform locus-by-locus genealogical interrogation of alternative topological hypotheses for amphibian monophyly, focusing on interordinal relationships. We find that phylogenetic signal deep in the amphibian phylogeny varies greatly across loci in a manner that is consistent with incomplete lineage sorting in the ancestral lineage of extant amphibians. Our results overwhelmingly support amphibian monophyly and a sister relationship between frogs and salamanders, consistent with the Batrachia hypothesis. Species tree analyses converge on a small set of topological hypotheses for the relationships among extant amphibian families. These results clarify several contentious portions of the amphibian Tree of Life, which, in conjunction with a set of vetted fossil calibrations, support a surprisingly younger timescale for crown and ordinal amphibian diversification than previously reported. More broadly, our study provides insight into the sources, magnitudes, and heterogeneity of support across loci in phylogenomic data sets. [AIC; Amphibia; Batrachia; Phylogeny; gene tree–species tree discordance; genomics; information theory.]


2021 ◽  
Author(s):  
Edeline Gagnon ◽  
Rebecca Hilgenhof ◽  
Andrés Orejuela ◽  
Angela McDonnell ◽  
Gaurav Sablok ◽  
...  

Increased volumes of phylogenomic data have revealed incongruent topologies in gene trees, both between and within genomes across many organisms. Some of these incongruences indicate polytomies that may remain impossible to resolve. Here, widespread gene-tree discordance is uncovered along the backbone of Solanum, one of the largest flowering plant genera that includes the cultivated potato, tomato, and eggplant, as well as 24 minor crop plants. First, a densely sampled species-level phylogeny of Solanum is built using unpublished and publicly available Sanger sequences comprising 60% of all accepted species (742 spp.) and nine regions (ITS, waxy, and seven plastid markers). The robustness of the Sanger-based topology is tested by examining a plastome dataset with 140 species and a nuclear target-capture dataset with 39 species of Solanum. Clear incongruences between species trees generated from the supermatrix, plastome, and nuclear target-capture datasets are revealed. Discordance within the plastome and target-capture datasets is found at different evolutionary depths in three different areas along the backbone of these phylogenetic trees, with polytomy tests suggesting that most of these nodes have short branches and should be collapsed. We argue that incomplete lineage sorting due to rapid diversification is the most likely cause behind these polytomies, and that embracing the uncertainty that underlies them is crucial to depict the evolution of large and rapidly radiating lineages.


2015 ◽  
Author(s):  
Fabio K. Mendes ◽  
Matthew W. Hahn

Substitution rates are known to be variable among genes, chromosomes, species, and lineages due to multifarious biological processes. Here we consider another source of substitution rate variation due to a technical bias associated with gene tree discordance, which has been found to be rampant in genome-wide datasets, often due to incomplete lineage sorting (ILS). This apparent substitution rate variation is caused when substitutions that occur on discordant gene trees are analyzed in the context of a single, fixed species tree. Such substitutions have to be resolved by proposing multiple substitutions on the species tree, and we therefore refer to this phenomenon as "SPILS" (Substitutions Produced by Incomplete Lineage Sorting). We use simulations to demonstrate that SPILS has a larger effect with increasing levels of ILS, and on trees with larger numbers of taxa. Specific branches of the species trees are consistently, but erroneously, inferred to be longer or shorter, and we show that these branches can be predicted based on discordant tree topologies. Moreover, we observe that fixing a species tree topology when performing tests of positive selection increases the false positive rate, particularly for genes whose discordant topologies are most affected by SPILS. Finally, we use data from multiple Drosophila species to show that SPILS can be detected in nature. While the effects of SPILS are modest per gene, it has the potential to affect substitution rate variation whenever high levels of ILS are present, particularly in rapid radiations. The problems outlined here have implications for character mapping of any type of trait, and for any biological process that causes discordance. We discuss possible solutions to these problems, and areas in which they are likely to have caused faulty inferences of convergence and accelerated evolution.


2020 ◽  
Author(s):  
Liming Cai ◽  
Zhenxiang Xi ◽  
Emily Moriarty Lemmon ◽  
Alan R Lemmon ◽  
Austin Mast ◽  
...  

Abstract The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order includes nine of the top ten most unstable nodes in angiosperms, which have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 10.0%, 34.8%, and 21.4% of the variation, respectively. 
Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.
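
The triplet-frequency reasoning in this abstract rests on a standard expectation: under incomplete lineage sorting alone, the two minor (discordant) topologies of a rooted triplet should occur equally often among gene trees, so a strong asymmetry hints at an additional process such as gene flow. The sketch below is a simplified symmetry check, not the authors' method, which also accounts for gene tree estimation error. The function name `minor_triplet_asymmetry` and the example counts are assumptions.

```python
import math


def minor_triplet_asymmetry(count_minor1, count_minor2):
    """Chi-square test (1 d.f.) that two minor triplet topologies are equally frequent.

    Returns (chi_square, p_value). The p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with one
    degree of freedom, so no external statistics library is needed.
    """
    total = count_minor1 + count_minor2
    if total == 0:
        raise ValueError("no discordant gene trees to test")
    expected = total / 2
    chi_square = ((count_minor1 - expected) ** 2 + (count_minor2 - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical counts of gene trees supporting each minor topology.
chi_sq_balanced, p_balanced = minor_triplet_asymmetry(50, 50)
chi_sq_skewed, p_skewed = minor_triplet_asymmetry(70, 30)
```

With many triplets tested across a tree, the resulting p-values would need a multiple-testing correction, which this sketch omits.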


2020 ◽  
Author(s):  
Michael J. Sanderson ◽  
Michelle M. McMahon ◽  
Mike Steel

Abstract Terraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree–species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.

