Large-scale Integrative Taxonomy (LIT): resolving the data conundrum for dark taxa
Abstract
New, rapid, accurate, scalable, and cost-effective species discovery and delimitation methods are needed for tackling “dark taxa”, which we here define as clades for which <10% of all species are described and the estimated diversity exceeds 1000 species. Species delimitation should be based on multiple data sources (“integrative taxonomy”), but collecting several types of data for the same specimens risks impeding a discovery process that is already too slow. We here show how this can be avoided with Large-scale Integrative Taxonomy (LIT). Preliminary species hypotheses are generated based on inexpensive data that are obtained quickly and cost-effectively in a technical exercise. The validation step is then based on a more expensive type of data that is obtained only for a few specimens selected based on objective criteria. We here use this approach to sort 18 000 scuttle flies (Diptera: Phoridae) from Sweden into 315 preliminary species hypotheses based on NGS barcode (313 bp) clusters. These clusters were subsequently validated with morphology and then used to develop quantitative indicators for predicting which barcode clusters are in conflict with morphospecies. For this purpose, we first randomly selected 100 clusters for in-depth validation with morphology. Afterwards, we used a linear model to demonstrate that the best predictors for conflict between barcode clusters and morphology are the maximum p-distance within the cluster and cluster stability across different clustering thresholds. A test of these indicators using the 215 remaining clusters reveals that these predictors correctly identify all clusters that conflict with morphology.
The morphological validation step in our study involved just 1 039 specimens (5.8% of all specimens), but a newly proposed simplified protocol would require the study of only 915 specimens (5.1%; 2.5 specimens per species), because we show that clusters without signatures of incongruence can be validated by studying only the two specimens that represent the most divergent haplotypes. To test the generality of our results across different barcode clustering techniques, we establish that the levels of conflict are similar across Objective Clustering (OC), Automatic Barcode Gap Discovery (ABGD), Poisson Tree Processes (PTP), and Refined Single Linkage (RESL), the algorithm used by the Barcode of Life Data System (BOLD) to assign Barcode Index Numbers (BINs). OC and ABGD achieved a maximum match score with morphology of 89%, while PTP was slightly less effective (84%). RESL could only be tested for a subset of the specimens because the algorithm is not public. BINs based on 277 of the original 1 714 haplotypes were 86% congruent with morphology, whereas the corresponding values were 89% for OC, 74% for PTP, and 72% for ABGD.
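The two indicators and the specimen-selection rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names (`p_distance`, `most_divergent_pair`, `stable_fraction`) are hypothetical, sequences are assumed to be pre-aligned and of equal length, and the stability measure shown here (the fraction of thresholds at which a cluster reappears with identical membership) is only one plausible way to quantify stability, chosen for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: proportion of differing sites between two
    aligned sequences, ignoring sites where either base is not A/C/G/T."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x in "ACGT" and y in "ACGT"]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def most_divergent_pair(haplotypes: dict[str, str]) -> tuple[float, tuple[str, str]]:
    """Return the maximum pairwise p-distance within one cluster together
    with the IDs of the two haplotypes that realise it. These two are the
    candidates for morphological study under the simplified protocol."""
    if not haplotypes:
        raise ValueError("cluster has no haplotypes")
    if len(haplotypes) == 1:
        only = next(iter(haplotypes))
        return 0.0, (only, only)

    def dist(pair: tuple[str, str]) -> float:
        return p_distance(haplotypes[pair[0]], haplotypes[pair[1]])

    best = max(combinations(sorted(haplotypes), 2), key=dist)
    return dist(best), best


def stable_fraction(members: set[str],
                    clusterings: dict[float, list[set[str]]]) -> float:
    """Illustrative stability proxy (an assumption, not the paper's index):
    fraction of clustering thresholds at which a cluster with exactly
    these members is recovered."""
    if not clusterings:
        raise ValueError("no clusterings supplied")
    hits = sum(any(members == cluster for cluster in clusters)
               for clusters in clusterings.values())
    return hits / len(clusterings)


if __name__ == "__main__":
    # Toy 10 bp haplotypes standing in for 313 bp barcodes.
    cluster = {"h1": "ACGTACGTAC", "h2": "ACGTACGTAT", "h3": "TCGTACGTAT"}
    max_dist, pair = most_divergent_pair(cluster)
    print(f"max p-distance = {max_dist:.3f}, study specimens for {pair}")
```

In such a sketch, a cluster with a high maximum p-distance or a low stable fraction would be flagged for in-depth morphological validation, while the remaining clusters would be checked using only the returned pair of haplotypes.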