PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

Sebastian Niehus; Hákon Jónsson; Janina Schönberger; Eythór Björnsson; Doruk Beyter; Hannes P. Eggertsson; Patrick Sulem; Kári Stefánsson; Bjarni V. Halldórsson; Birte Kehr

doi:10.1038/s41467-020-20850-5

PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

Nature Communications; doi:10.1038/s41467-020-20850-5; 2021; Vol 12 (1)

Author(s): Sebastian Niehus; Hákon Jónsson; Janina Schönberger; Eythór Björnsson; Doruk Beyter; ...

Keyword(s): Large Scale; De Novo; Sequence Data; Whole Genome Sequence; Medium Size; Phenotypic Traits; Structural Variants; High Confidence; Sequencing Studies; Genomic Structural Variants

Abstract Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes, as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets, PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6,794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enable routine scans for deletions in large-scale sequencing studies.
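The abstract uses genotype inheritance patterns in trios as a reliability measure. As a rough illustration only, the sketch below checks whether a child's deletion genotype is consistent with Mendelian inheritance from its parents at an autosomal locus. It is not PopDel's evaluation code: the genotype encoding (0, 1 or 2 copies of the deletion allele) and the function names `transmissible`, `is_mendelian_consistent` and `consistency_rate` are assumptions made for this example.

```python
from itertools import product


def transmissible(genotype):
    """Return the allele counts (0 = reference, 1 = deletion) a parent can pass on.

    Assumed encoding: genotype is the number of deletion alleles (0, 1 or 2).
    """
    options = {0: {0}, 1: {0, 1}, 2: {1}}
    if genotype not in options:
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return options[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can arise from one allele of each parent."""
    return any(
        f + m == child
        for f, m in product(transmissible(father), transmissible(mother))
    )


def consistency_rate(trios):
    """Fraction of (father, mother, child) genotype triples that are consistent."""
    trios = list(trios)
    if not trios:
        raise ValueError("no trios given")
    return sum(is_mendelian_consistent(*t) for t in trios) / len(trios)


if __name__ == "__main__":
    # Hypothetical genotypes at one locus for three trios.
    example = [(0, 1, 1), (2, 0, 1), (0, 0, 1)]
    print(consistency_rate(example))
```

An inconsistent triple such as (0, 0, 1) is either a genotyping error or a candidate de novo deletion, so a check of this kind can only flag such calls; it cannot distinguish between the two explanations.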

PopDel identifies medium-size deletions jointly in tens of thousands of genomes

doi:10.1101/740225; 2019; Cited By ~ 2

Author(s): Sebastian Niehus; Hákon Jónsson; Janina Schönberger; Eythór Björnsson; Doruk Beyter; ...

Keyword(s): Large Scale; De Novo; Sequence Data; Whole Genome Sequence; Medium Size; Phenotypic Traits; Structural Variants; Novel Approach; Sequencing Studies; Genomic Structural Variants

Abstract Thousands of genomic structural variants segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. We describe a novel approach, PopDel, which jointly identifies deletions of about 500 to at least 10,000 bp in length in many genomes together. PopDel scales to tens of thousands of genomes, as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets, PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6,794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enable routine scans for deletions in large-scale sequencing studies.

Whole-genome sequence data suggests environmental adaptation of Ethiopian sheep populations

Genome Biology and Evolution; doi:10.1093/gbe/evab014; 2021

Author(s): Pamela Wiener; Christelle Robert; Abulgasim Ahbara; Mazdak Salavati; Ayele Abebe; ...

Keyword(s): High Altitude; Environmental Variables; Large Scale; Sequence Data; Strong Association; Environmental Adaptation; Whole Genome Sequence; Single Nucleotide Variants; High Altitude Adaptation; Altitude Adaptation

Abstract Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation, although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.

Genotyping structural variants in pangenome graphs using the vg toolkit

doi:10.1101/654566; 2019; Cited By ~ 7

Author(s): Glenn Hickey; David Heller; Jean Monlong; Jonas A. Sibbesen; Jouni Sirén; ...

Keyword(s): De Novo; State Of The Art; Effective Means; Point Mutations; Structural Variants; Short Read; Yeast Strains; Sequencing Studies; Long Read

Abstract Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.

Selective ancestral sorting and de novo evolution in the agricultural invasion of Amaranthus tuberculatus

doi:10.1101/2021.07.26.453853; 2021

Author(s): Julia M. Kreiner; Amalia Caballero; Stephen I. Wright; John R. Stinchcombe

Keyword(s): De Novo; Sequence Data; Sex Differentiation; Common Garden; Whole Genome Sequence; Secondary Contact; Natural Environments; Relative Role; Standing Variation; Amaranthus Tuberculatus

The relative role of hybridization, de novo evolution, and standing variation in weed adaptation to agricultural environments is largely unknown. In Amaranthus tuberculatus, a widespread North American agricultural weed, adaptation is likely influenced by recent secondary contact and admixture of two previously isolated subspecies. We characterized the extent of adaptation and phenotypic differentiation accompanying the spread of A. tuberculatus into agricultural environments and the contribution of subspecies divergence. We generated phenotypic and whole-genome sequence data from a manipulative common garden experiment, using paired samples from natural and agricultural populations. We found strong latitudinal, longitudinal, and sex differentiation in phenotypes, and subtle differences among agricultural and natural environments that were further resolved with ancestry-based inference. The transition into agricultural environments has favoured southwestern var. rudis ancestry that leads to higher biomass and environment-specific phenotypes: increased biomass and earlier flowering under reduced water availability, and reduced plasticity in fitness-related traits. We also detected de novo adaptation to agricultural habitats independent of ancestry effects, including marginally higher biomass and later flowering in agricultural populations, and a time to germination home advantage. Therefore, the invasion of A. tuberculatus into agricultural environments has drawn on adaptive variation across multiple timescales—through both preadaptation via the preferential sorting of var. rudis ancestry and de novo local adaptation.

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

doi:10.7287/peerj.preprints.220v1; 2014

Author(s): Jason W Sahl; Greg Caporaso; David A Rasko; Paul S Keim

Keyword(s): Large Scale; Sequence Data; Parallel Implementation; Genetic Relationships; Clinical Diagnostics; Whole Genome Sequence; Bacterial Isolates; Bacterial Genomes; E Coli; Blast Score

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. 
Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
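The abstract describes a matrix of BSR values for every coding sequence in every genome but does not spell out the ratio itself. The sketch below assumes the commonly described definition, in which a BLAST score ratio is the bit score of a query aligned against a genome divided by the bit score of the query aligned against itself. It is an illustration under that assumption, not the LS-BSR implementation; the function names and the dictionary layout are hypothetical, and the bit scores are made up.

```python
def blast_score_ratio(query_bit_score, reference_bit_score):
    """Ratio of the alignment bit score in a genome to the self-alignment bit score."""
    if reference_bit_score <= 0:
        raise ValueError("reference (self-alignment) bit score must be positive")
    return query_bit_score / reference_bit_score


def bsr_matrix(self_scores, genome_scores):
    """Build a CDS-by-genome table of BSR values.

    self_scores:   {cds_id: bit score of the CDS aligned against itself}
    genome_scores: {genome_id: {cds_id: best bit score of the CDS in that genome}}
    A CDS with no hit in a genome is assigned a BSR of 0.0.
    """
    return {
        cds: {
            genome: round(blast_score_ratio(hits.get(cds, 0.0), ref_score), 3)
            for genome, hits in genome_scores.items()
        }
        for cds, ref_score in self_scores.items()
    }


if __name__ == "__main__":
    # Made-up bit scores for two CDSs in two genomes.
    self_scores = {"cdsA": 200.0, "cdsB": 100.0}
    genome_scores = {
        "genome1": {"cdsA": 200.0, "cdsB": 40.0},
        "genome2": {"cdsA": 150.0},
    }
    for cds, row in bsr_matrix(self_scores, genome_scores).items():
        print(cds, row)
```

Under this definition, values near 1 indicate a CDS that is highly conserved in a genome, and values near 0 indicate a CDS that is absent or highly divergent.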

The Genome Sequence of the Anthelmintic-Susceptible New Zealand Haemonchus contortus

Genome Biology and Evolution; doi:10.1093/gbe/evz141; 2019; Vol 11 (7); pp. 1965-1970; Cited By ~ 7

Author(s): Nikola Palevich; Paul H Maclean; Abdul Baten; Richard W Scott; David M Leathwick

Keyword(s): Genome Sequence; Molecular Mechanisms; De Novo; Sequence Data; Animal Health; Draft Genome; Whole Genome Sequence; Parasitic Nematodes; Hybrid Assembly; Genetic Structures

Abstract Internal parasitic nematodes are a global animal health issue causing drastic losses in livestock. Here, we report a H. contortus representative draft genome to serve as a genetic resource to the scientific community and support future experimental research of molecular mechanisms in related parasites. A de novo hybrid assembly was generated from PCR-free whole genome sequence data, resulting in a chromosome-level assembly that is 465 Mb in size encoding 22,341 genes. The genome sequence presented here is consistent with the genome architecture of the existing Haemonchus species and is a valuable resource for future studies regarding population genetic structures of parasitic nematodes. Additionally, comparative pan-genomics with other species of economically important parasitic nematodes has revealed highly open genomes and strong collinearities within the phylum Nematoda.

Characterizing mutagenic effects of recombination through a sequence-level genetic map

Science; doi:10.1126/science.aau1043; 2019; Vol 363 (6425); pp. eaau1043; Cited By ~ 62

Author(s): Bjarni V. Halldorsson; Gunnar Palsson; Olafur A. Stefansson; Hakon Jonsson; Marteinn T. Hardarson; ...

Keyword(s): Genetic Map; Meiotic Recombination; De Novo; Sequence Data; Mutagenic Effect; Whole Genome Sequence; De Novo Mutation; Base Pairs; Males And Females; Mutagenic Effects

Genetic diversity arises from recombination and de novo mutation (DNM). Using a combination of microarray genotype and whole-genome sequence data on parent-child pairs, we identified 4,531,535 crossover recombinations and 200,435 DNMs. The resulting genetic map has a resolution of 682 base pairs. Crossovers exhibit a mutagenic effect, with overrepresentation of DNMs within 1 kilobase of crossovers in males and females. In females, a higher mutation rate is observed up to 40 kilobases from crossovers, particularly for complex crossovers, which increase with maternal age. We identified 35 loci associated with the recombination rate or the location of crossovers, demonstrating extensive genetic control of meiotic recombination, and our results highlight genes linked to the formation of the synaptonemal complex as determinants of crossovers.

Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants

Nature Communications; doi:10.1038/s41467-019-12174-w; 2019; Vol 10 (1); Cited By ~ 12

Author(s): Yaoxi He; Xin Luo; Bin Zhou; Ting Hu; Xiaoyu Meng; ...

Keyword(s): Rhesus Macaque; Genome Assembly; De Novo; Gene Annotation; Large Body; Phenotypic Traits; Structural Variants; De Novo Genome Assembly; Chinese Rhesus Macaque; Long Read

Abstract We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.

Rapid low-cost assembly of the Drosophila melanogaster reference genome using low-coverage, long-read sequencing

doi:10.1101/267401; 2018; Cited By ~ 6

Author(s): Edwin A. Solares; Mahul Chakraborty; Danny E. Miller; Shannon Kalsow; Kate Hall; ...

Keyword(s): Drosophila Melanogaster; Genetic Variation; Large Scale; Reference Genome; De Novo; Low Cost; Nucleotide Polymorphisms; Structural Variants; High Coverage; Reference Assembly

Abstract Accurate and comprehensive characterization of genetic variation is essential for deciphering the genetic basis of diseases and other phenotypes. A vast amount of genetic variation stems from large-scale sequence changes arising from the duplication, deletion, inversion, and translocation of sequences. In the past 10 years, high-throughput short reads have greatly expanded our ability to assay sequence variation due to single nucleotide polymorphisms. However, a recent de novo assembly of a second Drosophila melanogaster reference genome has revealed that short-read genotyping methods miss hundreds of structural variants, including those affecting phenotypes. While genomes assembled using high-coverage long reads can achieve high levels of contiguity and completeness, concerns about cost, errors, and low yield have limited widespread adoption of such sequencing approaches. Here we resequenced the reference strain of D. melanogaster (ISO1) on a single Oxford Nanopore MinION flow cell run for 24 hours. Using only reads longer than 1 kb or with at least 30x coverage, we assembled a highly contiguous de novo genome. The addition of inexpensive paired reads and subsequent scaffolding using an optical map technology achieved an assembly with completeness and contiguity comparable to the D. melanogaster reference assembly. Comparison of our assembly to the reference assembly of ISO1 uncovered a number of structural variants (SVs), including novel LTR transposable element insertions and duplications affecting genes with developmental, behavioral, and metabolic functions. Collectively, these SVs provide a snapshot of the dynamics of genome evolution. Furthermore, our assembly and comparison to the D. melanogaster reference genome demonstrate that high-quality de novo assembly of reference genomes and comprehensive variant discovery using such assemblies are now possible by a single lab for under $1,000 (USD).

Finding functional disease-associated non-coding variation using next-generation sequencing

doi:10.1101/060285; 2016

Author(s): Paolo Devanna; Xiaowei Sylvia Chen; Joses Ho; Dario Gajewski; Alessandro Gialluisi; ...

Keyword(s): Next Generation Sequencing; Binding Sites; Large Scale; Sequence Data; Whole Genome Sequence; Next Generation Sequencing Data; Whole Genome; Next Generation; Whole Exome; Generation Sequencing

Abstract Next-generation sequencing has opened the way for the large-scale interrogation of cohorts at the whole exome or whole genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein-coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large-scale analysis, as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language-impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to specific language impairment (SLI). This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. 
We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
