Identification and Quantification of Genomic Repeats and Sample Contamination in Assemblies of 454 Pyrosequencing Reads

Sequencing ◽  
2010 ◽  
Vol 2010 ◽  
pp. 1-12 ◽  
Author(s):  
Alexander J. Nederbragt ◽  
Trine Ballestad Rounge ◽  
Kyrre L. Kausrud ◽  
Kjetill S. Jakobsen

Contigs assembled from 454 reads from bacterial genomes demonstrate a range of read depths, with a number of contigs having a depth that is far higher than can be expected. For reference genome sequence datasets, there exists a high correlation between the contig specific read depth and the number of copies present in the genome. We developed a sequence of applied statistical analyses, which suggest that the number of copies present can be reliably estimated based on the read depth distribution in de novo genome assemblies. Read depths of contigs of de novo cyanobacterial genome assemblies were determined, and several high read depth contigs were identified. These contigs were shown to mainly contain genes that are known to be present in multiple copies in bacterial genomes. For these assemblies, a correlation between read depth and copy number was experimentally demonstrated using real-time PCR. Copy number estimates, obtained using the statistical analysis developed in this work, are presented. Per-contig read depth analysis of assemblies based on 454 reads therefore enables de novo detection of genomic repeats and estimation of the copy number of these repeats. Additionally, our analysis efficiently identified contigs stemming from sample contamination, allowing for their removal from the assembly.
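The relationship between contig read depth and copy number described above can be illustrated with a minimal sketch. This is not the statistical analysis developed by the authors, which the abstract does not specify; it simply assumes that most contigs are single-copy, so that the median contig depth approximates single-copy depth. The function name and the example depths are hypothetical.

```python
from statistics import median


def estimate_copy_numbers(contig_depths):
    """Return {contig: (depth_ratio, rounded_copy_estimate)}.

    Assumption (not from the abstract): the median depth across contigs
    represents the depth of a single-copy region.
    """
    baseline = median(contig_depths.values())
    if baseline <= 0:
        raise ValueError("median read depth must be positive")
    estimates = {}
    for name, depth in contig_depths.items():
        ratio = depth / baseline
        # A rounded estimate of 0 marks a contig far below the baseline depth.
        estimates[name] = (ratio, round(ratio))
    return estimates


# Hypothetical depths: three contigs near 20x and one near 60x.
depths = {"c1": 20.0, "c2": 21.0, "c3": 19.0, "c4": 61.0}
result = estimate_copy_numbers(depths)
```

In this toy input the contig at roughly three times the median depth receives a copy estimate of 3, while contigs with a ratio far below 1 would be candidates for closer inspection, for example as possible contamination.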

2015 ◽  
Vol 172 (6) ◽  
pp. 803-811 ◽  
Author(s):  
Maya B Lodish ◽  
Bo Yuan ◽  
Isaac Levy ◽  
Glenn D Braunstein ◽  
Charalampos Lyssikatos ◽  
...  

Objective: We have recently reported five patients with bilateral adrenocortical hyperplasia (BAH) and Cushing's syndrome (CS) caused by constitutive activation of the catalytic subunit of protein kinase A (PRKACA). Through a new in-depth analysis of their cytogenetic abnormality, we attempted a better genotype–phenotype correlation of their PRKACA amplification. Design: This study is a case series. Methods: Molecular cytogenetic, genomic, clinical, and histopathological analyses were performed in five patients with CS. Results: Reinvestigation of the defects of previously described patients by state-of-the-art molecular cytogenetics showed complex genomic rearrangements in the chromosome 19p13.2p13.12 locus, resulting in copy number gains encompassing the entire PRKACA gene; three patients (one sporadic case and two related cases) were observed with gains consistent with duplications, while two sporadic patients were observed with gains consistent with triplications. Although all five patients presented with ACTH-independent CS, the three sporadic patients had micronodular BAH and underwent bilateral adrenalectomy in early childhood, whereas the two related patients, a mother and a son, presented with macronodular BAH as adults. In at least one patient, PRKACA triplication was associated with a more severe phenotype. Conclusions: Constitutional chromosomal PRKACA gene amplification is a recently identified genetic defect associated with CS, a trait that may be inherited in an autosomal dominant manner or occur de novo. Genomic rearrangements can be complex and can result in different copy number states of dosage-sensitive genes, e.g., duplication and triplication. PRKACA amplification can lead to variable phenotypes clinically and pathologically, both micro- and macro-nodular BAH, the latter of which we speculate may depend on the extent of amplification.


2014 ◽  
Author(s):  
Sean D Smith ◽  
Joseph K Kawash ◽  
Andrey Grigoriev

Amplifications or deletions of genome segments, known as copy number variants (CNVs), have been associated with many diseases. Read depth analysis of next-generation sequencing (NGS) is an essential method of detecting CNVs. However, genome read coverage is frequently distorted by various biases of NGS platforms, which reduce predictive capabilities of existing approaches. Additionally, the use of read depth tools has been somewhat hindered by imprecise breakpoint identification. We developed GROM-RD, an algorithm that analyzes multiple biases in read coverage to detect CNVs in NGS data. We found non-uniform variance across distinct GC regions after using existing GC bias correction methods and developed a novel approach to normalize such variance. Although complex and repetitive genome segments complicate CNV detection, GROM-RD adjusts for repeat bias and uses a two-pipeline masking approach to detect CNVs in complex and repetitive segments while improving sensitivity in less complicated regions. To overcome a typical weakness of RD methods, GROM-RD employs a CNV search using size-varying overlapping windows to improve breakpoint resolution. We compared our method to two widely used programs based on read depth methods, CNVnator and RDXplorer, and observed improved CNV detection and breakpoint accuracy for GROM-RD. GROM-RD is available at http://grigoriev.rutgers.edu/software/
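The "existing GC bias correction methods" mentioned above are often variants of a median-ratio adjustment. The sketch below shows that generic idea only; it is not the GROM-RD algorithm and does not implement its variance normalization, repeat masking, or size-varying windows. The function name, the bin width, and the example values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(depths, gc_fractions, bin_width=0.05):
    """Scale each window's depth by (overall median / median of its GC bin).

    depths: read depth per genomic window
    gc_fractions: GC fraction (0..1) of the same windows
    bin_width: hypothetical GC bin size chosen for illustration
    """
    overall = median(depths)
    bins = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        bins[int(gc / bin_width)].append(depth)
    bin_median = {b: median(values) for b, values in bins.items()}

    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[int(gc / bin_width)]
        # Windows in a bin with zero median depth carry no usable signal.
        corrected.append(depth * overall / m if m > 0 else 0.0)
    return corrected


# Hypothetical windows: low-GC windows at 10x, high-GC windows at 20x.
corrected = gc_normalize([10, 10, 20, 20], [0.32, 0.33, 0.62, 0.63])
```

After correction, the windows in both GC bins sit at the overall median depth, so remaining deviations are more likely to reflect copy number than GC content. As the abstract notes, this kind of correction equalizes the centre of each GC bin but not its variance.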


2015 ◽  
Author(s):  
Steven H Wu ◽  
Rachel S Schwartz ◽  
David J Winter ◽  
Don Conrad ◽  
Reed A Cartwright

Motivation: Accurate identification of genotypes is critical in identifying de novo mutations, linking mutations with disease, and determining mutation rates. Because de novo mutations are rare, even low levels of genotyping error can cause a large fraction of false positive de novo mutations. Biological and technical processes that adversely affect genotyping include copy number variation, paralogous sequences, library preparation, sequencing error, and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was composed of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low-complexity regions. We expect that this approach to modeling the distribution of NGS data will lead to improved genotyping. For example, this approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree.
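For readers unfamiliar with the model family, the log-likelihood of a mixture of Dirichlet-multinomial distributions can be sketched as follows. This is only the standard textbook density, written under the assumption that each site is summarized by a vector of per-allele read counts; the function names are hypothetical and the authors' parameterization and fitting procedure are not shown in the abstract.

```python
from math import exp, lgamma, log


def dm_logpmf(counts, alpha):
    """Log-probability of read counts under a Dirichlet-multinomial.

    counts: reads observed per category (e.g. per allele)
    alpha: positive concentration parameters, one per category
    """
    n = sum(counts)
    a = sum(alpha)
    result = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        result += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return result


def mixture_logpmf(counts, weights, alphas):
    """Log-probability under a weighted mixture of Dirichlet-multinomials."""
    terms = [log(w) + dm_logpmf(counts, al) for w, al in zip(weights, alphas)]
    top = max(terms)
    # Log-sum-exp keeps the sum stable when the terms are very negative.
    return top + log(sum(exp(t - top) for t in terms))


# With alpha = (1, 1) every split of 10 reads is equally likely: P = 1/11.
p_uniform = exp(dm_logpmf([3, 7], [1.0, 1.0]))
p_mix = exp(mixture_logpmf([3, 7], [0.5, 0.5], [[1.0, 1.0], [1.0, 1.0]]))
```

Small concentration parameters give an overdispersed component, while large ones approach a multinomial, which is consistent with the description of the minor and major components above.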


2018 ◽  
Author(s):  
Michael Schmid ◽  
Daniel Frei ◽  
Andrea Patrignani ◽  
Ralph Schlapbach ◽  
Jürg E. Frey ◽  
...  

Generating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, we here show that Pseudomonas koreensis P19E3 harbors multiple, near-identical repeat pairs up to 70 kilobase pairs in length. Beyond long repeats, the P19E3 assembly was further complicated by a shufflon region. Its complex genome could not be de novo assembled with long reads produced by Pacific Biosciences’ technology, but required very long reads from Oxford Nanopore Technologies. Another important factor for a full genomic resolution was the choice of assembly algorithm. Importantly, a repeat analysis indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this dark matter for de novo genome assembly of prokaryotes. Several of these dark matter genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors; other genomes were closed employing labor-intensive steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assemblers capable of resolving long, near-identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.


2017 ◽  
Vol 2 ◽  
pp. 49
Author(s):  
Andrew Parrish ◽  
Richard Caswell ◽  
Garan Jones ◽  
Christopher M. Watson ◽  
Laura A. Crinnion ◽  
...  

Copy number variants (CNVs) are a major cause of disease, with over 30,000 reported in the DECIPHER database. To use read depth data from targeted Next Generation Sequencing (NGS) panels to identify CNVs with the highest degree of sensitivity, it is necessary to account for biases inherent in the data. GC content and ambiguous mapping due to repetitive sequence elements and pseudogenes are the principal components of technical variability. In addition, the algorithms used favour the detection of multi-exon CNVs, and rely on suitably matched normal dosage samples for comparison. We developed a calling strategy that subdivides target intervals, and uses pools of historical control samples to overcome these limitations in a clinical diagnostic laboratory. We compared our enhanced strategy with an unmodified pipeline using the R software package ExomeDepth, using a cohort of 109 heterozygous CNVs (91 deletions, 18 duplications in 26 genes), including 25 single exon CNVs. The unmodified pipeline detected 104/109 CNVs, giving a sensitivity with a 95% confidence interval of 89.62% to 98.49%. The detection of all 109 CNVs by our enhanced method demonstrates with 95% confidence that the sensitivity is ≥96.67%, allowing NGS read depth analysis to be used for CNV detection in a clinical diagnostic setting.
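Sensitivity bounds of this kind can be obtained from an exact binomial (Clopper-Pearson) interval. The abstract does not say which method was used, so the choice of interval here is an assumption, and the function names are hypothetical. Under this assumption, detecting 109 of 109 CNVs gives a lower bound of 0.025^(1/109), about 96.67%, which agrees with the figure quoted above.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for f(p) = target, with f monotonic in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(detected, total, alpha=0.05):
    """Two-sided exact binomial confidence interval for a proportion."""
    if detected == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= detected | p) = alpha / 2, increasing in p.
        lower = _solve(
            lambda p: 1 - binom_cdf(detected - 1, total, p), alpha / 2, True
        )
    if detected == total:
        upper = 1.0
    else:
        # Upper bound: P(X <= detected | p) = alpha / 2, decreasing in p.
        upper = _solve(lambda p: binom_cdf(detected, total, p), alpha / 2, False)
    return lower, upper


all_detected = clopper_pearson(109, 109)
unmodified = clopper_pearson(104, 109)
```

The second call computes the corresponding interval for 104 of 109 detections; whether it reproduces the reported 89.62% to 98.49% exactly depends on the method the authors used.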


2016 ◽  
Author(s):  
Ryan R. Wick ◽  
Louise M. Judd ◽  
Claire L. Gorrie ◽  
Kathryn E. Holt

The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce more complete genome assemblies, but the sequencing is more expensive and error prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate “hybrid” assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler utilises a novel semi-global aligner, which is used to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

