Methods developed during the first National Center for Biotechnology Information Structural Variation Codeathon at Baylor College of Medicine

F1000Research ◽

10.12688/f1000research.23773.1 ◽

2020 ◽

Vol 9 ◽

pp. 1141

Author(s):

Medhat Mahmoud ◽

Alejandro Rafael Gener ◽

Michael M. Khayat ◽

Adam C. English ◽

Advait Balaji ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Bioinformatic Analysis ◽

Baylor College ◽

Wide Range ◽

Number Variation ◽

Working Groups ◽

Annual Working ◽

Next Generation Sequencing Ngs ◽

Ngs Data

In October 2019, 46 scientists from around the world participated in the first National Center for Biotechnology Information (NCBI) Structural Variation (SV) Codeathon at Baylor College of Medicine. The charge of this first annual working session was to identify ongoing challenges around the topics of SV and graph genomes, and in response to design reliable methods to facilitate their study. Over three days, seven working groups each designed and developed new open-sourced methods to improve the bioinformatic analysis of genomic SVs represented in next-generation sequencing (NGS) data. The groups’ approaches addressed a wide range of problems in SV detection and analysis, including quality control (QC) assessments of metagenome assemblies and population-scale VCF files, de novo copy number variation (CNV) detection based on continuous long sequence reads, the representation of sequence variation using graph genomes, and the development of an SV annotation pipeline. A summary of the questions and developments that arose during the daily discussions between groups is outlined. The new methods are publicly available at https://github.com/NCBI-Codeathons/, and demonstrate that a codeathon devoted to SV analysis can produce valuable new insights both for participants and for the broader research community.

DiscoSnp++: de novo detection of small variants from raw unassembled read set(s)

10.1101/209965 ◽

2017 ◽

Cited By ~ 12

Author(s):

Pierre Peterlongo ◽

Chloé Riou ◽

Erwan Drezen ◽

Claire Lemaitre

Keyword(s):

Reference Genome ◽

De Novo ◽

Model Organisms ◽

Small Indels ◽

Desktop Computers ◽

Resource Requirements ◽

Computational Resources ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Source Of Information

Abstract Motivation: Next Generation Sequencing (NGS) data provide unprecedented access to life mechanisms. In particular, these data enable the detection of polymorphisms such as SNPs and indels. As these polymorphisms represent a fundamental source of information in agronomy, environment or medicine, their detection in NGS data is now a routine task. The main methods for their prediction usually need a reference genome. However, non-model organisms and highly divergent genomes such as in cancer studies are extensively investigated. Results: We propose DiscoSnp++, in which we revisit the DiscoSnp algorithm. DiscoSnp++ is designed for detecting and ranking all kinds of SNPs and small indels from raw read set(s). It outputs files in FASTA and VCF formats. In particular, predicted variants can be automatically localized afterwards on a reference genome if available. Its usage is extremely simple and its low resource requirements make it usable on common desktop computers. Results show that DiscoSnp++ performs better than state-of-the-art methods in terms of computational resources and in terms of results quality. An important novelty is the de novo detection of indels, for which we obtained 99% precision when calling indels on simulated human datasets and 90% recall on high-confidence indels from the Platinum dataset. License: GNU Affero general public license. Availability: https://github.com/GATB/[email protected]

HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data

10.1101/220830 ◽

2017 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Serafim Batzoglou ◽

Arend Sidow ◽

Lu Zhang

Keyword(s):

False Positive ◽

De Novo ◽

False Positives ◽

Sequencing Data ◽

De Novo Mutations ◽

Congenital Diseases ◽

Genome Wide ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Haplotype Information

Abstract Background: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo has been developed to detect DNMs, but at good sensitivity these tools still produce many false positive calls. Results: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80% to 99% of false positives regardless of how large the candidate DNM set is. Conclusions: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
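
The filtering rule described in this abstract can be illustrated with a minimal Python sketch. It is not the HAPDeNovo implementation: the CandidateDNM record, its field names, and the "ref"/"alt"/"het" genotype labels are hypothetical, and the sketch assumes that reads have already been assigned to haplotypes and genotyped per haplotype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateDNM:
    """A putative de novo mutation with one genotype call per haplotype.

    Hypothetical record for illustration; genotype labels are
    "ref", "alt", or "het".
    """
    chrom: str
    pos: int
    hap1_genotype: str
    hap2_genotype: str


def filter_candidates(candidates):
    """Drop candidates that are called heterozygous within a single haplotype."""
    kept = []
    for cand in candidates:
        # Reads from one haplotype should support a single allele, so a
        # "het" call inside either haplotype marks a likely false positive.
        if "het" in (cand.hap1_genotype, cand.hap2_genotype):
            continue
        kept.append(cand)
    return kept


# Made-up example candidates (not real data).
candidates = [
    CandidateDNM("chr1", 1000, "alt", "ref"),  # consistent with a phased DNM
    CandidateDNM("chr1", 2000, "het", "ref"),  # heterozygous within haplotype 1
    CandidateDNM("chr2", 3000, "ref", "het"),  # heterozygous within haplotype 2
]
kept = filter_candidates(candidates)
```

In this toy input only the first candidate survives, because the other two show a heterozygous call inside one haplotype.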

Validation of Variant Assembly Using HAPHPIPE with Next-Generation Sequence Data from Viruses

Viruses ◽

10.3390/v12070758 ◽

2020 ◽

Vol 12 (7) ◽

pp. 758 ◽

Cited By ~ 1

Author(s):

Keylie M. Gibson ◽

Margaret C. Steiner ◽

Uzma Rentia ◽

Matthew L. Bendall ◽

Marcos Pérez-Losada ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Consensus Sequence ◽

Sequence Assembly ◽

Next Generation ◽

Consensus Sequences ◽

Bioinformatic Tools ◽

Hiv Gp120 ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.

Retrospective review of subsequent treatments (tx) for esophagogastric adenocarcinomas (EGA) refractory to FOLFOX.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.4_suppl.123 ◽

2017 ◽

Vol 35 (4_suppl) ◽

pp. 123-123

Author(s):

Ciara Marie Kelly ◽

Yelena Yuriy Janjigian ◽

David Paul Kelsen ◽

Marinela Capanu ◽

Joanne F. Chou ◽

...

Keyword(s):

De Novo ◽

Complete Response ◽

Outcome Data ◽

Similar Proportion ◽

Kaplan Meier ◽

Response Status ◽

Poorly Differentiated ◽

Ecog Ps ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Background: FOLFOX is a preferred 1st-line tx for advanced EGA. We sought to characterize outcomes on subsequent tx and to see if MSK-IMPACT, a 410-gene next generation sequencing (NGS) platform, increases tx options. Methods: We retrospectively identified patients (pts) with advanced, Her2-negative EGA treated with 1st-line FOLFOX between Jan 2012 and Dec 2014. Clinicopathologic, tx and outcome data were analyzed. Overall survival (OS) was calculated from start of FOLFOX using Kaplan-Meier methods. Landmark analysis was used to compare OS and response status. Results: 185 pts were identified. The majority were Caucasian (82%), male (76%), ECOG PS 1 (67%), with poorly differentiated histology (72%) and de novo metastatic disease (84%). Median age was 64 years. The disease-control rate (DCR, partial response + stable disease) of FOLFOX was 80% [95%CI: 74%-85%]; 19% were FOLFOX primary refractory (FR). Median time-to-progression (TTP) on FOLFOX was 7 and 2 months (mo) for FOLFOX sensitive (FS) and FR pts, respectively. There was a higher proportion of females (26% vs. 14%, P = 0.18), gastric (43% vs. 23%, P = 0.051) and moderately differentiated tumors (26% vs. 12%, P = 0.113) in the FS vs. FR group. Six-mo survival from the landmark time of 2 mo after initiation of FOLFOX was 83% [95%CI: 76%-89%] and 38% [95%CI: 20%-56%] for FS and FR pts, respectively (P < 0.01). A similar proportion of FS and FR pts received 2nd-line tx (65% vs. 69%). The DCR was similar in both groups (31% vs. 29%). 2nd-line tx included irinotecan-based (51%) and taxane-based regimens (32%) or a clinical trial (CT) (13%). The median TTP on 2nd-line tx was similar in FS and FR groups (2.5 vs. 2 mo). Ramucirumab was given in 14% of 2nd-line regimens. 3rd-line chemo use was similar in both groups (37% vs. 31%) but the DCR was lower in FR pts (18% vs. 9%). 51 pts had IMPACT; 1 pt (2%) enrolled onto a genotype-matched CT. 14 pts received immunotherapy; 1 FS pt has an ongoing complete response of 1+ year. Conclusions: Surprisingly, FS and FR pts derive similar, marginal benefit from 2nd-line tx, emphasizing the appropriateness of CT options in this setting. NGS rarely expanded tx options. Updated and in-depth NGS data will be presented.

HAPHPIPE: Haplotype Reconstruction and Phylodynamics for Deep Sequencing of Intra-Host Viral Populations

Molecular Biology and Evolution ◽

10.1093/molbev/msaa315 ◽

2020 ◽

Author(s):

Matthew L Bendall ◽

Keylie M Gibson ◽

Margaret C Steiner ◽

Uzma Rentia ◽

Marcos Pérez-Losada ◽

...

Keyword(s):

Deep Sequencing ◽

De Novo ◽

Consensus Sequence ◽

Haplotype Reconstruction ◽

Consensus Sequences ◽

Genome Wide ◽

Genomic Regions ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Generation Sequencing

Abstract Deep sequencing of viral populations using next generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about the intra-host viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance, thus are not suited for full viral genome analysis. Here we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated from NGS platforms and provide quality output properly formatted for downstream evolutionary analyses.

ICGEB Workshop on Next Generation Diagnostics, 22/03/2018-24/03/2018, Macedonian Academy of Sciences and Arts, Skopje, Republic of Macedonia

PRILOZI ◽

10.2478/prilozi-2018-0053 ◽

2018 ◽

Vol 39 (2-3) ◽

pp. 137-142

Author(s):

Dimitar Efremov ◽

Momir Polenakovic

Keyword(s):

Biomarker Discovery ◽

Treatment Strategies ◽

Predictive Biomarkers ◽

Bioinformatic Analysis ◽

Next Generation ◽

Republic Of Macedonia ◽

Next Generation Sequencing Ngs ◽

Academy Of Sciences ◽

Ngs Data ◽

Potential Use

Abstract More than 200 participants from Europe, Asia, Africa and South America attended the two-day ICGEB Workshop on Next Generation Diagnostics, 22/03/2018-24/03/2018, at the Macedonian Academy of Sciences and Arts (MASA) in Skopje, Republic of Macedonia. The meeting provided an overview of the current and future use of next generation sequencing (NGS), proteomics and other high-throughput technologies in the diagnostic setup of malignant, inherited and communicable diseases. In addition, considerable emphasis was placed on the potential use of these techniques for disease prognostication, patient stratification and monitoring responses to therapy. Specific topics included NGS-based diagnostics of solid tumors, hematological malignancies, inherited and infectious diseases, proteomic-based approaches for biomarker discovery, predictive biomarkers for personalized treatment strategies, and bioinformatic analysis of NGS data. The meeting also provided a unique platform for fruitful discussions between internationally recognized experts and young researchers from developing countries, providing new perspectives and ideas on broader implementation of these techniques for personalized management and care.

Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity

10.1101/2021.11.15.468652 ◽

2021 ◽

Author(s):

Benjamin Jaegle ◽

Luz Mayela Soto-Jimenez ◽

Robin Burns ◽

Fernando A. Rabanal ◽

Magnus Nordborg

Keyword(s):

Copy Number ◽

Structural Variation ◽

De Novo ◽

Sequencing Data ◽

Heterozygous Snps ◽

Mendelian Segregation ◽

Short Read ◽

Short Read Sequencing ◽

Snp Data ◽

Number Variation

Background: It is becoming apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Re-calling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million heterozygous SNPs (44% of total). Given that Arabidopsis thaliana (A. thaliana) is highly selfing, we hypothesized that these SNPs reflected cryptic copy number variation, and investigated them further. Results: While genuine heterozygosity should occur in tracts within individuals, heterozygosity at a particular locus is instead shared across individuals in a manner that strongly suggests it reflects segregating duplications rather than actual heterozygosity. Focusing on pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts, and suggests that great caution is needed when analysing SNP data from short-read sequencing. The finding that 10% of annotated genes are copy-number variable, and the realization that neither gene nor transposon annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.
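
The observation that heterozygous calls at a locus are shared across individuals can be illustrated with a small Python sketch. It is not the authors' analysis: the genotype encoding (0, 1, 2 alternate alleles, None for a missing call), the locus names, and the 0.05 cut-off are all assumptions chosen for illustration.

```python
def het_fraction(genotypes):
    """Fraction of called individuals that are heterozygous at one locus.

    Genotypes are alternate-allele counts (0, 1, 2); None means no call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    return sum(1 for g in called if g == 1) / len(called)


def flag_shared_het(calls, min_fraction=0.05):
    """Return loci whose heterozygous calls recur across individuals.

    `min_fraction` is an arbitrary illustrative threshold, not a value
    taken from the study.
    """
    return sorted(
        locus for locus, gts in calls.items()
        if het_fraction(gts) >= min_fraction
    )


# Made-up genotype calls for five individuals at three loci.
calls = {
    "locusA": [1, 1, 0, 1, None],  # heterozygous in 3 of 4 called individuals
    "locusB": [0, 0, 0, 0, 0],     # no heterozygous calls
    "locusC": [2, 2, 0, 0, 2],     # homozygous differences only
}
flagged = flag_shared_het(calls)
```

In a highly selfing species, a locus such as locusA in this toy input would be a candidate for pseudo-heterozygosity caused by a duplication, which would then need to be confirmed by other evidence.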

A bioinformatic pipeline for NGS data analysis and mutation calling in human solid tumors

Biomeditsinskaya Khimiya ◽

10.18097/pbmc20176305413 ◽

2017 ◽

Vol 63 (5) ◽

pp. 413-417 ◽

Cited By ~ 1

Author(s):

K.Yu. Tsukanov ◽

A.Yu. Krasnenko ◽

D.A. Plakhina ◽

D.O. Korostin ◽

A.V. Churov ◽

...

Keyword(s):

Data Analysis ◽

Solid Tumors ◽

Protein Function ◽

Bioinformatic Analysis ◽

Illumina Hiseq ◽

Wide Range ◽

Cancer Tumor ◽

Human Solid Tumors ◽

Ngs Data Analysis ◽

Ngs Data

We aimed to develop a pipeline for the bioinformatic analysis and interpretation of NGS data and detection of a wide range of single-nucleotide somatic mutations within tumor DNA. Initially, the NGS reads were submitted to a quality control check by the Cutadapt program. Low-quality 3′-nucleotides were removed. After that, the reads were mapped to the reference genome hg19 (GRCh37.p13) by BWA. The SAMtools program was used for exclusion of duplicates. MuTect was used for SNV calling. The functional effect of SNVs was evaluated using the algorithm, including annotation and evaluation of SNV pathogenicity by SnpEff and analysis of such databases as COSMIC, dbNSFP, Clinvar, and OMIM. The effect of SNV on the protein function was estimated by SIFT and PolyPhen2. Mutation frequencies were obtained from 1000 Genomes and ExAC projects, as well as from our own databases with frequency data. In order to evaluate the pipeline we used 18 breast cancer tumor biopsies. The MYbaits Onconome KL v1.5 Panel (“MYcroarray”) was used for targeted enrichment. NGS was performed on the Illumina HiSeq 2500 platform. As a result, we identified alterations in BRCA1, BRCA2, ATM, CDH1, CHEK2, TP53 genes that affected the sequence of encoded proteins. Our pipeline can be used for effective search and annotation of tumor SNVs. In this study, for the first time, we have tested this pipeline for NGS data analysis of samples from patients of the Russian population. However, further confirmation of efficiency and accuracy of the pipeline is required on NGS data from larger datasets as well as data from several types of solid tumors.
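
The order of the processing stages named in this abstract can be written down as a small Python structure. This is only a descriptive sketch: the Stage class and describe helper are hypothetical, and no command lines or tool options are given because the abstract does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step: the tool named in the abstract and its role."""
    tool: str
    purpose: str


# Stages in the order the abstract lists them.
PIPELINE = [
    Stage("Cutadapt", "quality control; remove low-quality 3' nucleotides"),
    Stage("BWA", "map reads to reference genome hg19 (GRCh37.p13)"),
    Stage("SAMtools", "exclude duplicates"),
    Stage("MuTect", "call SNVs"),
    Stage("SnpEff", "annotate SNVs and evaluate pathogenicity"),
    Stage("SIFT / PolyPhen2", "estimate effect on protein function"),
]


def describe(pipeline):
    """Render the pipeline as numbered, human-readable lines."""
    return [
        f"{i}. {stage.tool}: {stage.purpose}"
        for i, stage in enumerate(pipeline, start=1)
    ]


for line in describe(PIPELINE):
    print(line)
```

Keeping the stage list as data makes the ordering explicit: duplicate exclusion happens after mapping and before SNV calling, and annotation only runs on called variants.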

miPIE: NGS-based Prediction of miRNA Using Integrated Evidence

10.1101/405357 ◽

2018 ◽

Author(s):

R.J. Peace ◽

M. Sheikh Hassani ◽

J.R. Green

Keyword(s):

De Novo ◽

Genomic Sequence ◽

Prediction Performance ◽

Data Sets ◽

Mirna Prediction ◽

Individual Contributions ◽

The Individual ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Generation Sequencing

Abstract Methods for the de novo identification of microRNA (miRNA) have been developed using a range of sequence-based features. With the increasing availability of next generation sequencing (NGS) transcriptome data, there is a need for miRNA identification that integrates both NGS transcript expression-based patterns as well as advanced genomic sequence-based methods. While miRDeep2 does examine the predicted secondary structure of putative miRNA sequences, it does not leverage many of the sequence-based features used in state-of-the-art de novo methods. Meanwhile, other NGS-based methods, such as miRanalyzer, place an emphasis on sequence-based features without leveraging advanced expression-based features reflecting miRNA biosynthesis. This represents an opportunity to combine the strengths of NGS-based analysis with recent advances in de novo sequence-based miRNA prediction. We here develop a method, microRNA Prediction using Integrated Evidence (miPIE), which integrates both expression-based and sequence-based features to achieve significantly improved miRNA prediction performance. Feature selection identifies the 20 most discriminative features, 3 of which reflect strictly expression-based information. Evaluation using precision-recall curves, for six NGS data sets representing six diverse species, demonstrates substantial improvements in prediction performance compared to miRDeep2 and miRanalyzer. The individual contributions of expression-based and sequence-based features are also examined and we demonstrate that their combination is more effective than either alone.
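
The evaluation mentioned here relies on precision-recall curves. As a minimal sketch of how such a curve is computed from classifier scores (not the miPIE evaluation code; the scores and labels below are made up), each distinct score is used as a threshold and precision and recall are recorded at that threshold:

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision) points, one per distinct score threshold.

    `scores` are classifier outputs (higher means more likely positive);
    `labels` are 1 for true miRNA and 0 otherwise. Thresholds are visited
    from the highest score downwards, and tied scores are handled together.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every candidate scoring exactly at this threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Made-up scores and labels for four candidate sequences.
points = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Comparing two predictors then amounts to comparing their curves: at a given recall, the method with higher precision reports fewer false miRNA candidates.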

Genomic Analysis of the Evolution of Fluoroquinolone Resistance in Mycobacterium tuberculosis Prior to Tuberculosis Diagnosis

Antimicrobial Agents and Chemotherapy ◽

10.1128/aac.00664-16 ◽

2016 ◽

Vol 60 (11) ◽

pp. 6600-6608 ◽

Cited By ~ 10

Author(s):

Danfeng Zhang ◽

James E. Gomez ◽

Jung-Yien Chien ◽

Nathan Haseley ◽

Christopher A. Desjardins ◽

...

Keyword(s):

Mycobacterium Tuberculosis ◽

De Novo ◽

Genomic Analysis ◽

Fluoroquinolone Resistance ◽

Drug Resistant ◽

Content Type ◽

Resistant Tuberculosis ◽

History Of ◽

Next Generation Sequencing Ngs ◽

Ngs Data

ABSTRACT Fluoroquinolones (FQs) are effective second-line drugs for treating antibiotic-resistant tuberculosis (TB) and are being considered for use as first-line agents. Because FQs are used to treat a range of infections, in a setting of undiagnosed TB, there is potential to select for drug-resistant Mycobacterium tuberculosis mutants during FQ-based treatment of other infections, including pneumonia. Here we present a detailed characterization of ofloxacin-resistant M. tuberculosis samples isolated directly from patients in Taiwan, which demonstrates that selection for FQ resistance can occur within patients who have not received FQs for the treatment of TB. Several of these samples showed no mutations in gyrA or gyrB based on PCR-based molecular assays, but genome-wide next-generation sequencing (NGS) revealed minority populations of gyrA and/or gyrB mutants. In other samples with PCR-detectable gyrA mutations, NGS revealed subpopulations containing alternative resistance-associated genotypes. Isolation of individual clones from these apparently heterogeneous samples confirmed the presence of the minority drug-resistant variants suggested by the NGS data. Further NGS of these purified clones established evolutionary links between FQ-sensitive and -resistant clones derived from the same patient, suggesting de novo emergence of FQ-resistant TB. Importantly, most of these samples were isolated from patients without a history of FQ treatment for TB. Thus, selective pressure applied by FQ monotherapy in the setting of undiagnosed TB infection appears to be able to drive the full or partial emergence of FQ-resistant M. tuberculosis, which has the potential to confound diagnostic tests for antibiotic susceptibility and limit the effectiveness of FQs in TB treatment.
