Essentiality-specific pathogenicity prioritization gene score to improve filtering of disease sequence data

Author(s):  
Dareen Alyousfi ◽  
Diana Baralle ◽  
Andrew Collins

Abstract: The causal genetic variants underlying more than 50% of single-gene (monogenic) disorders are yet to be discovered. Many patients with conditions likely to have a monogenic basis do not receive a confirmed molecular diagnosis, which can affect clinical management. We have developed a gene-specific score, essentiality-specific pathogenicity prioritization (ESPP), to guide the recognition of genes likely to underlie monogenic disease variation and so assist in filtering genome sequence data. When a patient genome is sequenced, several plausibly pathogenic variants are frequently identified in different genes. Recognition of the single gene most likely to include pathogenic variation can guide the identification of a causal variant. The ESPP score integrates gene-level scores which are broadly related to gene essentiality. Previous work towards the recognition of monogenic disease genes proposed a model with increasing gene essentiality from ‘non-essential’ to ‘essential’ genes (for which pathogenic variation may be incompatible with survival), with genes liable to contain disease variation positioned between these two extremes. We demonstrate that the ESPP score is useful for recognizing genes with high potential for pathogenic disease-related variation. Genes classed as essential have particularly high scores, as do genes recently recognized as strong candidates for developmental disorders. Through the integration of individual gene-specific scores, which have different properties and assumptions, we demonstrate the utility of an essentiality-based gene score to improve genome sequence filtering.
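The abstract does not say how the individual gene-level scores are combined. As a rough illustration only, the sketch below integrates several per-gene scores by averaging their percentile ranks. This is an assumed scheme, not the authors' ESPP method, and the gene names, score values, and function names are all hypothetical.

```python
from statistics import mean


def percentile_ranks(scores):
    """Map each gene to a rank-based percentile in [0, 1].

    Assumes a higher raw score means 'more essential'.
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {gene: i / (n - 1) for i, gene in enumerate(ordered)}


def combined_score(component_scores):
    """Average percentile ranks over genes present in every component score.

    NOTE: mean-of-ranks is an assumption made for illustration; the
    abstract does not describe the actual integration procedure.
    """
    ranked = [percentile_ranks(s) for s in component_scores]
    shared = set.intersection(*(set(r) for r in ranked))
    return {gene: mean(r[gene] for r in ranked) for gene in shared}


# Hypothetical component scores on different scales.
score_a = {"GENE_A": 0.1, "GENE_B": 0.5, "GENE_C": 0.9}
score_b = {"GENE_A": 2.0, "GENE_B": 8.0, "GENE_C": 5.0}
combined = combined_score([score_a, score_b])
```

Ranking before averaging puts scores with different scales and distributions on a common footing, which is one simple way to combine scores that have "different properties and assumptions".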

2021 ◽  
Author(s):  
Lynn Pais ◽  
Hana Snow ◽  
Ben Weisburd ◽  
Shifa Zhang ◽  
Samantha Baxter ◽  
...  

Exome and genome sequencing have become the tools of choice for rare disease diagnosis, leading to large amounts of data available for analyses. To identify causal variants in these datasets, powerful filtering and decision support tools that can be efficiently used by clinicians and researchers are required. To address this need, we developed seqr - an open source, web-based tool for family-based monogenic disease analysis that allows researchers to work collaboratively to search and annotate genomic callsets. To date, seqr is being used in several research pipelines and one clinical diagnostic lab. In our own experience through the Broad Institute Center for Mendelian Genomics, seqr has enabled analyses of over 10,000 families, supporting the diagnosis of more than 3,800 individuals with rare disease and discovery of over 300 novel disease genes. Here we describe a framework for genomic analysis in rare disease that leverages seqr's capabilities for variant filtration, annotation, and causal variant identification, as well as support for research collaboration and data sharing. The seqr platform is available as open source software, allowing low-cost participation in rare disease research, and a community effort to support diagnosis and gene discovery in rare disease.


2013 ◽  
Vol 9 (5) ◽  
pp. e1003073 ◽  
Author(s):  
Wei-Hua Chen ◽  
Xing-Ming Zhao ◽  
Vera van Noort ◽  
Peer Bork

2017 ◽  
Vol 5 (3) ◽  
pp. 89-98
Author(s):  
Moses J. Kiryowa ◽  
Aston Ebinu ◽  
Vincent Kyaligonza ◽  
Stanley T. Nkalubo ◽  
Pamela Paparu ◽  
...  

Colletotrichum lindemuthianum is a highly variable pathogen of common beans that easily overcomes resistance in cultivars bred with single-gene resistance. To determine pathogenic variability of the pathogen in Uganda, samples of common bean tissues with anthracnose symptoms were collected in eight districts of Uganda, namely Kabarole, Sironko, Mbale, Oyam, Lira, Kapchorwa, Maracha and Kisoro. Fifty-one isolates sporulated successfully on Potato Dextrose Agar and Mathur’s media and were used to inoculate 12 differential cultivars under controlled conditions. Five plants per cultivar were inoculated with each isolate and then evaluated for their reaction using the 1–9 severity scale. Races were classified using the binary nomenclature system proposed by Pastor-Corrales (1991). Variation due to cultivar and isolate effects was significant (P≤0.001) for severity. The 51 isolates from eight districts grouped into 27 different races. Sironko district had the highest number of races, followed by Mbale and Kabarole. Races 2047 and 4095 were the most frequently found, each with 10 isolates grouped under them. Race 4095 was the most virulent since it caused a susceptible (S) reaction on all 12 differential cultivars and the susceptible check. This was followed by races 2479, 2047 and 2045, respectively. Two races, 4094 and 2479, caused a susceptible reaction on the differential cultivar G2333, which nevertheless showed the broadest-spectrum resistance, followed by cultivars Cornell 49-242, TU, and AB136, respectively. These cultivars are recommended for use in breeding programs aiming for broad-spectrum resistance to bean anthracnose in Uganda.
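In the binary nomenclature system, the differential cultivar at position n of the ordered set carries the value 2^(n-1), and a race number is the sum of the values of the cultivars that show a susceptible reaction. The sketch below encodes and decodes race numbers for a 12-cultivar set. The function names are illustrative, and it assumes G2333 occupies position 12 (value 2048), which matches the abstract's statement that races 4094 and 2479 both overcome it.

```python
def race_number(positions):
    """Sum 2**(p - 1) over the 1-based positions of susceptible differentials."""
    return sum(2 ** (p - 1) for p in set(positions))


def susceptible_positions(race, n_differentials=12):
    """Return the 1-based positions of differentials that a race overcomes."""
    if not 0 <= race < 2 ** n_differentials:
        raise ValueError("race number out of range for this differential set")
    return [p for p in range(1, n_differentials + 1) if race & (1 << (p - 1))]


# Race 4095 is susceptible on all 12 differentials: 2**12 - 1 = 4095.
all_twelve = race_number(range(1, 13))

# Assumption: position 12 (value 2048) is G2333.
G2333_POSITION = 12
overcomes_g2333 = {
    race: G2333_POSITION in susceptible_positions(race)
    for race in (4095, 4094, 2479, 2047, 2045)
}
```

Under this encoding, races 2047 and 2045 are both below 2048, so neither can include the position-12 differential, whereas 4095, 4094 and 2479 all do.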


2021 ◽  
Author(s):  
Thabo Michael Yates ◽  
Antoine Lain ◽  
Jamie Campbell ◽  
T. Ian Simpson ◽  
David R FitzPatrick

There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the widespread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whether the association of a filtered variant at a specific locus is a plausible explanation of the phenotype in the proband is crucial and commonly requires extensive manual literature review by both clinical scientists and clinicians. Access to a database of weighted clinical features extracted from rigorously curated literature would increase the efficiency of this process and facilitate the development of robust phenotypic similarity metrics. However, given the large and rapidly increasing volume of published information, conventional biocuration approaches are becoming impractical. Here, we present a scalable, automated method for extraction of categorical phenotypic descriptors from full-text literature. Papers identified through literature review were downloaded and parsed using the Cadmus custom retrieval package. Human Phenotype Ontology terms were extracted using MetaMap, with 76-83% precision and 72-81% recall. Mean terms per paper increased from 9 in title + abstract to 69 using full text. We demonstrate that these literature-derived disease models plausibly reflect true disease expressivity more accurately than gold-standard manually curated models, through comparison with prospectively gathered data from the Deciphering Developmental Disorders study. AUC for ROC curves increased by 5-10% through use of literature-derived models. This work shows that scalable automated literature curation increases performance and adds weight to the need for this strategy to be integrated into informatic variant analysis pipelines.
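The precision and recall figures quoted above compare automatically extracted terms against a reference annotation. As a minimal sketch of how such figures are computed for one paper, the example below scores a set of extracted Human Phenotype Ontology identifiers against a curated set. The term sets are made up for illustration, and the function is not part of Cadmus or MetaMap.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for extracted terms against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Illustrative term sets for a single hypothetical paper.
extracted_terms = {"HP:0001250", "HP:0000252", "HP:0001263", "HP:0000478"}
curated_terms = {"HP:0001250", "HP:0000252", "HP:0001263", "HP:0001249", "HP:0000707"}

precision, recall = precision_recall(extracted_terms, curated_terms)
```

Here three of the four extracted terms appear in the curated set (precision 0.75), and three of the five curated terms were recovered (recall 0.6).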


2020 ◽  
Vol 11 ◽  
Author(s):  
Alejandro Abdala Asbun ◽  
Marc A. Besseling ◽  
Sergio Balzano ◽  
Judith D. L. van Bleijswijk ◽  
Harry J. Witte ◽  
...  

Marker gene sequencing of the rRNA operon (16S, 18S, ITS) or cytochrome c oxidase I (CO1) is a popular means to assess microbial communities of the environment, microbiomes associated with plants and animals, as well as communities of multicellular organisms via environmental DNA sequencing. Since this technique is based on sequencing a single gene, or even only parts of a single gene rather than the entire genome, the number of reads needed per sample to assess the microbial community structure is lower than that required for metagenome sequencing. This makes marker gene sequencing affordable to nearly any laboratory. Despite the relative ease and cost-efficiency of data generation, analyzing the resulting sequence data requires computational skills that may go beyond the standard repertoire of a current molecular biologist/ecologist. We have developed Cascabel, a scalable, flexible, and easy-to-use amplicon sequence data analysis pipeline, which uses Snakemake and a combination of existing and newly developed solutions for its computational steps. Cascabel takes the raw data as input and delivers a table of operational taxonomic units (OTUs) or Amplicon Sequence Variants (ASVs) in BIOM and text format, together with representative sequences. Cascabel is highly versatile software that allows users to customize several steps of the pipeline, such as selecting from a set of OTU clustering methods or performing ASV analysis. In addition, we designed Cascabel to run in any Linux/Unix computing environment, from desktop computers to computing servers, making use of parallel processing if possible. The analyses and results are fully reproducible and documented in an HTML and optional PDF report. Cascabel is freely available at GitHub: https://github.com/AlejandroAb/CASCABEL.


2020 ◽  
Vol 48 (16) ◽  
pp. e91-e91
Author(s):  
Yatish Turakhia ◽  
Heidi I Chen ◽  
Amir Marcovitz ◽  
Gill Bejerano

Abstract: Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (amino acid deletions and substitutions) and sister species support as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using human as reference, we discovered over 400 unique human ortholog erosion events across 58 mammals. This includes dozens of clade-specific losses of genes that result in early mouse lethality or are associated with severe human congenital diseases. Our discoveries yield intriguing potential for translational medical genetics and evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.


Parasitology ◽  
2011 ◽  
Vol 138 (13) ◽  
pp. 1760-1777 ◽  
Author(s):  
LAURA M. McDONAGH ◽  
JAMIE R. STEVENS

SUMMARY: The Calliphoridae include some of the most economically significant myiasis-causing flies in the world – blowflies and screwworm flies – with many being notorious for their parasitism of livestock. However, despite more than 50 years of research, key taxonomic relationships within the family remain unresolved. This study utilizes nucleotide sequence data from the protein-coding genes COX1 (mitochondrial) and EF1α (nuclear), and the 28S rRNA (nuclear) gene, from 57 blowfly taxa to improve resolution of key evolutionary relationships within the family Calliphoridae. Bayesian phylogenetic inference was carried out for each single-gene data set, demonstrating significant topological differences between the three gene trees. Nevertheless, all gene trees supported a Calliphorinae-Luciliinae subfamily sister-lineage, with respect to Chrysomyinae. In addition, this study elucidates the taxonomic and evolutionary status of several less well-studied groups, including the genus Bengalia (either within Calliphoridae or as a separate sister-family), genus Onesia (as a sister-genus to, or sub-genus within, Calliphora), genus Dyscritomyia and Lucilia bufonivora, a specialised parasite of frogs and toads. The occurrence of cross-species hybridisation within Calliphoridae is also further explored, focusing on the two economically significant species Lucilia cuprina and Lucilia sericata. In summary, this study represents the most comprehensive molecular phylogenetic analysis of family Calliphoridae undertaken to date.


2007 ◽  
Vol 05 (06) ◽  
pp. 1155-1172 ◽  
Author(s):  
BRIAN M. O'LEARY ◽  
STEVEN G. DAVIS ◽  
MICHAEL F. SMITH ◽  
BARTLEY BROWN ◽  
MATHEW B. KEMP ◽  
...  

When searching for disease-causing mutations with polymerase chain reaction (PCR)-based methods, candidate genes are usually screened in their entirety, exon by exon. Genomic resources (e.g. www.ncbi.nih.gov, www.ensembl.org, and genome.ucsc.edu) largely support this paradigm for mutation screening by making it easy to view and access sequence data associated with genes in their genomic context. However, the administrative burden of conducting mutation screening in potentially hundreds of genes and thousands of exons in thousands of patients is significant, even with the use of public genome resources. For example, the manual design of oligonucleotide primers for all exons of the 10 Leber's congenital amaurosis (LCA) genes (149 exons) represents a significant information management challenge. The Transcript Annotation Prioritization and Screening System (TrAPSS) is designed to accelerate mutation screening by (1) providing a gene-based local cache of candidate disease genes in a genomic context, (2) automating tasks associated with optimizing candidate disease gene screening and information management, and (3) implementing an algorithmic technique that uses large amounts of heterogeneous genome annotation (e.g. conserved protein functional domains) to prioritize candidate genes.

