Improved polygenic prediction by Bayesian multiple regression on summary statistics

Luke R. Lloyd-Jones; Jian Zeng; Julia Sidorenko; Loïc Yengo; Gerhard Moser; Kathryn E. Kemper; Huanwei Wang; Zhili Zheng; Reedik Magi; Tõnu Esko; Andres Metspalu; Naomi R. Wray; Michael E. Goddard; Jian Yang; Peter M. Visscher

doi:10.1038/s41467-019-12653-0

Nature Communications; 10.1038/s41467-019-12653-0; 2019; Vol 10 (1); Cited By ~ 34

Author(s): Luke R. Lloyd-Jones; Jian Zeng; Julia Sidorenko; Loïc Yengo; Gerhard Moser; ...

Keyword(s): Multiple Regression; Association Studies; Meta Analysis; Multiple Regression Model; Data Sets; Genome Wide Association Studies; Summary Statistics; Individual Level; Level Data; The UK

Abstract Accurate prediction of an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that, on average across traits and two independent data sets, SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.


Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

10.1101/042457; 2016; Cited By ~ 5

Author(s): Xiang Zhu; Matthew Stephens

Keyword(s): Multiple Regression; Large Scale; Association Studies; Genome Wide Association; Genome Wide Association Studies; Summary Statistics; Individual Level; Genome Wide; Level Data; Wide Range

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
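The link between univariate results and multiple regression coefficients can be illustrated with a small simulation. The sketch below is not the RSS software or its likelihood; it only checks, under the assumption of standardized covariates and a hypothetical 3-SNP correlation matrix, that marginal (one-SNP-at-a-time) slopes are approximately the correlation matrix times the joint effects. The function names `marginal_from_joint` and `simulate` are made up for this illustration.

```python
import numpy as np

def marginal_from_joint(R, beta):
    """Expected marginal effects for standardized covariates: b = R @ beta."""
    return R @ beta

def simulate(n=20000, seed=0):
    """Simulate correlated covariates and return (R, joint effects, marginal estimates)."""
    rng = np.random.default_rng(seed)
    # Hypothetical correlation (LD-like) matrix among three covariates.
    R = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), R, size=n)
    # Standardize each column to mean 0 and variance 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    beta = np.array([0.3, 0.0, -0.2])   # joint (multiple regression) effects
    y = X @ beta + rng.normal(size=n)
    # Univariate least-squares slope for each standardized column.
    b_hat = X.T @ y / n
    return R, beta, b_hat

if __name__ == "__main__":
    R, beta, b_hat = simulate()
    print("marginal estimates:", np.round(b_hat, 3))
    print("R @ beta:          ", np.round(marginal_from_joint(R, beta), 3))
```

Note that the second covariate has no joint effect yet still shows a non-zero marginal slope, which is the kind of correlation-induced signal a summary-statistics method has to untangle.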


Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction

10.1101/2020.11.27.401141; 2020

Author(s): Clara Albiñana; Jakob Grove; John J. McGrath; Esben Agerbo; Naomi R. Wray; ...

Keyword(s): Association Studies; Meta Analysis; Training Sample; Risk Scores; Large Individual; Genome Wide Association Studies; Summary Statistics; UK Biobank; Individual Level; Level Data

Abstract The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWAS). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how to best combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (Meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using twelve real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare Meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and Meta-PRS. We find that, when large individual-level data is available, the linear combination of PRSs (Meta-PRS) is both a simple alternative to Meta-GWAS and often more accurate.
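A linear combination of PRSs can be sketched in a few lines. The example below assumes the combination weights are fitted by ordinary least squares on a training sample; the abstract does not say how the weights are chosen, so this is an assumption and not necessarily the authors' procedure. The function names are hypothetical.

```python
import numpy as np

def fit_meta_prs(prs_matrix, y):
    """Fit an intercept and one weight per PRS column by least squares."""
    design = np.column_stack([np.ones(len(y)), prs_matrix])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def apply_meta_prs(prs_matrix, coef):
    """Combine PRS columns into a single score using fitted weights."""
    return coef[0] + prs_matrix @ coef[1:]

def r_squared(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two toy PRS columns, e.g. one from external summary statistics and
    # one trained on individual-level data (both simulated here).
    prs = rng.normal(size=(500, 2))
    y = 0.4 * prs[:, 0] + 0.6 * prs[:, 1] + rng.normal(size=500)
    coef = fit_meta_prs(prs[:250], y[:250])          # training half
    pred = apply_meta_prs(prs[250:], coef)           # held-out half
    print("held-out R^2:", round(r_squared(y[250:], pred), 3))
```

In practice the weights would be fitted in one sample and evaluated in an independent one, as in the usage example, to avoid overstating accuracy.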


Estimating Heritability and Genetic Correlation in Case Control Studies Directly and with Summary Statistics

10.1101/256388; 2018

Author(s): Omer Weissbrod; Jonathan Flint; Saharon Rosset

Keyword(s): Genetic Correlation; Association Studies; Genetic Correlations; Large Data; Case Control; Data Sets; Genome Wide Association Studies; Summary Statistics; Case Control Studies; Individual Level

Abstract Methods that estimate heritability and genetic correlations from genome-wide association studies have proven to be powerful tools for investigating the genetic architecture of common diseases and exposing unexpected relationships between disorders. Many relevant studies employ a case-control design, yet most methods are primarily geared towards analyzing quantitative traits. Here we investigate the validity of three common methods for estimating genetic heritability and genetic correlation. We find that the Phenotype-Correlation-Genotype-Correlation (PCGC) approach is the only method that can estimate both quantities accurately in the presence of important non-genetic risk factors, such as age and sex. We extend PCGC to work with summary statistics that take the case-control sampling into account, and demonstrate that our new method, PCGC-s, accurately estimates both heritability and genetic correlations and can be applied to large data sets without requiring individual-level genotypic or phenotypic information. Finally, we use PCGC-s to estimate the genetic correlation between schizophrenia and bipolar disorder, and demonstrate that previous estimates are biased due to incorrect handling of sex as a strong risk factor. PCGC-s is available at https://github.com/omerwe/PCGCs.


Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

10.1101/2020.10.20.20216358; 2020

Author(s): Ciarrah Barry; Junxi Liu; Rebecca Richmond; Martin K Rutter; Deborah A Lawlor; ...

Keyword(s): Mendelian Randomization; Association Studies; General Procedure; Meta Analysis; Genome Wide Association Studies; UK Biobank; Individual Level; Level Data; Summary Data; Collider Bias

Abstract Over the last decade the availability of SNP-trait associations from genome-wide association studies data has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well developed summary data methods to individual level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two sample summary data method using one sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. A weighted sum of these estimates is then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach is closely related to the work of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our paper serves to clarify that in any one sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.


metaCCA: Summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis

10.1101/022665; 2015; Cited By ~ 1

Author(s): Anna Cichonska; Juho Rousu; Pekka Marttinen; Antti J Kangas; Pasi Soininen; ...

Keyword(s): Correlation Analysis; Canonical Correlation Analysis; Canonical Correlation; Statistical Power; Association Studies; Meta Analysis; Original Data; Genome Wide Association Studies; Summary Statistics; Individual Level

A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analysing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies.


Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

PLoS Genetics; 10.1371/journal.pgen.1009703; 2021; Vol 17 (8); pp. e1009703

Author(s): Ciarrah Barry; Junxi Liu; Rebecca Richmond; Martin K. Rutter; Deborah A. Lawlor; ...

Keyword(s): Mendelian Randomization; Association Studies; General Procedure; Meta Analysis; Genome Wide Association Studies; UK Biobank; Individual Level; Level Data; Summary Data; Collider Bias

Over the last decade the availability of SNP-trait associations from genome-wide association studies has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well developed summary data methods to individual level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two sample summary data method using one sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. These estimates are then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach can be viewed as a generalization of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our work serves to clarify that in any one sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.
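For readers unfamiliar with two-sample summary-data Mendelian randomization, the sketch below shows one widely used estimator of that kind, the fixed-effect inverse-variance weighted (IVW) estimate. It is offered only as background: the abstract does not name a specific estimator, and this code does not implement the collider-bias correction the paper describes. The function name `ivw_estimate` and the toy numbers are hypothetical.

```python
import numpy as np

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_x: SNP-exposure associations
    beta_y: SNP-outcome associations
    se_y:   standard errors of the SNP-outcome associations
    """
    beta_x = np.asarray(beta_x, dtype=float)
    beta_y = np.asarray(beta_y, dtype=float)
    se_y = np.asarray(se_y, dtype=float)
    ratio = beta_y / beta_x            # per-variant Wald ratio estimate
    weights = beta_x**2 / se_y**2      # first-order inverse-variance weights
    estimate = np.sum(weights * ratio) / np.sum(weights)
    std_err = np.sqrt(1.0 / np.sum(weights))
    return estimate, std_err

if __name__ == "__main__":
    # Toy summary statistics for three variants (made-up values).
    bx = [0.10, 0.20, 0.30]
    by = [0.05, 0.11, 0.14]
    se = [0.01, 0.01, 0.02]
    est, err = ivw_estimate(bx, by, se)
    print(f"IVW estimate: {est:.3f} (SE {err:.3f})")
```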


LEP: A Statistical Method Integrating Individual-Level and Summary-Level Data of the Same Trait From Different Populations

Biomedical Informatics Insights; 10.1177/1178222619881624; 2019; Vol 11; pp. 117822261988162

Author(s): Mingwei Dai; Jin Liu; Can Yang

Keyword(s): Association Studies; Relevant Information; Joint Analysis; Data Sets; Genome Wide Association Studies; Statistical Efficiency; Individual Level; Level Data; Multiple Data Sets; Different Populations

Statistical approaches for integrating multiple data sets in genome-wide association studies (GWASs) are increasingly important. Proper utilization of more relevant information is expected to improve statistical efficiency in the analysis. Among these approaches, LEP was proposed for joint analysis of individual-level data and summary-level data in the same population by leveraging pleiotropy. The key idea of LEP is to explore correlation of the association status among different data sets while accounting for the heterogeneity. In this commentary, we show that LEP is applicable to integrate individual-level data and summary-level data of the same trait from different populations, providing new insights into the genetic architecture of different populations.


Cigarette smoking and personality: interrogating causality using Mendelian randomisation

Psychological Medicine; 10.1017/s0033291718003069; 2018; Vol 49 (13); pp. 2197-2205; Cited By ~ 1

Author(s): Hannah M. Sallis; George Davey Smith; Marcus R. Munafò

Keyword(s): Personality Traits; Association Studies; Smoking Initiation; Mendelian Randomisation; Genome Wide Association Studies; Individual Level; Causal Pathways; Genome Wide; Level Data; Causal Nature

Abstract Background Despite the well-documented association between smoking and personality traits such as neuroticism and extraversion, little is known about the potential causal nature of these findings. If it were possible to unpick the association between personality and smoking, it may be possible to develop tailored smoking interventions that could lead to both improved uptake and efficacy. Methods Recent genome-wide association studies (GWAS) have identified variants robustly associated with both smoking phenotypes and personality traits. Here we use publicly available GWAS summary statistics in addition to individual-level data from UK Biobank to investigate the link between smoking and personality. We first estimate genetic overlap between traits using LD score regression and then use bidirectional Mendelian randomisation methods to unpick the nature of this relationship. Results We found clear evidence of a modest genetic correlation between smoking behaviours and both neuroticism and extraversion. We found some evidence that personality traits are causally linked to certain smoking phenotypes: among current smokers each additional neuroticism risk allele was associated with smoking an additional 0.07 cigarettes per day (95% CI 0.02–0.12, p = 0.009), and each additional extraversion effect allele was associated with an elevated odds of smoking initiation (OR 1.015, 95% CI 1.01–1.02, p = 9.6 × 10⁻⁷). Conclusion We found some evidence for specific causal pathways from personality to smoking phenotypes, and weaker evidence of an association from smoking initiation to personality. These findings could be used to inform future smoking interventions or to tailor existing schemes.


Gene-based analysis of ADHD using PASCAL: a biological insight into the novel associated genes

BMC Medical Genomics; 10.1186/s12920-019-0593-5; 2019; Vol 12 (1)

Author(s): Aitana Alonso-Gonzalez; Manuel Calaza; Cristina Rodriguez-Fontenla; Angel Carracedo

Keyword(s): Gene Network; Association Studies; Meta Analysis; Neurodevelopmental Disorder; Differentially Expressed Gene; Brain Regions; Genome Wide Association Studies; Summary Statistics; Biological Insight; Insight Into

Abstract Background Attention-Deficit Hyperactivity Disorder (ADHD) is a complex neurodevelopmental disorder (NDD) which may significantly impact on the affected individual’s life. ADHD is acknowledged to have a high heritability component (70–80%). Recently, a meta-analysis of GWAS (Genome Wide Association Studies) has demonstrated the association of several independent loci. Our main aim here is to apply PASCAL (pathway scoring algorithm), a new gene-based analysis (GBA) method, to the summary statistics obtained in this meta-analysis. PASCAL takes into account the linkage disequilibrium (LD) across genomic regions in a different way than the most commonly employed GBA methods (MAGMA or VEGAS (Versatile Gene-based Association Study)). In addition to PASCAL analysis, a gene network and an enrichment analysis for KEGG and GO terms were carried out. Moreover, the GENE2FUNC tool was employed to create gene expression heatmaps and to carry out a Differentially Expressed Gene (DEG) analysis using GTEX v7 and BrainSpan data. Results PASCAL results have revealed the association of new loci with ADHD and have also highlighted other genes previously reported by MAGMA analysis. PASCAL was able to discover new associations at a gene level for ADHD: FEZF1 (p-value: 2.2 × 10⁻⁷) and FEZF1-AS1 (p-value: 4.58 × 10⁻⁷). In addition, PASCAL has been able to highlight association of other genes that share the same LD block with some previously reported ADHD susceptibility genes. Gene network analysis has revealed several interactors with the associated ADHD genes and different GO and KEGG terms have been associated. In addition, GENE2FUNC has demonstrated the existence of several up and down regulated expression clusters when the associated genes and their interactors were considered. Conclusions PASCAL has been revealed as an efficient tool to extract additional information from previous GWAS using their summary statistics. This study has identified novel ADHD associated genes that were not previously reported when other GBA methods were employed. Moreover, a biological insight into the function of the ADHD associated genes across brain regions and neurodevelopmental stages is provided.


metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis

Bioinformatics; 10.1093/bioinformatics/btw052; 2016; Vol 32 (13); pp. 1981-1989; Cited By ~ 66

Author(s): Anna Cichonska; Juho Rousu; Pekka Marttinen; Antti J. Kangas; Pasi Soininen; ...

Keyword(s): Correlation Analysis; Canonical Correlation Analysis; Canonical Correlation; Association Studies; Meta Analysis; Genome Wide Association; Genome Wide Association Studies; Summary Statistics; Genome Wide
