Abstract
Background: The genome-wide association study (GWAS) is a common tool to identify genetic variants associated with complex traits, including psychiatric disorders (PDs). However, post-GWAS analyses are needed to extend the statistical inference to biologically relevant entities, e.g., genes, proteins, and pathways. To achieve this goal, researchers developed methods that incorporate biologically relevant intermediate molecular phenotypes, such as gene expression and protein abundance, which are posited to mediate the variant-trait association. Transcriptome-wide association study (TWAS) and proteome-wide association study (PWAS) are commonly used methods to test the association between these molecular mediators and the trait. Summary: In this review, we discuss the most recent developments in TWAS and PWAS. These methods integrate existing “omic” information with the GWAS summary statistics for trait(s) of interest. Specifically, they impute transcript/protein data and test the association between imputed gene expression/protein level with phenotype of interest by using (i) GWAS summary statistics and (ii) reference transcriptomic/proteomic/genomic datasets. TWAS and PWAS are suitable as analysis tools for (i) primary association scan and (ii) fine-mapping to identify potentially causal genes for PDs. Key Messages: As post-GWAS analyses, TWAS and PWAS have the potential to highlight causal genes for PDs. These prioritized genes could indicate targets for the development of novel drug therapies. For researchers attempting such analyses, we recommend Mendelian randomization tools that use GWAS statistics for both trait and reference datasets, e.g., summary Mendelian randomization (SMR). We base our recommendation on (i) being able to use the same tool for both TWAS and PWAS, (ii) not requiring the pre-computed weights (and thus easier to update for larger reference datasets), and (iii) most larger transcriptome reference datasets are publicly available and easy to transform into a compatible format for SMR analysis.
Introduction
Genome-wide association study (GWAS) is a statistical framework for univariate testing of the association between a trait outcome and genetic variants, such as single nucleotide polymorphisms (SNPs), among the thousands or millions of genotyped variants in a genome-wide scan (see Fig. 1 for definitions of key terms). GWAS signals are often found in non-coding segments of the human genome [1]. In certain genomic regions with these signals, large correlations are detected between variants (due to neighboring allele co-segregation). This is commonly known as linkage disequilibrium (LD) (Fig. 2). As a result, significant GWAS signals are observed across multiple genes (or loci). Therefore, it is not straightforward to infer a biologically informative association between these signals and the trait [2]. Consequently, post-GWAS methods are needed to connect variant signals to genes using biologically relevant mediators like gene transcript and protein levels. The enrichment of GWAS signals among variants that influence transcript/protein levels provides evidence that these molecular measures mediate the association of variant with a trait [3, 4]. Researchers can better understand the underlying etiology of disorders by integrating transcriptome and genome data. This enables them to prioritize biologically relevant GWAS signals associated with the trait. Prioritized genes would ultimately indicate molecular targets for future therapeutic intervention or clinical diagnosis/prognosis in common diseases [5, 6].
Triangular matrix with color shades showing the strength of the LD (r2, square of pair-wise Pearson’s correlation coefficient). Darker colors denote higher LD between SNP pairs in the designated locus.
Triangular matrix with color shades showing the strength of the LD (r2, square of pair-wise Pearson’s correlation coefficient). Darker colors denote higher LD between SNP pairs in the designated locus.
Loci associated with transcript levels are expression quantitative trait loci (eQTLs) and those associated with protein levels are protein QTLs (pQTLs). Genes with at least one eQTL are expression genes (eGenes). Large consortia have discovered eQTLs and pQTLs for various tissues/fluids by testing the correlation between genetic variants and transcript [7, 8] and protein levels [9‒12]. Henceforth, we refer to this association data from eQTL/pQTL studies as reference transcriptome/proteome data.
By combining GWAS results with information on eQTLs (variant weights derived from reference data), it is possible to impute gene expression [13‒15] and thus researchers can perform a gene-based association scan between trait and predicted transcript level. Such an approach is known as a transcriptome-wide association study (TWAS). It is a widely used method to uncover potentially causal loci and genes that were not previously identified as significant by GWAS [16, 17].
Although profiling the human proteome is not yet as advanced as DNA and RNA sequencing, it has recently made notable progress in terms of scaling up the number of analytes [18]. Reference proteome data in population-based cohorts have enabled the discovery of pQTLs, e.g., in human blood plasma [4] and brain tissues [19]. Similar to the TWAS paradigm, proteome-wide association studies (PWASs) are conducted by integrating GWAS summary statistics and reference pQTL information [20‒22] to test the association between trait and protein abundance (Fig. 3a). These methods simply test the association between trait and predicted gene expression/protein levels (Fig. 3b).
a Associations among omics of biological complexity at molecular level. b TWAS and PWAS aim to infer a potential causal path between the predicted transcript/protein level and the risk for disease.
a Associations among omics of biological complexity at molecular level. b TWAS and PWAS aim to infer a potential causal path between the predicted transcript/protein level and the risk for disease.
To help researchers uncover the molecular basis of GWAS signals, we discuss the TWAS/PWAS methods and their associated transcriptome/proteome reference data. The main text of the review is divided into three sections: GWAS, TWAS, and PWAS. First, we briefly review GWAS, its shortcomings, and the need for post-GWAS analyses. Second, we lay out the need for biologically informative mediators (transcriptome and proteome data) that are empirically supported in the literature. We provide details on various transcriptome and proteome reference datasets as well as TWAS methods, which are also used in PWAS. Finally, we discuss the limitations and possible future improvements of the TWAS and PWAS approaches.
Methods to Link Genome, Transcriptome, and Proteome Data to Phenotypes
Genome-Wide Association Studies
Advances in genotyping technology allowed for the development of GWAS - a genome-wide scanning for the association signals between the genotype and the phenotype using a regression model (Fig. 4). The ultimate aim of GWAS is to identify common risk variants (minor allele frequency >0.01). However, as common variants have generally small effects on phenotypes, for signal detection in GWAS, large sample sizes are essential. With adequate sample sizes, GWASs provided numerous statistically significant association signals for many disorders with complex etiologies, including metabolic, neurodegenerative, and neuropsychiatric disorders [23]. Although some GWAS analyses employ mixed effect models or logistic regression, we give details for only the simple linear model (Fig. 4).
Statistical model testing for the association between genetic variants and the trait of interest.
Statistical model testing for the association between genetic variants and the trait of interest.
Limitations of GWAS in Identifying Biologically Informative Loci
GWAS signals are limited in explaining the biological mechanisms involved in complex traits. While there are many reasons for this, we discuss two of the most important ones. First, most signals are found in the noncoding regions of the human genome [1, 24], i.e., genomic regions that do not code for proteins. Second, observing large correlations between alleles of SNPs in close proximity, which is generally known as LD (Fig. 2) [25]. A genomic region with elevated pairwise correlations between alleles at two or more loci is commonly denoted as an “LD block” [26, 27]. This phenomenon is a direct consequence of the lack of recombination at the LD block in meiosis through many consecutive generations [25]. Because any causal signal within an LD block induces associations with numerous other SNPs within the same block, further fine-mapping of the region is necessary in order to identify the causal variant(s).
Support for Biologically Relevant Mediators between Genes and Traits
While few GWAS signals are directly informative for the biological underpinnings of the disease, researchers showed that these genomic regions were enriched in regulatory effects on gene expression [2, 3] and protein abundance levels [28]. GWASs provided empirical support for transcript levels as a variant-trait mediator by showing enrichment of eQTL signals in association signals of many common disorders [3, 29, 30]. Similarly, pQTLs were shown to co-localize with GWAS signals in traits with complex etiologies [31]. Thus, the empirical evidence supports that RNA transcript levels and protein abundance serve as molecular mediators of variant-trait association.
Transcriptomics: TWASs
Currently, there are a large number of TWAS methods (Table 1). PrediXcan, TWAS, and S-PrediXcan were the first among these methods [13, 14, 32]. To summarize the analysis steps, TWAS methods first use a reference cohort that has genetic and transcriptomic data from the same subjects (Fig. 5a). Second, it selects the best prediction model for imputing the gene expression from the individual-level genotype (Fig. 5b). Third, it uses the best fitting prediction model to predict genetically regulated gene expression (GReX) (Fig. 5c). The fourth step is to compute the trait-gene expression association statistic by regressing outcome on GReX (and relevant covariates) (Fig. 5d). The prediction model in TWAS requires pre-computed weights for each gene and tissue under investigation. The weights fall into three main categories. The first is based on single tissue models including peripheral blood, whole blood, adipose tissue, and brain tissues. The second one is the multiple tissue-based model, whose weights are estimated from GTEX v8. The third one is the cross-tissue model weights [33] which are based on GTEx v6 and GTEx v8.
TWAS methods for inferring gene-trait association and eQTL colocalization
Method . | GWAS input . | References . | Prediction model or statistical test . | Tissue . |
---|---|---|---|---|
Gene-level TWAS methods | ||||
JEPEG | GSS | Lee et al., [34] 2015 | Multivariate gene based | Single |
PrediXcan | GT | Gamazon et al., [13] 2015 | Elastic net, LASSO | Single |
TWAS/TWAS-FUSION | GT/GSS | Gusev et al., [14] 2016 | BLUP, BSLMM/Elastic net, LASSO | Single |
JEPEGMIX | GSS | Lee et al., [35] 2016 | Mix ancestry multivariate gene based | Single |
SMR | GSS | Zhu et al., [36] 2016 | MR | Single |
S-PrediXcan | GSS | Barbeira et al., [37] 2018 | Elastic net, LASSO | Single |
COMM | GT | Yang et al., [38] 2019 | Mixed joint model | Single |
MultiXcan | GSS | Barbeira et al., [32] 2019 | Cross tissue PCA regression | Multiple |
UTMOST | GSS | Hu et al., [39] 2019 | Cross tissue Group LASSO and GBJ | Multiple |
TIGAR | GT/GSS | Nagpal et al., [40] 2019 | Dirichlet process regression | Single |
COMM-S2 | GSS | Yang et al., [41] 2020 | Mixed joint model | Single |
TisCoMM/ TisCoMM-S2 | GT/GSS | Shi et al., [42] 2020 | Mixed joint model | Multiple |
PMR-Egger | GT/GSS | Yuan et al., [43] 2020 | Probabilistic two-sample MR | Single |
BGW-TWAS | GSS | Luningham et al., [44] 2020 | Bayesian variable selection regression* | Single |
MR-JTI | GT | Zhou et al., [45] 2020 | MR | Multiple |
InTACT | GT/GSS | Bae et al., [46] 2021 | BLUP and Cauchy distribution | Multiple |
VC-TWAS | GT/GSS | Tang et al., [47] 2021 | SKAT** | Single |
TWAS fine-mapping methods | ||||
FOCUS | GSS | Mancuso et al., [48] 2019 | BSLMM, GBLUP | Single |
FOGS | GSS | Wu and Pan [49] 2020 | SPU*** | Single |
HEIDI | GSS | Zhu et al., [36] 2016 | MR | Single |
TWAS pathway analysis | ||||
JEPEGMIX2-P | GSS | Chatzinakos et al., [50] 2020 | Pathway analysis in multiple ancestries | Single |
TWAS-GSEA | GSS | Pain et al., [51] 2019 | Linear mixed model | Single |
eQTL/GWAS signal-based colocalization methods | ||||
Coloc | GSS | Giambartolomei et al., [52] 2014 | Bayesian test (colocalization) | Single |
eCAVIAR | GSS | Hormozdiari et al., [53] 2016 | Bayesian test (posterior colocalization probability) | Single |
Enloc | GSS | Wen et al., [54] 2017 | Bayesian hierarchical model | Single |
Method . | GWAS input . | References . | Prediction model or statistical test . | Tissue . |
---|---|---|---|---|
Gene-level TWAS methods | ||||
JEPEG | GSS | Lee et al., [34] 2015 | Multivariate gene based | Single |
PrediXcan | GT | Gamazon et al., [13] 2015 | Elastic net, LASSO | Single |
TWAS/TWAS-FUSION | GT/GSS | Gusev et al., [14] 2016 | BLUP, BSLMM/Elastic net, LASSO | Single |
JEPEGMIX | GSS | Lee et al., [35] 2016 | Mix ancestry multivariate gene based | Single |
SMR | GSS | Zhu et al., [36] 2016 | MR | Single |
S-PrediXcan | GSS | Barbeira et al., [37] 2018 | Elastic net, LASSO | Single |
COMM | GT | Yang et al., [38] 2019 | Mixed joint model | Single |
MultiXcan | GSS | Barbeira et al., [32] 2019 | Cross tissue PCA regression | Multiple |
UTMOST | GSS | Hu et al., [39] 2019 | Cross tissue Group LASSO and GBJ | Multiple |
TIGAR | GT/GSS | Nagpal et al., [40] 2019 | Dirichlet process regression | Single |
COMM-S2 | GSS | Yang et al., [41] 2020 | Mixed joint model | Single |
TisCoMM/ TisCoMM-S2 | GT/GSS | Shi et al., [42] 2020 | Mixed joint model | Multiple |
PMR-Egger | GT/GSS | Yuan et al., [43] 2020 | Probabilistic two-sample MR | Single |
BGW-TWAS | GSS | Luningham et al., [44] 2020 | Bayesian variable selection regression* | Single |
MR-JTI | GT | Zhou et al., [45] 2020 | MR | Multiple |
InTACT | GT/GSS | Bae et al., [46] 2021 | BLUP and Cauchy distribution | Multiple |
VC-TWAS | GT/GSS | Tang et al., [47] 2021 | SKAT** | Single |
TWAS fine-mapping methods | ||||
FOCUS | GSS | Mancuso et al., [48] 2019 | BSLMM, GBLUP | Single |
FOGS | GSS | Wu and Pan [49] 2020 | SPU*** | Single |
HEIDI | GSS | Zhu et al., [36] 2016 | MR | Single |
TWAS pathway analysis | ||||
JEPEGMIX2-P | GSS | Chatzinakos et al., [50] 2020 | Pathway analysis in multiple ancestries | Single |
TWAS-GSEA | GSS | Pain et al., [51] 2019 | Linear mixed model | Single |
eQTL/GWAS signal-based colocalization methods | ||||
Coloc | GSS | Giambartolomei et al., [52] 2014 | Bayesian test (colocalization) | Single |
eCAVIAR | GSS | Hormozdiari et al., [53] 2016 | Bayesian test (posterior colocalization probability) | Single |
Enloc | GSS | Wen et al., [54] 2017 | Bayesian hierarchical model | Single |
GT, genotype data as input from GWAS; GSS, GWAS summary statistics; JEPEG, joint effect on phenotype of eQTL/functional SNPs associated with a gene; TWAS, transcriptome-wide association, JEPEGMIX, JEPEG for mixed ethnicity cohorts; SMR, summary data-based Mendelian randomization, COMM, collaborative mixed model; UTMOST, uunified test for molecular signature; GBJ; generalized Berk-Jones; TIGAR, transcriptome integrated genetic association resource; COMM-S2, collaborative mixed models for GWAS summary statistics; TisCoMM, tissue-specific collaborative mixed model; PMR-Egger, probabilistic two-sample Mendelian randomization-Egger regression; BGW-TWAS, Bayesian genome-wide TWAS; MR-JTI, Mendelian randomization joint tissue imputation; InTACT, integrative test of associations via Cauchy transformation; VC-TWAS, novel variance-component TWAS; FOCUS, fine-mapping of causal genes; FOGS, fine-mapping of gene sets; JEPEGMIX2-P, JEPEGMIX2 pathway; TWAS-GSEA, TWAS-based gene set enrichment analysis; HEIDI, heterogeneity in dependent instruments; coloc, colocalization; eCAVIAR, eQTL causal variants identification in associated regions; enloc, enrichment estimation-aided colocalization analysis; BLUP, best linear unbiased predictor; LASSO, least absolute shrinkage and selection operator; PCA, principal component analysis; SUP, sum of powered score; BSLMM, Bayesian sparse linear mixed model .
*The method for statistical inference was from another study [55].
**The statistical model is published elsewhere in detail apart from the main research article [56].
***The primary method relies on another study [57].
a Gene expression matrices have transcript-level measurement for each gene by rows. Columns are associated with individual donors for different tissues shown as color coded. The intensity of the color shows the amount of the transcript levels. b Transcriptome imputation relies on the eGene statistical prediction model to assign weights to each cis-eQTL. c These weights from GReX model will be used for imputing transcript levels. d TWAS is association between trait and predicted transcript level.
a Gene expression matrices have transcript-level measurement for each gene by rows. Columns are associated with individual donors for different tissues shown as color coded. The intensity of the color shows the amount of the transcript levels. b Transcriptome imputation relies on the eGene statistical prediction model to assign weights to each cis-eQTL. c These weights from GReX model will be used for imputing transcript levels. d TWAS is association between trait and predicted transcript level.
Subject-level genotype data for cohorts are not easy to access due to privacy concerns. Subsequently, S-PrediXcan and TWAS-FUSION do not require individual level genetic data by employing only GWAS summary statistics for traits under investigation. Because these methods infer only association and are not directly concerned with the causality from GReX to trait, we refer to them as primary TWAS methods.
Primary TWAS Methods
PrediXcan. Among primary TWAS methods, one of the first was PrediXcan [13]. It was followed by its GWAS summary statistics-based version, S-PrediXcan [37]. These methods select the optimal prediction model using regularization methods, e.g., LASSO [58] or elastic net [59]. The prediction models used in PrediXcan are available in PredictDB (see data resources listed in Data Availability Statement). Some of the most recent updates on transcriptome prediction models are based on PsychENCODE [60], GTEx v8 [8] and its cross-tissue models [39]. While for GTEx v8, they used elastic net for the model selection method, the improved versions employ a multivariate adaptive shrinkage (mash) [61]. This method also has an R implementation called mashr (see data resources listed in Data Availability Statement).
TWAS/TWAS-FUSION. TWAS [14] and its updated version, TWAS-FUSION, employ LD between SNP pairs and their GWAS Z-scores to impute the transcriptomic Z-score for a gene. To generate these weights for each SNP from the best GReX prediction model, TWAS-FUSION uses the best linear unbiased predictor, Bayesian sparse linear mixed model [62], elastic net, LASSO, or top SNP. The pre-computed weights for prediction models are publicly available (see data resources listed in Data Availability Statement).
JEPEG/JEPEGMIX
JEPEG [34] is another TWAS-like gene-level method that modeled several classes of functional effects of SNP (eQTL, transcription factor binding, or miRNA associated) on the phenotype. Its extension to cosmopolitan cohorts, JEPEGMIX, became a pure gene-level TWAS tool later [35]. The latest iteration of the tool, JEPEGMIX2-P, added the much-needed pathway-level inference for TWAS analysis. To the best of our knowledge, this method is still the only tool that makes pathway inferences by directly estimating the LD of gene-level TWAS statistics (even in multiple ancestry cohorts).
SMR
Since the causal path from genetic variants to trait is posited to be mediated by gene expression, Mendelian randomization (MR) methods are very effective in estimating the effect of gene expression on the trait of interest. For statistical inference, MR tools need to estimate the effect of genetic variants (instrumental variables) on (i) gene expression (exposure) and (ii) trait (outcome). Summary Mendelian randomization (SMR) extended the MR framework to infer gene expression-trait association using only GWAS summary statistics [36]. Its main advantage is that it does not require pre-computed predictive weights for transcripts/protein abundances, unlike many other TWAS methods (S-PrediXcan, TWAS-FUSION, JEPEGMIX, etc.). SMR requires inputs of summary statistics from GWAS and eQTL studies, which are assumed to be independent cohorts.
Multiple Tissue TWAS Methods
Multiple tissue integration TWAS methods aggregate TWAS effect estimates from different tissues. Some of these methods are updated versions of primary TWAS methods. It is well established that different types of tissues share eQTLs, especially the cis-eQTLs that most TWAS methods use [63]. Thus, prediction models can be improved by combining eQTL information across tissues. Because reference eQTL data were limited at the beginning, multi-tissue approaches were developed to improve prediction power for under-sampled tissues [32]. MultiXcan [32], UTMOST [39], TisCoMM/TisCoMM-S2 [42], MR-JTI [45], and InTACT [46] are examples of this group of TWAS methods. We briefly present some of these.
Among these methods, UTMOST [39] uses a generalized Berk Johns method to combine the joint tissue covariance structure and the single tissue TWAS effects. MultiXcan jointly analyzes PrediXcan effect estimates from all or multiple tissues [32]. MR-JTI is a multi-tissue integration TWAS that uses MR for causal inference. MR-JTI has been shown to outperform UTMOST and PrediXcan [45] (see online suppl. Table 1 for more details on the TWAS methods; for all online suppl. material, see https://doi.org/10.1159/000530223).
TWAS Fine-Mapping Methods
Similar to overlapping significant GWAS signals at the variant level, TWAS yields multiple gene signals at a single locus due to LD. To get closer to the true causal signals, and better prioritize genes, researchers use estimated LD between signals to probabilistically fine-map TWAS findings. Two such methods are TWAS-FOCUS [48] and FOGS [49] (Table 1). There is an updated version of TWAS-FOCUS, which implements multi-ancestry in TWAS fine-mapping (MA-FOCUS) [64].
Also, SMR provides the heterogeneity in dependent instruments (HEIDI) test. It is a goodness of fit for the causal model (smaller HEIDI p values indicate a poor fit for the causality assumption). Thus, unlike commonly used primary TWAS methods, which do not test for causality, SMR can be directly used to fine-map TWAS signals.
Fine-Mapping Based on the Colocalization of eQTL and GWAS Signals
To eliminate the confounding due to LD and improve the detection of more likely causal loci, researchers have also developed methods for colocalization [53]. They use eQTL and trait GWASs to probabilistically assess whether variants are likely causal for both the molecular mediator (expression of the gene under investigation for TWAS) and the outcome. This approach can also be applied to proteomics by substituting pQTL/protein abundance for eQTL/gene expression. Among the most well-known colocalization methods, we can list Coloc [52], eCAVIAR [53], and enloc [54].
Transcriptome Reference Repositories with Human Tissues and Cell Lines
For computing the weights for SNPs used in gene expression prediction models in various tissues, analytical tools require a reference transcriptome from public repositories (Table 2). These repositories have both genomic (DNA sequence or genotype) and transcriptomic (RNA sequence or array) information either at the bulk tissue [8] or single-cell level [65] from the same cohort (Fig. 6).
Biorepositories have human tissue level or single-cell RNA transcript levels and genotype data
Repository or study . | Samples . | No. of donors . | Reference . | Genotype . | Transcript . | No. of distinct eGenes or probes . |
---|---|---|---|---|---|---|
Population based | ||||||
eQTLGen | Peripheral blood | 31,684 | Võsa et al., [21] 2021 | SNP array | Expression array and RNA-seq | 19,942 |
CAGE | Peripheral blood | 2,765 | Lloyd-Jones et al., [66] 2017 | SNP array | Expression array | 36,778 |
ALSPAC | 159 placenta tissues | 869 | Bryois et al., [67] 2014 | SNP array | Bulk tissue RNA-seq | 3,534 |
GTEx version 8 | 47 different tissues and 2 cell lines | 838 | Aguet et al., [8] 2020 | WGS | Bulk tissue RNAseq | 143 trans/23,268 cis |
Geuvadis | Lymphoblastoid cell lines | 462* | Lappalainen et al., 2013 | SNP array | Bulk cellRNAseq | 13,703 |
MuTHER | Subcutaneous adipose/lymphoblastoid cells/skin | 166/156/160 | Nica et al., [68] 2011 | SNP array | Whole genome expression array | 1,822 |
Case-control design | ||||||
PEC | Bulk and single cell | 1,387** | Wang et al., [69] 2018 | SNP array and WGS | Single cell and bulk tissue RNA-seq | 15,626 |
DGN | Peripheral blood | 463 cases with MDD/459 controls | Battle et al., [70] 2014 | SNP array | Bulk tissue RNAseq | 8,208 |
Brain-eMeta version 1 | Brain tissues | 1,194*** | Qi et al., [71] 2018 | SNP array/sequencing | Expression array and RNA-seq | 28,538 |
Repository or study . | Samples . | No. of donors . | Reference . | Genotype . | Transcript . | No. of distinct eGenes or probes . |
---|---|---|---|---|---|---|
Population based | ||||||
eQTLGen | Peripheral blood | 31,684 | Võsa et al., [21] 2021 | SNP array | Expression array and RNA-seq | 19,942 |
CAGE | Peripheral blood | 2,765 | Lloyd-Jones et al., [66] 2017 | SNP array | Expression array | 36,778 |
ALSPAC | 159 placenta tissues | 869 | Bryois et al., [67] 2014 | SNP array | Bulk tissue RNA-seq | 3,534 |
GTEx version 8 | 47 different tissues and 2 cell lines | 838 | Aguet et al., [8] 2020 | WGS | Bulk tissue RNAseq | 143 trans/23,268 cis |
Geuvadis | Lymphoblastoid cell lines | 462* | Lappalainen et al., 2013 | SNP array | Bulk cellRNAseq | 13,703 |
MuTHER | Subcutaneous adipose/lymphoblastoid cells/skin | 166/156/160 | Nica et al., [68] 2011 | SNP array | Whole genome expression array | 1,822 |
Case-control design | ||||||
PEC | Bulk and single cell | 1,387** | Wang et al., [69] 2018 | SNP array and WGS | Single cell and bulk tissue RNA-seq | 15,626 |
DGN | Peripheral blood | 463 cases with MDD/459 controls | Battle et al., [70] 2014 | SNP array | Bulk tissue RNAseq | 8,208 |
Brain-eMeta version 1 | Brain tissues | 1,194*** | Qi et al., [71] 2018 | SNP array/sequencing | Expression array and RNA-seq | 28,538 |
These bio-specimens are sometimes obtained from multiple discovery cohorts of donors.
CAGE, Consortium for the Architecture of Gene Expression; GTEx, genotype tissue expression; GEUVADIS, Genetic European in Health and Disease; ALSPAC, Avon Longitudinal Study of Parents and Children; MuTHER, Multiple tissue Human expression resource; PEC, PsychENCODE; DGN, Depression Genes and Networks study.
*462 samples in eQTL discovery.
**eQTL analysis is a subset of donors from GTEx, BrainSpan, CommonMind, CommonMind-HBCC, BrainGVEX, Lieberman schizophrenia control, BipSeq, UCLA-ASD, and Yale-ASD studies.
***Effective sample size of eQTL meta-analysis based on a subset of donors from GTEx brain, CMC, and ROSMAP.
Reference transcriptome biorepositories have large data on both genotype and the transcript-level measurements from the postmortem tissue donors.
Reference transcriptome biorepositories have large data on both genotype and the transcript-level measurements from the postmortem tissue donors.
Among various repositories, the most comprehensive is GTEx [8]. In version 8, GTEx contains data from 52 different human tissues and 2 cell lines from 948 individuals. Among all, 67.15% of these individuals are female, 84.6% are of European and 12.9% are of African ancestry [8, 72‒74]. Genotype data were obtained from 838 subjects. Other relatively larger eQTL repositories are PsychENCODE [60], eQTLGen [21], CAGE [66], Geuvadis [7], ALSPAC [67], MuTHER [68], DGN (case-control study) [70] and a meta-analysis of brain tissue eQTLs (Brain-eMeta) [71].
PsychENCODE combines data from participating research groups under one data repository for ease of data sharing [60, 69]. Their web portal includes RNA sequencing data from postmortem bulk brain tissue and single cells, as well as genotype data.
eQTLgen is the largest meta-analysis of 37 (peripheral) blood eQTL studies [21]. Cis-eQTL detection yielded 16,987 eGenes, which were replicated in an independent eQTL mapping study on single cells from whole blood (monocytes, natural killer cells, CD8+ T cells, and lymphocytes). The majority of cis-eQTL (92%) were not distant (<100 KB) from the transcription start or end site of eGenes. Some of the more distal cis-eQTLs were located in Hi-C regions.
The Consortium for the Architecture of Gene Expression (CAGE) study includes highly related individuals in their analysis [66]. They meta-analyzed the gene expression data on whole blood from five different study cohorts (BSGS, coronary artery disease [CAD] [75], Emory-Georgia tech center for health discovery and well-being [CHDWB] [76], Estonian genome center of the University of Tartu [EGCUT] [77], Morocco [78]). These cohorts were predominantly of European ancestry, except the Moroccan cohort. The CAGE study is included in the eQTLgen consortia study.
One of the earliest large-scale blood eQTL mappings in humans was conducted by the Geuvadis consortium [7]. It contains RNA-seq data for lymphoblastoid cell lines. Geuvadis includes subjects of Yoruba and European descent who participated in the 1000 Genomes project phase 1.
For the brain, brain-eMeta (BrainMeta version 1) meta-analyzed several eQTL studies (GTEx brain tissues, Common Mind Consortium [CMC] [79], GTEx [74], and ROSMAP [80]). (Meta-analysis of cis-eQTL in correlated samples is the method for combining effect estimates [71]). Aside from the study by Siebert et al. [81], it is one of the largest publicly available brain eQTL datasets. In its latest update (BrainMeta version 2), the sample size increased to 2,865 [82] (see online suppl. Table 2 for more detail).
The remaining eQTL repositories, which we briefly describe below, are smaller in terms of the number of donors. Unlike most other studies, the eQTL discovery cohort from ALSPAC reported CNVs as eQTLs [67].
In the multiple tissue human expression resource (MuTHER) study, authors conducted eQTL mapping on cis-regulated gene expression in two different tissues (skin, fat) and 1 cell line (lymphoblastoid) sampled from female twins (age above 40 years) with European descent [68]. They also showed that eQTLs for 12.5% of eGenes are shared among all tissues and for 71.44% of eGenes are relevant to a single tissue.
The Depression Genes and Networks study (DGN) cohort includes case-control individuals with major depression of European ancestry [70]. The age range is between 21 and 60 years. The DGN study also identified the splicing QTLs and allelic expression.
Proteomics: PWASs
DNA is transcribed into mRNA, and then mRNA is translated into protein [83]. Therefore, proteins could be considered molecular mediators between genes and traits. Thus, identifying a group of fluid/tissue-specific proteins in a molecular pathway associated with the risk for psychiatric disorders (PDs) could help researchers better understand the etiology and identify possible drug targets for treatment. The levels of these proteins in the blood serum would also serve as possible biomarkers for the prognosis or diagnosis of the PDs.
Although the biochemical pathway from genes to proteins is causal, the correlation between RNA and protein abundance is not particularly strong in humans [84] and yeast [85]. For instance, only 40% of the variance in the transcriptome was explained by the variance in the proteome of mammalian cells [86]. In a study of human cell lines, 67% of the explained variance in protein abundance was attributable to transcript levels and sequence elements [87]. Similarly, analysis of protein abundance in human lymphoblastoid cell lines showed that many pQTLs are not eQTLs for the same gene [88]. These findings suggest that the regulatory mechanisms in transcript levels and protein abundance are different. Thus, proteome studies provide additional information to the findings at transcriptome level.
Testing the mediator effect of pQTL on traits in PWAS is plausible because (i) it has been shown that the human proteome is under genetic control [89] and (ii) GWAS of protein abundance reported that the common variation has heritable effects on the plasma protein [90]. So, PWAS could employ the same methodological framework used for TWAS. Recent PWAS analyses suggest that 70% of the significant gene-level associations do not yield TWAS signals [91]. It is reasonable to conclude that PWAS complements TWAS findings.
PWAS Methods
Most PWAS methods mimic TWAS methods [13, 14, 36]. The main difference between the TWAS and PWAS paradigms is that the estimation of SNP weights in PWAS is now based on a reference proteome study. With more blood and brain pQTL reference data available, researchers conducted PWAS on the PD GWAS [91, 92]. Unfortunately, there are still very few pre-computed weights to conduct PWAS. Predictive weights for PWAS analysis are available for the brain [92] and blood [22]. Researchers might apply MR methods, such as SMR, that do not require computing weights.
Reference Proteome Repositories
Reference proteome datasets are large-scale proteome studies applying various protein assay methods that detect and quantify proteins in human solid tissues or plasma. Mass spectrophotometry methods are considered the gold standard for measuring protein abundance and identifying proteins in humans [93]. Other methods include slow off-rate modified aptamer [94] and Olink [95] assay technology. The majority of the large-scale pQTL mapping studies implement the aptamer-based technology to assay proteins (Table 3) (see online suppl. Table 3 for more detail).
Selected large-scale pQTL discovery studies to identify associations between protein abundance and genotype
Cohorts . | Samples . | No. of donors . | Reference . | Genotype . | Proteome technology . | No. of proteins . |
---|---|---|---|---|---|---|
Population-based study | ||||||
KORA/QMDiab studies | Blood plasma | 997/338 | Suhre et al., [96] 2017 | SNP array | SOMAscan | 1,124 |
INTERVAL study | Blood plasma | 3,301 | Sun et al., [4] 2018 | SNP array | SOMAscan | 2,994 |
AGES cohort 1 | Blood serum | 5,457 | Emilsson et al., [30] 2018 | SNP array | SOMAscan | 4,137 |
ROS/MAP and Banner Sun Health | dPFC | 330 and 149 | Robins et al., [11] 2021 | WGS | TMT isobaring labeling MS | 7,376 |
AGES cohort 2 | Blood serum | 5,368 | Gudjonsson et al., [97] 2021 | SNP array | SOMAscan | 4,782 |
AGES Reykjavik cohort | Blood serum | 5,343 | Emilsson et al., [98] 2022 | SNP array | SOMAscan | 1,942 |
Fenland cohort | Blood plasma | 10,708 | Pietzner et al., [20] 2021 | SNP array | SOMAscan | 4,775 |
ARIC | Blood serum | 9,084* | Zhang et al., [22] 2022 | SNP array | SOMAscan | 871** |
deCODE | Blood serum | 35,559 | Ferkingstad et al., [99] 2021 | SNP array/WGS | SOMAscan | 4,719 |
Case-control studies | ||||||
Cases with PDs and control*** | CSF | 133 | Sasayama et al., [12] 2017 | SNP array | SOMAscan | 1,126 |
Individuals with neurological disorders | CSF/blood plasma/parietal lobe cortex | 835/529/380 | Yang et al., [100] 2021 | SNP array | SOMAscan | 713/931/1,079 |
Cohorts . | Samples . | No. of donors . | Reference . | Genotype . | Proteome technology . | No. of proteins . |
---|---|---|---|---|---|---|
Population-based study | ||||||
KORA/QMDiab studies | Blood plasma | 997/338 | Suhre et al., [96] 2017 | SNP array | SOMAscan | 1,124 |
INTERVAL study | Blood plasma | 3,301 | Sun et al., [4] 2018 | SNP array | SOMAscan | 2,994 |
AGES cohort 1 | Blood serum | 5,457 | Emilsson et al., [30] 2018 | SNP array | SOMAscan | 4,137 |
ROS/MAP and Banner Sun Health | dPFC | 330 and 149 | Robins et al., [11] 2021 | WGS | TMT isobaring labeling MS | 7,376 |
AGES cohort 2 | Blood serum | 5,368 | Gudjonsson et al., [97] 2021 | SNP array | SOMAscan | 4,782 |
AGES Reykjavik cohort | Blood serum | 5,343 | Emilsson et al., [98] 2022 | SNP array | SOMAscan | 1,942 |
Fenland cohort | Blood plasma | 10,708 | Pietzner et al., [20] 2021 | SNP array | SOMAscan | 4,775 |
ARIC | Blood serum | 9,084* | Zhang et al., [22] 2022 | SNP array | SOMAscan | 871** |
deCODE | Blood serum | 35,559 | Ferkingstad et al., [99] 2021 | SNP array/WGS | SOMAscan | 4,719 |
Case-control studies | ||||||
Cases with PDs and control*** | CSF | 133 | Sasayama et al., [12] 2017 | SNP array | SOMAscan | 1,126 |
Individuals with neurological disorders | CSF/blood plasma/parietal lobe cortex | 835/529/380 | Yang et al., [100] 2021 | SNP array | SOMAscan | 713/931/1,079 |
We grouped the studies in two categories: population based and case-control studies.
ARIC, Atherosclerosis Risk in Communities; SOMAScan, slow off-rate modified aptamer scan; KORA, Cooperative Health Research in the Augsburg Region; QMDiab, The Qatar Metabolomics Study on Diabetes; CSF, cerebro-spinal fluid; ROS, Religious Orders Study; MAP, Memory and Aging Project; dPFC, dorsolateral prefrontal cortex; TMT, tandem mass tag; MS, mass spectrometry; AGES, The Age, Gene/Environment Susceptibility-Reykjavik Study.
*After QC steps, 1,871 and 7,213 individuals with African American and European ancestry, respectively.
**Number of proteins assayed with both Olink and SOMAscan v4.
***Sixty-six percent of participants have psychiatric condition, 28 schizophrenia, 15 bipolar disorder, and 45 major depressive disorder.
The development in proteome technology allowed high-throughput analysis of proteins in large cohorts [4, 18]. Initially, the majority of studies used blood serum to identify pQTL [4, 17, 28, 97]. However, assays on blood specimens might not be fully representative of other organ tissues, such as the brain. So, after the initial pQTL discoveries in blood, researchers also investigated cerebrospinal fluid (CSF) [12, 100], postmortem tissues [101], and the brain [11].
In addition to the abovementioned studies, there are other smaller pQTL studies on different tissues, such as the liver [9] and human induced pluripotent cell lines [10]. For example, the GTEx consortium analyzed the human proteome in 32 different tissues [101]. However, the sample size (n = 14) is rather low compared to the large-scale serum proteome analysis [4, 97].
The mapping of pQTLs in three different tissues showed that tissue-shared pQTLs are more likely to be cis-pQTLs [100]. Thus, it is likely that some blood pQTLs, especially cis-pQTLs, are shared with brain tissues, supporting the idea of searching for brain biomarkers of PDs in plasma. Also, it is known that microglia in the central nervous system are immune cells and there are some shared pQTLs between the CNS and CSF [102]. For these reasons and due to their physical proximity, CSF might be viewed as a better pQTL proxy than blood for the under-sampled brain tissue.
Discussion
GWAS identifies a large number of loci associated with the risk for PDs in well-powered studies. However, these findings alone do not provide enough information on the molecular underpinnings of PDs. We better understand this molecular etiology by integrating GWAS trait information with molecular endophenotypes [103], which mediate the association between the genome and traits. Here, we focus on two well-known mediators, i.e., transcript and protein levels. Although proteomes and transcriptomes are not investigated in most GWAS cohorts, transcript and protein abundance can be predicted from genotype or GWAS summary statistics using the TWAS and PWAS paradigms. In this review, we summarized (i) available TWAS/PWAS methods and (ii) large publicly available eQTL and pQTL repositories that serve as the reference transcriptome/proteome for the TWAS/PWAS analysis. As a guide for researchers, we further categorized TWAS/PWAS methods by their intended end use, i.e., (i) primary, (ii) fine-mapping, and (iii) pathway-aware tools (Table 1). This work would help researchers choose the TWAS/PWAS methods that are better suited for their analyses.
TWAS/PWAS are powerful paradigms because they model molecular mediators and add biological context to the GWAS findings. In addition to this, TWAS/PWAS also greatly reduce the burden of multiple testing. For instance, TWAS/PWAS only pay the multiple testing penalty for the number (∼20 K) of expressed genes/proteins and not the number (>5 M) of variants [14]. Reducing the number of tests would likely increase the power of detecting signals.
Researchers could uncover a larger number of TWAS/PWAS signals whenever large transcriptome and proteome reference datasets are available for numerous (and relevant) tissues. When such datasets are smaller, as in brain-related tissues, a meta-analysis of TWAS signals across tissues might help increase the discovery of signals. This approach yielded both previously identified and novel loci for nicotine addiction when meta-analyzing TWAS statistics from 13 different brain regions [104].
There are many examples of genes that were prioritized by TWAS as likely causal for PDs. For instance, ADH1B, which codes for alcohol metabolizing enzyme, was found to be significant in problematic alcohol use S-MultiXcan analysis using all GTEx tissues [105]. In the recent PGC3 SCZ GWAS, one of the genes prioritized by brain TWAS was RERE. It was discussed as a putatively causal gene that is biologically relevant to the SCZ pathology [106]. Functional genetic studies assessing effects of SCZ risk variants on chromatin accessibility have shown that some of these variants have a cis-eQTL effect on RERE [107]. For BIP, twenty-two genes were included in the credible gene set based on TWAS-FUSION and FOCUS results. The most significant gene was DCLK3 in brain TWAS (using reference eQTL PsychENCODE brain as transcriptomic reference) [108]. A recent major depression study conducted brain PWAS [109] using TWAS-FUSION and SMR [110, 111]. This study identified 24 proteins that were reported as significant by SMR, HEIDI, and FUSION analyses. The strongest signal was for B3GLCT that was thought to play a role in synaptogenesis [109].
Although reference transcriptome and proteome datasets are still underpowered for certain tissues, e.g., the brain, we expect a significant increase in their sample size and cell type specificity over time. For instance, consecutive GTEx iterations substantially increased the reference transcriptome size [8, 72‒74]. There are efforts to discover eQTL at the single cell level that serve as transcriptome reference data [65, 112].
Ancestrally diverse reference cohorts will help scientists (i) identify population-specific eQTLs in transcriptome data [8] and (ii) increase cross-population generalizability of eQTLs [113]. Also, ancestral diversity will improve the detection of TWAS/PWAS signals because this results in better accuracy and specificity when imputing the transcriptome and proteome.
Progress in mapping eQTL and pQTL in non-European ancestral populations is still in its early phases. In Atherosclerosis Risk in Communities (ARIC), the African American cohort was part of the human serum pQTL mapping study [22]. In another study, researchers developed a method to map eQTL in a cohort of individuals with non-European ancestry, including individuals of African American and Hispanic ancestry [114]. The same research group also mapped pQTL in cohorts from populations of African American, Chinese, and Hispanic ancestry [115]. These studies are still limited in terms of sample size.
Limitations and Issues Associated with the TWAS and PWAS Methods
- 1.
Primary TWAS/PWAS methods only test for association, but not causality. For improved fine-mapping, such methods need to be followed by fine-mapping/colocalization analyses.
- 2.
Primary non-MR TWAS methods may yield false-positive rates [116].
- 3.
The brain eQTL and brain pQTL datasets include some individuals with certain neurological and PDs.
- 4.
Since the sample size for brain pQTL reference datasets is smaller than that for the blood pQTL, PWAS signal detection and resolution suffers from low sample size.
- 5.
TWAS methods, including SMR, cannot eliminate horizontal pleiotropy.
- 6.
Reference eQTL/pQTL LD data that are used in TWAS/PWAS might not be representative of the LD from the GWAS cohort.
- 7.
The overlap between GWAS and variants in TWAS prediction model can be limited due to the fact that the two approaches target different types of variants. This might limit TWAS’s detection power [117].
Conclusion
We believe that applying TWAS and PWAS tools to GWAS summary statistics greatly helps researchers identify risk genes for PDs. Applied researchers might find it more effective to use SMR as the default method for TWAS and PWAS because it eliminates the need for pre-computing SNP weights every time the reference cohort or prediction method updates. Moreover, SMR can assist in the process of signal fine-mapping by providing the HEIDI test to identify likely causal risk loci.
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
Funding Sources
Research in this work was funded by AA022537 (Huseyin Gedik, Brian P. Riley, and Silviu-Alin Bacanu), R01MH118239, and R01DA052453 (Vladimir I. Vladimirov and Silviu-Alin Bacanu) and MH125938, MH126358, NARSAD grant 28632 (Roseann E. Peterson).
Author Contributions
The review was conceptualized and written by Huseyin Gedik, Roseann E. Peterson, Brien P. Riley, Vladimir I. Vladimirov, and Silviu-Alin Bacanu. Throughout all the steps, Vladimir I. Vladimirov and Silviu-Alin Bacanu supervised the investigation.
Data Availability Statement
Some of the datasets referred to in this review are publicly available. The URLs can be found in the supplementary tables and they can be downloaded from HG’s GitHub: https://github.com/huseyingedik/A-review-of-integrative-post-GWAS-analyses-relevant-to-psychiatric-disorders-imputing-transcriptome/raw/main/Supplementary_tables.xlsx. PrediXcan predictive weight database: https://predictdb.org/ (accessed on 10/20/2022). TWAS-FUSION prediction model database: http://gusevlab.org/projects/fusion/ (accessed on 11/02/2022). mashr: https://github.com/stephenslab/mashr (accessed on 11/05/2022). Further inquiries can be directed to the corresponding author.