Abstract
Objectives: There is evidence to suggest that asthma pathogenesis is affected by both genetic and epigenetic variation independently, and there is some evidence to suggest that genetic-epigenetic interactions affect risk of asthma. However, little research has been done to identify such interactions on a genome-wide scale. The aim of this studies was to identify genes with genetic-epigenetic interactions associated with asthma. Methods: Using asthma case-control data, we applied a novel nonparametric gene-centric approach to test for interactions between multiple SNPs and CpG sites simultaneously in the vicinities of 18,178 genes across the genome. Results: Twelve genes, PF4, ATF3, TPRA1, HOPX, SCARNA18, STC1, OR10K1, UPK1B, LOC101928523, LHX6, CHMP4B, and LANCL1, exhibited statistically significant SNP-CpG interactions (false discovery rate = 0.05). Of these, three have previously been implicated in asthma risk (PF4, ATF3, and TPRA1). Follow-up analysis revealed statistically significant pairwise SNP-CpG interactions for several of these genes, including SCARNA18, LHX6, and LOC101928523 (p = 1.33E–04, 8.21E–04, 1.11E–03, respectively). Conclusions: Joint effects of genetic and epigenetic variation may play an important role in asthma pathogenesis. Statistical methods that simultaneously account for multiple variations across chromosomal regions may be needed to detect these types of effects on a genome-wide scale.
Introduction
Asthma is a chronic inflammatory disorder of the airways, characterized by airway hyperresponsiveness and airflow limitation. Asthma prevalence has increased in recent years [1], afflicting 300 million worldwide, and is highly variable across population centers, ranging from 1 to 18% [2]. In the United States, more than 23 million people were diagnosed with asthma as of August 2015, including over 6 million children [3].
Numerous genome-wide association studies have shown a substantial genomic contribution to the etiology of asthma, with heritability estimates varying between 35 and 95% [4-6]. However, only a fraction of this variation has been explained by specific causal variants, such as those near Chr17q12–21 [7, 8]. Epigenetic mechanisms, another source of disease variation, may explain a portion of this missing heritability [9-11]. Epigenomic alterations, while heritable, may also occur as a response to the endogenous and exogenous environment, and contribute to asthma pathogenesis [9, 12].
There is evidence to suggest that genetic and epigenetic variation interact synergistically to affect gene expression [13]. In fact, interactions between SNPs and DNA methylation in the genomic regions of T-helper 2 pathway genes IL4R and others were found to affect asthma risk [14, 15], possibly through expression of these genes.
Methods for detecting statistical interactions in population studies often involve linear or logistic models that include a product term to represent the interaction effect. These pairwise approaches suffer from a multiple testing challenge [16, 17], which is particularly severe if applied on a genome-wide scale. For example, in a study with 1 million SNPs and 450,000 CpG sites, 450 billion tests would be required. In addition, pairwise approaches may not optimally leverage information from dependencies between SNPs and CpG sites across contiguous genomic regions. Using case-control data, we applied a novel gene-centric approach to test for interactions between multiple SNPs and CpG sites simultaneously in the neighborhood of each gene across the whole genome.
Methods
Asthma BRIDGE Data
These data are from the Asthma BioRepository for Integrative Genomic Exploration (Asthma BRIDGE) [18-21], which is publicly accessible and includes 1,542 individuals with asthma and controls with comprehensive phenotype and genomic data. Asthma was defined via a questionnaire by the presence of asthma symptoms, use of an inhaled bronchodilator at least twice per week, or use of a daily asthma medication for the 6 months before the screening interview. Genome-wide SNP and DNA methylation was obtained from whole blood samples from 576 participants. Samples were randomized to avoid confounding by experimental batch. The final data set included 356 asthmatic and 220 nonasthmatic adults, for whom we had complete data.
Genotyping and DNA Methylation
Genotyping methods are described in detail in Torgerson et al. [19]. Briefly, SNPs included all phase 2, release 21 consensus HapMap variants and were obtained from studies participating in the EVE consortium [19, 22]. The genotyping platforms, Illumina (1Mv1, 550k, 610k, 650k) and Affymetrix (500k, 6.0), varied by sample source. Quality control procedures included filtration for call rates (> 95%) and tests for agreement with Hardy-Weinberg expectations applied to SNPs oriented to the plus strand. High-density imputation using the 1000 Genomes reference panel was performed with the MACH software [23]. Data were checked for consistency in reference alleles and strand orientation, and SNPs were excluded based on low quality scores and examination of QQ plots comparing the distribution of association p values for genotyped and imputed markers. SNPs with minor allele frequencies less than 0.05 were excluded.
A total of 11 plates were run on the Illumina human 450K methylation platform. Data preprocessing steps included the application of norm-exponential background correction on the raw methylation data, plate by plate for all 11 plates. HG19 annotations, including gene symbols, chromosome number, and sequence position, were added via the Bioconductor package, FDb.InfiniumMethylation.hg19. One sample from a nonasthmatic participant was excluded due to having an extreme value of 1.0 for all probes. Methylation of CpG sites were assessed for outliers, and observations exceeding 3 interquartile ranges from the 25th and 75th percentiles, per Tukey’s outer fences [24], were removed unless they comprised greater than or equal to 5% of the data for that variable. The 11 norm-exponential corrected plates were combined, and 4 additional samples were excluded as outliers. Dye bias correction was done using one sample as reference for all others, and quantile normalization was applied to the resulting data set.
Canonical Analysis of Set Interactions
Here we describe and apply a new method, canonical analysis of set interactions (CASI), designed to identify statistical interactions between high-order sets of features in a simultaneous approach. In contrast to direct interactions in which DNA sequence variation affects DNA methylation, we are seeking to identify genomic regions underlying genes where there is synergism between the effects of SNPs and methylation on asthma suscepti-bility.
CASI is described in detail in the Appendix Methods and is freely available to the public in a downloadable R package (https://github.com/USCbiostats/CASI). Briefly, it is a hypothesis testing approach to detect differences in correlations between linear combinations of SNPs and CpG sites in cases versus controls. In this study, SNPs and CpG sites were mapped to the nearest gene, thus defining the feature sets, with positions identified using the UCSC genome browser for all RefSeq genes. Overlapping isoforms of the same gene were combined to form a single full-length version. SNPs and CpG sites residing in flanking regions extending 2 kb on either side were included. A total of 18,178 gene-centric regions were identified across the genome. Using canonical correlation analysis (CCA), coefficients that maximize the correlation, the canonical correlation, between linear combinations of the SNPs and CpG sites are estimated in cases. The same coefficients are then used to compute linear combinations in controls. The CASI statistic is based on the scaled difference between Fisher transformed canonical correlation coefficients in cases versus controls. Estimation is performed in cases because it leverages the superior statistical power of the case-only interaction analysis approach [25]; dependencies among predictive variables in cases may indicate multiplicative interactions in their effects on disease [26, 27]. Due to the limitations of CCA in the number of variables relative to the sample size, sparse CCA as described by Witten et al. [28] is applied for sets with dimensions that are too large to be accommodated by conventional CCA. The parametric distribution of this test statistic, the scaled difference between cases and controls, is unknown. Therefore, we use a permutation-based procedure to estimate the null distribution by randomly permuting case status and computing the test statistic. The permutation analysis is conducted repeatedly until a sufficient number of values have been generated under the null. Rather than compute p values, a computationally efficient false discovery rate (FDR) approach is used that requires few permutations and yields confidence intervals that account for dependencies among tests as well as the number of permutations conducted [29]. For this study, 100 permutation analyses were conducted, and statistical significance was defined by an FDR threshold of 0.05.
Follow-Up Analyses in Significant Regions
SNPs and CpG sites can be ranked in their contribution to the test statistic by computing correlations, loadings, in cases with their corresponding linear combinations, the canonical variates. SNP-CpG site pairs that were the greatest contributors to statistically significant tests were identified by selecting all SNP-CpG pairs among those with loadings greater than 0.5 within a gene region. The loading threshold of 0.5 has been proposed as the “operational definition” of a large effect size [30]. Using logistic regression, conventional likelihood ratio tests of multiplicative interactions were conducted, including one-degree of freedom likelihood ratio tests for interaction and three-degree of freedom tests for the combined main and interaction effects. Statistical significance was assessed according to FDR level of 0.05, using a parametric version of Millstein and Volfson [31]. Variables found to be significantly correlated with asthma status (Table 1) were included as adjustment covariates in the logistic models. While the education level was not significant (p = 0.14), it was nevertheless included due to prior evidence of association with asthma. Final models were adjusted for gender, ancestry, family history of asthma, age at sample collection, and education level. The ancestry variables were composed of the top two principal components computed by the software Eigensoft [32] using 128 ancestry informative markers [33]. The primary results do not include adjustment for site, because two ABRIDGE sites contributed only cases (Table 1). However, sensitivity analyses were conducted by restricting to sites that included both cases and controls and including site as an adjustment covariate, and the results did not change substantially (Appendix Table A1).
Results
Performance of the CASI Approach in Simulation Studies
The simulation results demonstrate that the CASI hypothesis test is sensitive to correlation differences between cases and controls and that sensitivity is associated with magnitude of the differences (Appendix Fig. A1). We also found that CASI is a consistent test with asymptotics depending on effect size and that type I error is very low as the effect size approaches zero (Appendix Fig. A2). CASI performed quite well in the simulation results as compared to other related approaches, especially in the vicinity of conventional levels of FDR that imply statistical significance (Appendix Fig. A3). Applying a conventional pairwise logistic regression approach combined with FDR-evaluated significance to the same simulated data resulted in a complete failure to detect interactions, highlighting the utility of the proposed approach (Appendix Fig. A3).
Demographics of the Asthma BRIDGE Population
Of the 576 participants, 62% had asthma (Table 1). While males and females were evenly distributed among cases, there were more females than males among the controls. Asthma cases had a greater percentage of Hispanics whereas controls had a greater proportion of African Americans. Frequency of asthma prevalence varied by clinic site by design, with some sites recruiting only cases.
CASI Analysis of Asthma BRIDGE Data
Note that p values were not generated in the main analysis even as intermediate statistics because FDR was estimated directly from the observed and permuted CASI statistics. The estimated FDR and corresponding confidence intervals (CIs) for a series of increasingly stringent significance thresholds for the CASI test statistic demonstrate a downward trend with narrow CIs (Fig. 1), indicating that the CASI statistic is informative with respect to distinguishing observed from permuted data. However, this dynamic does not continue beyond about 4.35 where a minimum FDR occurs, implying that more extreme values of the CASI statistic do not correspond to lower rates of false discoveries. The threshold of 4.35 was therefore used to identify the most statistically significant results, yielding 12 genes: HOPX, SCARNA18, PF4, STC1, ATF3, OR10K1, UPK1B, LOC101928523, LHX6, CHMP4B, TPRA1, and LANCL1 (FDR = 0.050, CI = 0.024–0.104). Though we highlight these 12 genes due to low FDR, it is clear from the tight CIs bracketing FDR estimates at more permissive significance thresholds that substantially more such interactions are implied by these results.
Pairwise SNP-CpG Interaction Analysis
Pairwise analyses of SNPs and CpG sites from the 12 significant genomic regions with loadings of 0.5 or greater using logistic regression revealed evidence of pairwise interactions for 19 top-loading pairs (Table 2), corresponding to 4 of the 12 genomic regions, i.e., LOC101928523, SCARNA18, LHX6, and STC1.
Although the three-degree of freedom tests were significant for most of these pairs, the one-degree of freedom tests tended to be more significant, implying that main effects were not appreciably contributing. Nine individual SNPs and 10 CpG sites in LHX6, 3 SNPs and 2 CpG sites in LOC101928523, and 3 SNPs and 2 CpG sites in SCARNA18 met the criteria of having loadings greater than 0.5 (Fig. 2). The SNP-CpG pair with the greatest loadings did not always elicit a significant result for multiplicative interaction (Appendix Table A2); however, this is not surprising considering that the CASI statistic is formed from linear combinations of SNPs and CpGs. Multiple genomic regions, HOPX, PF4, ATF3, UPK1B, CHMP4B, TPRA1, and LANCL1, did not show evidence of interactions for their top loading SNP-CpG pair (Appendix Table A2). The lack of statistical significance for individual pairs may indicate that it is necessary to evaluate joint interactions between multiple SNPs and CpGs in order to have adequate power to detect the effects.
Additional pairwise interaction analyses were conducted for SNPs and CpGs with loadings greater than 0.5 underlying LHX6, LOC101928523, and SCARNA18 (Appendix Tables A3 and A4). Significant interactions were found for LHX6, where involved SNPs had negative loadings located at the center haplotype block interacting with CpG sites clustered to the right in a CpG island (Fig. 2; Table 3). Odds ratios (ORs) for the main effects of both SNPs and CpGs in LHX6 were not statistically significant (Table 3). However, interactions for 7 SNP-CpG pairs were significant, with ORs ranging from 0.53 (0.36–0.77) to 0.84 (0.74–0.96) (Table 3). High linkage disequilibrium (LD) between SNPs as well as dependencies among CpGs may explain the similarity in interaction effects that is apparent across SNP-CpG pairs. For LHX6, minor allele dose was associated with increasingly protective effects of methylation (Table 3). This trend is particularly apparent in the conditional ORs (ORmeth|SNP). For the most significant interaction in LHX6, individuals with the common GG genotype in rs10818651 had 9% greater odds of asthma for a 5% difference in cg21469772 methylation. Addition of 1 minor allele A reverses the association to a 21% decrease in the odds of asthma for 5% greater methylation, whereas individuals with AA had a 43% decrease. Similarly, for the next most significant interaction (rs10985567 and cg21213617), the odds of asthma for an individual with 5% higher methylation and GG (common homozygote) increases by 24% whereas the odds for an individual with 5% higher methylation and AA decreases by 66%.
In LHX6, the most statistically significant interaction involved the SNP with the greatest loading (rs10818651, loading = –0.86) (Table 2) that resides within the middle LD block (between base pair locations 124972042 and 124982500, indicated by vertical bars in Fig. 2), paired with CpG site cg01363324 (loading = 0.53). The opposite signs of the loadings for this pair (Fig. 2) convey that these variables are negatively correlated among cases (Fig. 3), which is reflected in the interaction pattern. Increasing minor allele dose of the SNP reflects an association between methylation level and asthma that is increasingly negative.
Boxplots of methylation values with jittered points can provide a visual demonstration of relationships underlying the significant interaction effects (Fig. 3). For example, LHX6 median methylation (cg04282082) is greater in individuals with asthma than controls with the GG genotype (rs10818651), but less within AG individuals. In general, for LHX6, increased methylation was protective in the presence of a minor allele (Table 3). In another example, median methylation of cg16688533 in STC1 is greater in individuals with asthma versus those without in AA individuals (rs9969426), but those with GG tend to be unmethylated and have asthma. Another clear example can be observed for LOC101928523, where median methylation (cg07956857) is greater in CC individuals with asthma (rs13301641), approximately equal in TC individuals but less in TT.
SNPs within the LOC101928523 genomic region were confined to a single haplotype block (Fig. 2). The highest loading SNPs, rs13301641 (loading = 0.87), rs11792474 (loading = 0.87), and rs75088949 (loading = –0.68) as well as the highest loading CpG site, cg07956857 (loading = –0.89) were the most statistically significant interactions for this genomic region (Appendix Table A3). As with LHX6, there was little evidence of main effects; however, unlike LHX6, minor allele dose was accompanied by both decreased and increased ORs. The most significant interaction for LOC101928523, between rs13301641 and cg07956857, involved loadings with opposite signs (Fig. 2), which reflected negative correlation between methylation level and number of minor alleles among individuals with asthma.
For SCARNA18, there were 3 SNPs, rs67216017, rs113665237, and rs10061690, all in high LD, with loadings greater than 0.5. Thus, interaction analysis results are indistinguishable among the 3 (Appendix Table A4). The loadings for SNPs and CpGs are concordant (Fig. 2), which is consistent with the idea that minor allele dose is associated with increasingly deleterious effects of methylation on asthma susceptibility. The range of methylation for cg14999833 is narrow (Fig. 3), hence it is appropriate to estimate the effect over a 1% change. Deletion of allele “A” near rs67216017 or “T” near rs113665237 appears to have an equivalent effect to presence of minor allele “A” at rs10061690 on the relationship between methylation at the CpG site cg14999833 and asthma. With respect to cg14999833, a biologically meaningful effect is observed for a 1% difference in methylation in contrast to cg20697188 where a 5% difference is within the range of the observed data.
Discussion
The proposed statistical approach, CASI, was able to overcome the substantial challenges imposed by high dimensional data to identify statistical interactions between methylation variation at CpG sites and DNA variation in SNPs. Part of the explanation for the adequate power that led to this success as well as the superior performance of CASI in simulated data in comparison to other similar methods is the fact that CASI leverages the concept underlying the case-only interaction method, known to be a powerful approach [25, 34]. This characteristic is unique to CASI among related set-based methods designed to detect statistical interactions [35-37]. Another unique characteristic of the CASI approach is the reliance on a computationally efficient permutation-based FDR estimator to define statistical significance. This is a nonparametric approach that is robust to feature distributions and yields CIs that bracket the FDR estimates, allowing uncertainty in those estimates to be quantified. These CIs play an especially important role in studies of weak effects where an FDR level of 0.05 may not be achievable.
A limitation of the CASI procedure is the inability to explicitly adjust for potential confounding covariates. However, this problem was mitigated in follow-up analyses by applying conventional logistic regression with covariates to pairs of features implied by CASI results. Like many conventional approaches, CASI is designed to detect linear associations, which limits the power for nonlinear relationships. However, considering that we were able to identify 12 genes with evidence of interactions using a sample size in the hundreds suggests that more interactions may be detectable if the sample size is increased substantially. Though the procedure requires considerable computational resources due to its reliance on permutations, the required number of permutations is small enough to allow genome-wide applications.
Though several previous studies have investigated statistical interactions between genetic and epigenetic variation [13-15, 38], we know of no others that have attempted a genome-wide assessment. CASI identified 12 genomic regions defined by the positions of 12 genes with evidence of statistical interactions between sets of SNPs and sets of methylated CpGs with respect to asthma risk. Of these 12 genes, 3 have previously been implicated in asthma risk or underlying biological pathways related to pathology of the disease, i.e., PF4 [39], ATF3 [40, 41], and TPRA1 [42]. Three genomic regions not previously implicated in asthma, LOC101928523, SCARNA18, and LHX6, contained the most statistically significant pairwise interactions between the considered individual SNPs and CpGs.
LHX6 (LIM homeobox domain 6) has not been implicated in asthma; however, it is a recognized transcriptional regulator that controls the differentiation and development of lymphoid cells [43]. LHX6 is known to be regulated epigenetically in lung cancer, where in vitro and in vivo studies found that in normal lung tissue it is readily expressed but downregulated or silenced in lung cancer cells in which the gene is hypermethylated [43]. Other evidence of epigenetic regulation in the vicinity of LHX6 has been found in head and neck squamous cell carcinomas [44]. Similar to the lung cancer study, hypermethylation of the CpG island in LHX6 was associated with transcriptional silencing of LHX6. These findings suggest that differential methylation near LHX6 plays a role in lung biology and lends credence to a potential role in asthma.
The 3 genes previously implicated in asthma pathogenesis are PF4, ATF3, and TPRA1. PF4 (Platelet Factor 4) is a protein-coding gene which functions as an inhibitor of T-cell function [45]. PF4 activation in the lung is a feature of the late inflammatory response to antigen challenge and may play an important role in allergic inflammation and asthma [46]. Atf3 (Activating transcription factor 3) is a negative regulator of allergic inflammation in mice challenged with ovalbumin [47] and deficiency in mice leads to the development of significantly increased airway hyperresponsiveness and pulmonary eosinophilia [40]. Significant increases in ATF3 mRNA have also been observed in patients with mild asthma as compared with nonasthmatic patients [48]. TPRA1 (transmembrane protein adipocyte associated 1) is an irritant-sensing cation channel expressed in TRPV1-positive, capsaicin-sensitive chemosensory neurons that innervate various organs, including the airways [49]. Various exogenous chemicals have been described to activate TRPA1, including agents recognized to trigger and/or worsen asthma such as diisocyanates, cigarette smoke, acrolein, and chlorine [42]. A potential role of TRPA1 in mediating allergen-induced asthmatic responses has been described in ovalbumin-sensitized mice, in which genetic deletion of Trpa1 or pretreatment with a selective Trpa1 antagonist reduced leukocyte infiltration, decreased cytokine and mucus production, and almost completely abolished airway hyperactivity [49]. Bessac et al. [50] suggested that TRPA1 may function as an integrator of immunological stimuli modulating inflammation in the airways. Furthermore, chemical irritant-induced activation of TRPA1 may trigger the release of neuropeptides and chemokines in the airways, thereby exacerbating the cellular and tissue inflammatory response observed in allergic individuals [51].
HOPX, OR10K1, UPK1B, CHMP4B, and LANCL1 were identified as significantly associated with asthma in the tests for joint interactions but require further investigation into their putative biological functions. There is scant prior evidence of a role for these genes in pulmonary or immune function, lung disease, or asthma, although HOPX is involved in the function of regulatory T cells [52].
Conclusions
We demonstrated that simultaneous consideration of genomic and epigenomic variation has the potential to identify genetic risk factors for asthma beyond individual genome-wide association studies or epigenetic screens. These results add to existing evidence suggesting a synergy between genomic and epigenomic variation affecting risk of asthma.
Acknowledgements
We thank the many Asthma Bridge participants who contributed tissue samples and personal data. We also thank the clinicians, and researchers who collected samples and recorded data. This work was supported by NIH grants, NIEHS grant No. R01ES022216, NHLBI grant No. R01HL118455, P01CA196569, and RC2 HL101543.
Appendix
Genetic-Epigenetic Interactions in Asthma Revealed by a Genome-Wide Gene-Centric Search
CASI Methodology
To calculate the CASI statistic we consider n observations of SNP and methylation measures with the two types of variables congregated into sets of size p and q, respectively, and assembled into a (nY + nN) × (p + q) matrix X, where
We designate SNPs by g, methylation probes by m, and asthma case and control by Y and N, respectively. CCA is used to find coefficients aˆkY and bˆkY for gY and mY, respectively, that maximize the Pearson correlation between their corresponding canonical variates, where each canonical variate is a linear combination of a variable set formed as rˆkY = gYaˆkY and lˆkY = mYbˆkY, where gY is a nY × p matrix, mY is a nY × q matrix, aˆkY is a p × 1 vector, bˆkY is a q × 1 vector, rˆkY is a nY × 1 vector, and lˆkY is a nY × 1 vector. The canonical correlations are based on Pearson function,
where
The first canonical correlation coefficient, pˆ1Y, is the maximum across all possible coefficients, aˆ1Y, and bˆ1Y, and subsequent canonical correlations are formed using linear combinations that are uncorrelated with the previous k-1 canonical variates. Only the first canonical correlation coefficient is used for the CASI statistic.
pˆ1Y is calculated from the subset of asthma cases and pˆ1N is obtained by applying coefficients aˆ1Y and bˆ1Y to gN and mN, control data. Thus, the CASI approach can be thought of as a multivariate analog to computing a case-only interaction estimate and then comparing it to an estimate using controls. It is important to emphasize that variates for the control subset are formed based on maximizing canonical correlation in the case subset; rˆ1N = gNaˆ1Y and lˆ1N = mNbˆ1Y, and correlation calculated by pˆ1N = pˆN {rˆ1N, lˆ1N} = cor (rˆ1N, lˆ1N). Permuted data sets are generated by permuting case-control phenotype labels. We will use obs to denote the observed or nonpermuted data and perm to denote SNP and methylation data with permuted case-control labels. Also, we will drop the subscript, 1, since we will be referring to the first canonical coefficient only. For variance stabilization, Fisher transformations,
are applied to the resulting canonical correlation coefficients,
and
and analogously for the permuted data,
and
for perm. ∈ {1, 2, ..., B} where B is the number of permutations. The differences from the observed, Zobs.{Y–N} = Zobs.Y – Zobs.N, and permuted, Zperm.{Y–N} = Zperm.Y – Zperm.N, Fisher transformed correlations are then scaled by subtracting mean, μˆZperm. = MEAN (Zperm.{Y–N}), and dividing by standard deviation, σˆZperm. = S.D. (Zperm.{Y–N}), resulting from permuted data for each gene to obtain the final observed CASI statistic, (Zobs.{Y–N} – μˆZperm.) / σˆZperm., and permuted CASI statistics, (Zperm.{Y–N} – μˆZperm.) / σˆZperm.. The rationale behind centering with the permuted sample statistic mean is that the expectation of the canonical correlation coefficient will be different in cases versus controls under the null due to the fact that the maximization is computed in cases. Centering corrects for this difference. Scaling by the permuted sample statistic standard deviation is performed because the standard deviation may depend to some extent on dimensional and distributional characteristics of the sets g and m. Scaling standardizes the statistic for these factors.
It will be assumed here that multiple tests will be conducted in an analysis, such as was performed in this study, i.e., a single test for each gene. Under these circumstances, the CASI statistics are evaluated for statistical significance by computing FDR using permutation-based estimators proposed by Millstein and Volfson [53], which yields confidence intervals that account for the number of permutations conducted as well as dependencies between tests. Significance is assessed directly using CASI statistics, thus no intermediate p values are computed.
Simulation 1 – Tests for Interaction by Examining Equivalence of Set Correlations between Cases and Controls
As the description of the CASI method makes clear, our statistic measures a difference in correlation structure between two class levels of a dichotomous phenotype, case and control. We demonstrate in silico that statistical interactions between linear combinations of variables in two sets will cause a difference in correlations between those linear combinations in cases versus controls, and further that such interactions are detectable by the CASI statistic by measuring the difference in correlation coefficients.
We simulated two groups of genomic variables with correlations simulated in blocks for cases and controls. Representing interactions as the difference in correlation matrices across cases and controls is not common but also not unprecedented, for example, Rajapakse et al. [54] and Peng et al. [55]. This idea is related to the concept described by Millstein et al. [56], that interactions may be implicated if the joint distribution of genomic features depends on disease status.
This prompts the tack we took in how we set up the simulation study. If the off-diagonal blocks of the correlation matrices corresponding to the covariance between two different groupings of genomic factors differ between cases and controls, we infer interaction. This is also the reason that as constructed for simulation, the correlations within the groupings were the same across case-control status, distributions for groups separately did not vary by disease status.
The simulation study was constructed to show that the CASI method can be used to detect such contrasts in correlation, and therefore interactions. For each disease status, two sets of multivariate normal variables were generated with correlation structures in blocks. Elements of the correlation matrix within blocks were randomly selected but identical within cases and controls. In contrast, the off-diagonal blocks that contains elements for correlations between the two feature sets differ between cases and controls. This is shown in more detail below using matrix symbols. The disease status distinguishing elements are set to 0.1 for controls and range from 0.11 to 0.6 for cases. An equal sample size was generated for cases and controls for each scenario. Statistical significance was assessed using the permutation-based FDR approach from Millstein and Volfson [53], with 100 permutations and FDR estimated over a range of significance thresholds.
Hypothesis Test
To establish the form of the null and alternative hypotheses:
H0: pcase = pcontrol
HA: pcase ≠ pcontrol
where pm (m∈ {case, control}) is the correlation structure between two sets of variables within respective phenotype class, m. This can be embodied in a correlation matrix and understood as the intersection of linear column spaces spanned by the two sets. The common space defines statistical variance explained in one set by another, estimated through decomposition of the matrix.
This test signifies that if the correlation patterns are different between cases and controls, we conclude that there is an interaction. A disease-linked interacting factor might be closely related to others. For example, an SNP near more than one genotyped marker in the region. Therefore, our method examines a set of genetic and epigenetic variables in a region jointly and can possibly offer greater statistical power than the classical method of investigating 2-way interactions in univariate logistic regression models.
Simulation Study
We attempt to simulate a simplified version of biological data. In general, we assume that groups of SNPs and CpG sites act in concert based on biological processes. We model this with a 4-block correlation matrix, a block of correlations randomly assigned to pairs from the same gene set (SNP(gene i) × SNP(gene i)), block of (CpG(gene i) × CpG(gene i)) correlations assigned likewise, and 2 blocks, one transpose of the other, (SNP(gene i) × CpG(gene i),), equi-correlated. One way of interpreting this is as a latent factor model, all the SNPs or CPGs from a gene correlated in a block are in turn highly correlated with the same unmeasured latent variables, hypothetically expressed as weighted sums, the canonical variates that underlie the canonical correlation procedure. This interpretation is not necessary to understand the SNP(gene i) × CpG(gene i), block as representing correlation structure between two sets of variables. In our simulations each block within the correlation matrix contains 25 elements (5 SNPs × 5 CpGs for example).
We simulate the SNPs and methylation variables (CpGs) for our healthy controls as jointly Gaussian (multivariate normal) with 0 mean and correlation matrix
where Gi × Gi and Mi × Mi are 5 × 5 matrices with 1s along the diagonal and ρ > 0 randomly assigned from uniform distribution for all off-diagonal entries. Gi(control) × Mi(control) and Mi(control) × Gi(control) are 5 × 5 matrices, transpose of each other with all elements, ρcontrol > 0, having the same, fixed value (equi-correlated). For our case patients we again use mean 0 variables, but change our correlation matrix to
where Gi × Gi and Mi × Mi are 5 × 5 matrices with the same element values as for controls. Gi(case) × Mi(case) and Mi(case) × Gi(case) are 5 × 5 matrices, transpose of each other with the same element values of ρcase > ρcontrol > 0 (equi-correlated).
Effect size is expressed as a measure of the “distance” between the two correlation matrices used for simulating samples. This metric is quantified as “norm” of the difference between two matrices. In particular, we use the spectral norm or 2-norm, which is the largest singular value. If we denote our two matrices as Σcaseand Σcontrol then our distance metric, effect size, is ∣∣Σcase – Σcontrol∣∣, where ∣∣ ∣∣, represents the norm.
Within each class (case and control) and for a range of effect sizes we simulated a sample of size n ∈ {50, 100, 150, 200, 250} and applied CASI method. We then estimated the FDR of this method over 1,000 trials for each. As we can see in Figure A2, despite considerable volatility for very small effect sizes, CASI performed quite well for meaningful differences in correlation. To affirm the ability of our test to detect a difference in correlation structure, for each simulation, we set elements of Gi(case) × Mi(case) and Gi(control) × Mi(control) (along with their transposes) blocks in the correlation matrices to different values. For each simulation, we generated a sample of “case” status individuals with ρcase ∈ {0.11, 0.12..., 0.40} and “control” status subjects using ρcontrol = 0.1. The respective effect sizes, distances between correlation matrices (spectral norms of difference between two matrices), are ∈ {0.05, 0.075, 0.10, ..., 1.50}.
FDR plots in Figure A1 are snapshots of the entire range of simulations. For each sample size, we garner from our FDR estimates the number of true positive tests that can be identified, hence giving us a measure of power of our test. The results show that for increasing effect and sample sizes our test is capable of detecting non-null hypotheses at near 100% power. The lower right plot of Figure A2 shows consistency for the CASI hypothesis test, the various effect sizes (vertical axis) and the corresponding sample sizes (horizontal axis) at which 100% power is achieved. The jagged power plots for samples 50, 100, 150, 200, and 250 indicate, as expected, that for larger samples we can expect to reach power of 100% at smaller effect sizes. The criteria for identifying power is the proportion of tests identified correctly as true positives at the maximum FDR ≤0.05. From the Figure A2 plots, it is clear that power increases with effect size. We showed this relationship under the assumption of multivariate normal distribution of features, but under conditions CASI will be typically be applied this property is likely to remain valid.
Simulation 2 Study – CASI as a Method for Detecting Multiplicative Interaction
In order to assess type I error rate control, statistical power, and unbiasedness properties of our method to detect multiplicative interaction, we conducted a simulation study in a logistic regression framework. In traditional power analysis, a non-null model is used to generate a sample repeatedly, a statistic is calculated each time and compared to a threshold corresponding to the designated type I error rate. Empirical statistical power is calculated as the proportion of tests that reject the null in favor of alternative hypothesis.
Another, equivalent approach more suitable to use the FDR as a way of conceptualizing type I error rate and evaluating power in hypothesis testing when conducting multiple comparisons is to generate a sample repeatedly from a mixture/amalgam of null and non-null models and calculate empirical statistical power as proportion of instances of the statistic arising from the non-null model that are captured for a given FDR level. Since our method relies on permutation for calculating a statistic we do not have a known reference null distribution to determine significance for calculating power. Rather, we estimate FDR based on the empirical distribution of observed values of our statistic from the multitude of tests by comparing them to realizations from permuted data.
This is akin to generating values from a mixture distribution of two kinds of logistic models. We decided to have 10% of models in our mixture non-null, in deference to 2007 paper by Brad Efron, “Size, Power and False Discovery Rates,” which used the same proportion in its simulation. Their version of FDR is parametric, which contrasts with ours since our method relies on permutation for calculating a statistic. We do not have a known reference null distribution to determine significance for calculating power. Rather, we estimate FDR based on the empirical distribution of observed values of our statistic from the multitude of tests by comparing them to realizations from permuted data.
When conducting simulations, we are interested in examining power for various “effect” sizes. The logistic regression is specified via the coefficients, β0, βx; the first element is the intercept (disease background prevalence) and the second consists of a vector of log OR parameters for the covariates (for example genomic variables such as SNPs). The overall outcome prevalence in the population of interest is fixed in our simulation at 0.05. Modifying any given element of βx will automatically modify the overall prevalence, unless there is a corresponding change in β0. In the simulations we adjust the value of β0 so that it minimizes the difference between the target outcome prevalence, 0.05, and prevalence induced by the model in conjunction with the assumed marginal exposure and interaction values. In effect, we are generating a population of individuals that hold a gene pair that affects disease through specified logistic model, holding prevalence invariant, and then take a sample from that population.
As stated above, we simulate a collection of genes that is a mixture of null and non-null gene pairs. Non-null gene pair is defined as one in which disease state of an individual is the result of interaction between two sets of SNPs and is generated from a logistic model as follows. Null and non-null disease logistic models are defined above.
Probability of disease in non-null gene pair model:
Null gene pair is defined as one in which disease state of individual is a result of an additive effect of two sets of SNPs.
Probability of disease in null model:
SNP data is simulated based on minor allele frequency that is randomly chosen from a uniform distribution ranging from 0.05 to 0.49.
For the simulation, 100 non-null and 900 null gene pairs were generated. After application of CASI, the distribution of non-null gene pairs in the resulting ranked set is evaluated as a means of power diagnostic. As we vary interaction OR, we observe the number of non-null genes identified as significant. The larger the proportion of non-null genes (out of a total of 100) detected for low values of FDR the higher the power. In the region bracketing the FDR level of 0.05, CASI performed well in comparison to conventional logistic regression, the approach of Rajapakse et al. [54] and Peng et al. [55] (Fig. A3).
Interpreting CASI Results via Canonical Correlation: Loadings
Having shown that our method is capable of detecting interaction, once CASI is applied to data the next step would be to interpret results in such a way that conclusion can be drawn about which components of the sets are involved in interaction. This is accomplished through canonical variates. Interpretation of the CASI canonical variates, rˆ1Y = gYaˆ1Y and lˆ1Y = mYbˆ1Y, is aided by computing loadings, correlations between canonical variates and the original variables in the sets gY and mY. The loading of each variable gives a rough gauge of how representative the variate is of the variable and thus the importance of the variable in the test results. There is no specific guidance to how large a loading is considered an important contribution, but we have chosen a threshold of 0.5 because it is regarded as the “operational” definition of a large effect size [57, 58].
CASI Analysis of Asthma BRIDGE Data
Analyses of the top loading SNP-CpG pairs for the 12 significant genes using conventional logistic regression yielded significant interactions for some, e.g., SCARNA8 and LHX6, but not all pairs (Table A1). Interestingly, there was also some suggestion of SNP main effects, e.g., in STC1 and TPRA1. The top loading pair would not necessarily yield a significant interaction in a conventional model, because the effect could more diffuse, distributed over multiple SNPs and CpG sites, implying joint interactions that are more detectable by the CASI method. A sensitivity analysis indicated that the results were robust to the distribution of participants with regard to study site. Restricting the analysis to sites that included both cases and controls and adjusting for site as a covariate in the restricted sample yielded similar results, and in some cases increased the statistical significance of the interactions (Table A2).
References
Vladimir Kogan and Joshua Millstein contributed equally to this work.