Abstract
In the past decades, studies on twins have had a great impact on dissecting the genetic and environmental contributions to human diseases and complex traits. In the era of functional genomics, the valuable samples of twins help to bridge the gap between gene activity and environmental conditions through multiple epigenetic mechanisms. This paper reviews the new developments in using twins to study disease-related epigenetic alterations, links them to lifetime environmental exposure with a focus on the discordant twin design and proposes novel data-analytical approaches with the aim of promoting a more efficient use of twins in epigenetic studies of complex human diseases.
Introduction
By definition, epigenetics deals with heritable and acquired changes in gene function that occur without a change in the DNA sequence. In a broad sense, epigenetic control over gene activity involves multiple molecular mechanisms, including the binding of small molecules to specific sites in DNA or chromatin, noncoding RNAs such as microRNAs, and the incorporation of histone variants into chromatin, all of which act as ‘volume controls’ that up- or downregulate a gene's expression without changing its DNA sequence. Figure 1 illustrates the coverage of epigenetic epidemiology and its relation to traditional epidemiology and genetic epidemiology. Among the various epigenetic mechanisms, DNA methylation is the major form of epigenetic modification. It is robust and readily measurable using high-throughput techniques [1], which enable epigenetic profiling on a genome-wide scale. Although the molecular evidence is intriguing and current techniques allow genome-wide epigenetic (DNA methylation) profiling, identifying and understanding epigenetic patterns under a given genetic predisposition and environmental exposure pose new challenges for traditional epidemiology, both in experimental design and in analytical methodology.
Like any complex trait, the epigenetic regulation of gene activity is under the control of both genetic and environmental factors. In fact, recent studies have shown that the impact of environmental factors can be mediated through the epigenome [2,3,4,5,6], a topic of active research in cancer and complex disease studies. Fueled by the rapid development of epigenomic analysis using next-generation sequencing or array-based technologies, the newly emerging field of epigenetic epidemiology serves as a bridge linking gene activity and environmental conditions. In this regard, twins are useful samples that can help to dissect the genetic and environmental components of epigenetic regulation [7]. In particular, identical or monozygotic (MZ) twins discordant for a disease or trait of interest are unique samples for associating epigenetic alterations with disease and environmental conditions.
Discordant Twin Design
Similar to genome-wide association studies, an epigenome-wide association study (EWAS) can be done using a simple case-control design [8]. Data from such studies can easily be handled by Student's t test or a simple regression analysis. It should be noted that in genome-wide association studies we associate disease status or phenotype with static genotypes (phenotype-genotype association), whereas in EWAS we correlate disease with a molecular functional phenotype (e.g. DNA methylation level), which can be under the control of a large number of both genetic and environmental factors (phenotype-phenotype association). The latter can suffer from low power because genetic and nongenetic confounding introduces uncertainty into the molecular phenotype.
With the discordant MZ twin design, the genetic part of the confounding is removed (fig. 1) because the genetic background is perfectly matched within each pair [7], enabling the identification of differential DNA methylation triggered by environmental exposure. In this design, identical twins discordant for a phenotype of interest are sampled from independent families, and intrapair genome-wide epigenetic differences are then measured using high-throughput techniques.
The design has been applied in epigenetic studies for more than 10 years [9]. As shown in table 1, a growing number of epigenetic studies have used the discordant twin design, and sample sizes have increased considerably in recent years. Meanwhile, coverage in these studies has expanded from selected genes or regions to genome-wide analysis. Early epigenetic studies adopting the discordant twin design used descriptive analysis, perhaps because of their very small sample sizes (table 1). Even with the larger samples of recent years, most studies rely on a simple paired t test for data analysis. Analyzing and interpreting the high-dimensional data has thus become a challenging issue.
Data Analysis Strategies
Although the paired t test seems to fit the design nicely, several important issues need to be considered. First, it ignores twin-pair-specific covariates such as age and sex that could affect within-pair epigenetic differences. The effect of age on within-pair epigenetic variation was illustrated by Fraga et al. [2], who found that young twins are epigenetically more similar than old twins, suggesting that epigenetic modifications accumulate over time in response to internal and external factors, including environmental exposure. We calculated within-pair correlations on genome-wide DNA methylation profiles for identical twins measured 10 years apart (longitudinal samples) and for twins at young and old ages (cross-sectional samples). Both showed a clear decline with age in the epigenetic correlation between MZ co-twins [unpubl. data]. Second, the paired t test cannot include environmental exposure variables such as smoking and drinking, which are valuable for epidemiological analysis. The univariate analysis therefore captures only the association between epigenetic variation and disease and leaves the environmental involvement in that association unassessed. Third, apart from the fixed effects of twin-pair-specific and environmental variables, there are random factors such as sample location on the array, laboratory environment and variation due to time and technicians during handling (the so-called batch effects), which need to be considered because they introduce systematic influences on experimental outcomes. Taking all of these factors into consideration, we propose a mixed-model approach specifically designed for EWAS on discordant twins with the following formula:
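One form consistent with the symbol definitions in the next paragraph can be sketched as follows. The log-ratio response is an assumption, suggested by the description of α as a mean fold change and by the test of α = 0; the error term ε and the summation indices are likewise notational assumptions.

```latex
\log\frac{Me^{(+)}}{Me^{(-)}}
  = \alpha + \sum_{i=1}^{n} \beta_i x_i + \sum_{j=1}^{m} y_j + \varepsilon
```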
Here, Me(+) and Me(-) are the DNA methylation levels measured in the affected and unaffected twin of the same pair, the βs are the slopes for the fixed-effect variables (x1 to xn) and y1 to ym are the random-effect variables, including batch effects. In this model, the intercept α is an important parameter because it represents the mean fold change in DNA methylation level between affected and unaffected twins. By testing the null hypothesis α = 0, we can test whether the disease is associated with hypermethylation (α > 0) or hypomethylation (α < 0) at the CpG site being tested. Apart from the twin-pair-specific variables, e.g. age and sex, the fixed effects reflect differential exposure to specific environments. If the β of a differential exposure variable is significantly different from zero, the corresponding exposure is associated with hypermethylation (β > 0) or hypomethylation (β < 0) at the CpG site in the affected twins, linking environmental exposure to epigenetic regulation and disease.
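To make the roles of α and β concrete, the following minimal sketch fits only the fixed-effect part of such a model at a single CpG site by ordinary least squares. It is a simplified stand-in, not the mixed model itself: the random effects (y1 to ym) are omitted, the response is assumed to be the within-pair log ratio, and only one exposure covariate is used. The function name `fit_pair_model` and the argument `exposure_diff` are hypothetical.

```python
import math


def fit_pair_model(me_affected, me_unaffected, exposure_diff):
    """OLS fit of log(Me+/Me-) = alpha + beta * x at one CpG site.

    Assumptions (not from the source): log-ratio response, a single
    within-pair exposure difference x, and no random effects.
    The covariate is centered, so alpha is the mean log fold change
    at the average exposure difference.
    """
    n = len(me_affected)
    if n < 3:
        raise ValueError("need at least 3 twin pairs")

    # Within-pair log ratio of methylation (affected vs. unaffected twin).
    d = [math.log(a / u) for a, u in zip(me_affected, me_unaffected)]

    # Center the covariate so the intercept equals the mean of d.
    x_mean = sum(exposure_diff) / n
    xc = [x - x_mean for x in exposure_diff]
    sxx = sum(v * v for v in xc)
    if sxx == 0:
        raise ValueError("exposure difference must vary across pairs")

    alpha = sum(d) / n
    beta = sum(v * w for v, w in zip(xc, d)) / sxx

    # Residual variance with n - 2 degrees of freedom.
    resid = [di - alpha - beta * v for di, v in zip(d, xc)]
    s2 = sum(r * r for r in resid) / (n - 2)

    # t statistics for H0: alpha = 0 and H0: beta = 0.
    t_alpha = alpha / math.sqrt(s2 / n)
    t_beta = beta / math.sqrt(s2 / sxx)
    return {"alpha": alpha, "beta": beta,
            "t_alpha": t_alpha, "t_beta": t_beta}


if __name__ == "__main__":
    # Illustrative, made-up values for six discordant pairs.
    affected = [0.52, 0.55, 0.53, 0.58, 0.57, 0.61]
    unaffected = [0.50, 0.50, 0.49, 0.51, 0.50, 0.52]
    smoking_diff = [0, 1, 2, 3, 4, 5]  # hypothetical exposure difference
    print(fit_pair_model(affected, unaffected, smoking_diff))
```

Without any covariate, this reduces to a paired t test on the log ratios. A full analysis would add the batch-related random effects using mixed-model software and repeat the fit across all CpG sites with correction for multiple testing.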
Twin Modeling in Epigenetic Epidemiology
The classical twin model has made valuable contributions to genetic epidemiology by enabling genetic and environmental contributions to human diseases to be explored. By treating DNA methylation level as a continuous phenotype, the twin model can also be applied in epigenetic epidemiology. For example, Bell et al. [10] recently estimated the heritability of the DNA methylation phenotype using twin methods and reported a genome-wide heritability estimate of 18%. When applied to each region or CpG site, the twin model estimates both genetic and environmental contributions to the epigenetic variation at that location. Here, the environmental part can be further divided into the common environment shared by the twins of a pair (the rearing environment) and the nonshared environment unique to each twin. The former is of particular interest because it links the early-life environment with later-life epigenetic status. Preliminary results from our twin modeling of genome-wide DNA methylation data show that while some genes are strongly under genetic control, the activities of many genes are predominantly regulated by unique environmental factors [unpubl. data]. Further bioinformatic analysis will elucidate the biological and functional profiles of these genes.
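As a rough illustration of how such a decomposition works at a single CpG site, the sketch below uses Falconer's formulas, a simple approximation to the classical twin model and not necessarily the method behind the estimates cited above. It assumes that dizygotic (DZ) pairs are available alongside MZ pairs. The function names and inputs are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def falconer_ace(mz_pairs, dz_pairs):
    """Approximate variance components from twin-pair correlations.

    mz_pairs, dz_pairs: lists of (twin1, twin2) methylation levels.
    Returns the additive genetic (a2), common environmental (c2) and
    unique environmental (e2) shares, which sum to 1 by construction.
    Estimates are not constrained and can fall outside [0, 1] in
    small samples.
    """
    r_mz = pearson([a for a, _ in mz_pairs], [b for _, b in mz_pairs])
    r_dz = pearson([a for a, _ in dz_pairs], [b for _, b in dz_pairs])
    a2 = 2 * (r_mz - r_dz)   # heritability
    c2 = 2 * r_dz - r_mz     # shared (rearing) environment
    e2 = 1 - r_mz            # unique environment plus measurement error
    return {"r_mz": r_mz, "r_dz": r_dz, "a2": a2, "c2": c2, "e2": e2}


if __name__ == "__main__":
    # Illustrative, made-up methylation levels.
    mz = [(0.10, 0.12), (0.20, 0.19), (0.30, 0.33), (0.40, 0.38)]
    dz = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.40), (0.40, 0.30)]
    print(falconer_ace(mz, dz))
```

Pearson correlation is used here for brevity; because the ordering of twins within a pair is arbitrary, an intraclass correlation is commonly preferred, and a full analysis would fit the variance components by structural equation modeling.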
Conclusions
Unlike traditional epidemiology, which associates environmental exposure with disease, epigenetic epidemiology links environmental exposure to epigenetic regulation and to disease, thereby providing a molecular basis for disease etiology triggered by environmental cues. Although modern technologies already allow high-throughput profiling of epigenetic patterns at the genome-wide level, our understanding of genetic and environmental influences on epigenetic processes remains limited. With proper designs and analytical approaches, studies on twins can help us to identify epigenetic marks and link them with environmental exposure. Twin samples can therefore be expected to make new contributions to revealing and understanding the genetic and epigenetic mechanisms of human diseases.
Acknowledgements
This work was jointly supported by a Novo Nordisk Foundation 2010 research grant and a research grant from Region of Southern Denmark 2010 awarded to Dr. Qihua Tan.