Abstract
Copy number variation is a defining characteristic of human subtelomeres. Human subtelomeric segmental duplication regions (‘Subtelomeric Repeats’) comprise about 25% of the most distal 500 kb and 80% of the most distal 100 kb in human DNA. Huge allelic disparities seen in subtelomeric DNA sequence content and organization are postulated to have an impact on the dosage of transcripts embedded within the duplicated sequences, on the transcription of genes in adjacent single copy DNA regions, and on the chromatin structures mediating telomere functions including chromosome stability. In addition to the complex duplicon substructure and huge allelic variations in extended subtelomere regions, both copy number variation and alternative sequence organizations for DNA characterize the sequences immediately adjacent to terminal (TTAGGG)n tracts (‘subterminal DNA’). The structural variation in subterminal DNA is likely to have important consequences for expression of subterminal transcripts such as a newly-discovered gene family encoding actin-interacting proteins and a non-coding telomeric repeat containing RNA (TERRA) transcript family critical for telomere integrity. Major immediate challenges include discovering the full extent and nature of subtelomeric structural and copy number variation in humans, and developing methods for tracking individual allelic variants in the context of total genomic DNA.
Terminal (TTAGGG)n tracts and immediate subterminal sequences
The DNA at each human chromosome terminus is a simple repeat sequence tract (TTAGGG)n, typically 5 kb to 15 kb in length in somatic cells (Moyzis et al., 1988). The average lengths of terminal repeat tracts in humans are dynamically modulated by a balance of exonuclease- and replication-associated attrition and telomerase- and/or recombination-mediated lengthening (Harley et al., 1990; Reddel, 2003; Blackburn, 2005). The average bulk telomere length is believed to reflect a partly inherited telomere length trait (Slagboom et al., 1994), as well as both the replicative history of the cell lineage and environmental factors such as exposure to stress and oxidative damage (Wright and Shay, 2002; Aviv et al., 2003; Epel et al., 2004). A single-stranded, G-rich overhang produced as a result of the RNA priming requirement for DNA replication or by post-replicative processing steps (Sfeir et al., 2005) is involved in the formation of a t-loop secondary structure (Griffith et al., 1999) believed to be required for proper telomere function. Disruption of t-loop or associated telomeric chromatin structures by interference with Shelterin protein components required for their formation and maintenance (Karlseder et al., 2002), by loss of the G-overhang itself, or by a critically short tract of (TTAGGG)n can each lead to telomere dysfunction (de Lange, 2005). Depending on the cellular context (especially the presence or absence of a functional p53 pathway), such telomere loss or uncapping can trigger cellular senescence, apoptosis, and/or genome instability.
The replicative capacity of cells is limited by telomere length, implicating short telomere length in age-related diseases (Harley et al., 1990; Lee et al., 1998). Supporting this notion, heterozygous germline mutations in the telomerase protein component (hTert; encoded by TERT) or RNA component (hTR; encoded by TERC) are associated with abnormally short telomeres in humans and can lead to autosomal dominant forms of the diseases dyskeratosis congenita (characterized by aplastic anemia and bone marrow failure, as well as predisposition to cancers of the skin, hematopoietic system, and oral mucosa) and idiopathic pulmonary fibrosis (leading to respiratory failure) (Vulliamy et al., 2001; Armanios et al., 2007). Because anticipation occurs in autosomal dominant dyskeratosis congenita (more severe and earlier occurring phenotypes are seen in successive generations), the short telomeres must cause the disease rather than the telomerase mutation itself (Armanios et al., 2005). Thus, short telomeres limit tissue renewal capacity in somatic tissues, even in the presence of telomerase (Hiyama and Hiyama, 2007), and can cause both congenital and age-related disease in humans.
The lengths of (TTAGGG)n tracts have been shown to vary from telomere to telomere within individual cells (Lansdorp et al., 1996; Zijlmans et al., 1997; Graakjaer et al., 2004) and between alleles at the same telomere (Baird et al., 2003; der-Sarkissian et al., 2004). These individual-specific patterns of relative telomere-specific (TTAGGG)n tract lengths are regulated in part by cis-acting factors (Graakjaer et al., 2003, 2006; Britt-Compton et al., 2006), and these patterns appear to be defined in the zygote and maintained throughout life (Graakjaer et al., 2004). Since the immediate effects of (TTAGGG)n tract loss on cell viability and chromosome stability are attributable to the shortest telomere(s) in a cell, rather than to average telomere length (Hemann et al., 2001; der-Sarkissian et al., 2004; Meier et al., 2007), individual-specific patterns of allele-specific (TTAGGG)n tract lengths may be crucial for the biological functions of telomeres and the effects of telomere attrition and dysfunction associated with aging, cancer, stress and coronary artery disease (Wright and Shay, 2002; Aviv et al., 2003; Epel et al., 2004).
Cis-acting mediators of allele-specific (TTAGGG)n tract length are likely to depend upon subtelomeric sequences immediately adjacent to terminal (TTAGGG)n tracts. These sequences have been determined for the reference sequences of some chromosome arms (Riethman et al., 2004), and a model of human terminal and subterminal DNA based upon these completed reference sequences is shown in Fig. 1. All subterminal human DNA fragments cloned in bacteria (fosmids, cosmids) and yeast (half-YACs) have lost the terminal portion of the (TTAGGG)n tract and retain 300 to 800 bp of the centromere-adjacent region of the simple repeat tract; the cloning-associated deletions appear to be specific for the distal (TTAGGG)n regions and do not involve rearrangement of subterminal sequences (Riethman, unpublished). Most of the known human (TTAGGG)n-adjacent sequences are related in some fashion to the telomere-associated repeat (TAR) sequences originally described by Brown and co-workers (1990). Parts of the canonical TAR1 sequence are present within 2 kb of the beginning of nearly all sequenced (TTAGGG)n-adjacent DNA, although the part and degree of similarity can vary substantially. TAR1 similarity can also be found adjacent to many of the internal (TTAGGG)n-like sequences in subtelomeric repeat (Srpt) regions, and more distantly related copies of TAR1-like sequence are often found in pericentromere regions (Ambrosini et al., 2007). Interestingly, one known case of a naturally-occurring segment of (TTAGGG)n-adjacent DNA lacking TAR1 similarity is an allele of 17p shown by single telomere length analysis (STELA) to be shorter than most other telomeres (Britt-Compton et al., 2006). CNVs in these duplicated subterminal sequences are known to exist (Riethman, 2008) and are likely to affect the allele-specific cis-regulation of (TTAGGG)n tracts.
Model of terminal and subterminal human DNA. The single-stranded, G-rich 3′ overhang is shown in red; black represents double-stranded (TTAGGG)n repeat DNA (5′-(TTAGGG)n-3′ towards the telomere end, 5′-(CCCTAA)n-3′ strand toward the centromere). The lighter gray blocks interspersed with black represent the most centromeric part of the terminal repeat tract, where blocks of variant repeats related to (TTAGGG)n are usually present and interspersed with canonical (TTAGGG)n repeats in allele-specific patterns that are in linkage disequilibrium with the 2–4 kb of adjacent subterminal DNA. The pink region represents variably-sized (0–2 kb) sequence segments bearing similarity to the TAR1 repeat family (Brown et al., 1990), and the blue region represents subterminal DNA which can be either one-copy (six telomeres) or fall into one of the six families of subterminal repeat homology we have recently characterized (Ambrosini et al., 2007). The minimum length of the human terminal repeat tract has been estimated at 2 kb using bulk telomere measurements in senescing cells (Levy et al., 1992), but the higher-resolution single telomere length analysis (STELA) found a wide variability of telomere lengths at individual chromosome ends in senescing fibroblasts, with a minimum modal terminal tract length of 300 bp found on an allele of 17p in senescent cell populations (Britt-Compton et al., 2006). Human germline terminal tract lengths display a wide individual-specific variation range (from <9 kb to greater than 17 kb), with a substantial frequency of outliers with very short telomeres found in the sperm DNA samples measured (Baird et al., 2006).
Non-canonical terminal repeats found in the centromeric part of the tract include (TGAGGG)n, (TCAGGG)n and (TTGGGG)n (Baird et al., 1995, 2000; Coleman et al., 1999) as well as many others (e.g., (TTCGGG)n, (TTAGGGG)n, and (TCGGGG)n; Xu and Blackburn, 2007; Riethman, unpublished), and can extend telomerically for at least 1 kb into the tract (Baird et al., 1995). The green arrows represent the position and orientation of subterminal transcripts (see Table 1) including telomeric repeat-containing RNA (TERRA) molecules, for which both the 5′ and 3′ end positions remain undefined (Azzalin et al., 2007).
Subterminal transcripts
Telomeric DNA is transcribed. Indeed, a recently-discovered noncoding subterminal transcript that includes sequences from the terminal repeat itself may prove to be central to understanding telomere biology. The telomeric repeat-containing RNA (TERRA) molecules found in humans range in size from about 100 bases to at least 9 kb, are transcribed almost exclusively from the C-strand of subterminal DNA from multiple (perhaps all) telomeres, and associate with telomeric chromatin (Azzalin et al., 2007). The association of TERRA molecules with telomeric chromatin is regulated by protein effectors of the nonsense-mediated RNA decay pathway, which include the human homologs of the yeast Est1p protein; depletion of these effectors increased association of TERRA molecules with telomeric chromatin and caused loss of telomere DNA. A similar telomeric RNA molecule found in mouse was shown to be transcribed by RNA polymerase II, polyadenylated, and under developmental regulation (Schoeftner and Blasco, 2008). Evidence has also been presented suggesting that high levels of the mouse telomeric repeat-containing RNA could inhibit telomerase activity (Schoeftner and Blasco, 2008). Noncoding RNAs have been shown to play roles in heterochromatin formation (Sugiyama et al., 2007) and X-inactivation (Wutz and Gribnau, 2007). Similarly, TERRA molecules may be involved in the modulation of telomeric chromatin structure and telomeric heterochromatin formation.
Several additional transcript families are embedded within or extend into subterminal DNA (Fig. 1, Table 1). Although many members of these transcript families appear to be non-coding pseudogenes, some contain open reading frames and encode protein. One particularly interesting subterminal transcript family (originally named cXYorf1-related; Gianfrancesco et al., 2001) is associated with a particular subterminal duplicon block (Ambrosini et al., 2007); the 3′ ends of this transcript family lie within 4–6 kb of the start of the (TTAGGG)n tract (Fig. 1, Table 1). Very recently, this cXYorf1-related transcript family has been shown to correspond to the 3′ ends of genes from a new subclass of the Wiskott-Aldrich Syndrome Protein (WASP) family, termed WASH genes (Linardopoulou et al., 2007).
The WASH protein family is ancient and conserved. It is single-copy in most lower organisms and in early primates, and was shown to be essential for survival in Drosophila (Linardopoulou et al., 2007). However, it is multicopy in gorillas, chimpanzees, and humans, with the highest copy numbers in humans. WASH proteins appear to be involved in cytoskeletal organization and signal transduction, and can function in the regulation of actin filament polymerization. The human WASH gene family structures predict short forms and long forms of the protein; the 3′ exons encoding the short form are embedded in subterminal duplicon block C (Ambrosini et al., 2007), whereas the full-length forms start near a CpG island roughly 20 kb from telomeres within a polymorphically distributed segmental duplication (Martin-Gallardo et al., 1995). The WASH family varies widely in both dosage and telomere distribution in individual genomes, and usually terminates less than 5 kb from the start of the terminal (TTAGGG)n tract (Table 1). Individual telomeric transcription sites for this family might be differentially susceptible to position effects depending upon both local telomeric chromatin/heterochromatin status and chromosome-specific telomere lengths. Similarly, CNV either disrupting the gene structure or affecting WASH dosage could readily affect the number of alleles encoding functional protein within a genome. The subtelomeric/subterminal genomic location, as a hotspot of DNA breakage and repair (Linardopoulou et al., 2005; Rudd et al., 2007), has likely contributed to the production of many copies and combinations of the short and long forms of WASH gene family members in very recent primate and human evolution.
It is intriguing to speculate on potential roles this gene family may have played in the evolution of primate and human-specific phenotypes, as well as potential roles in the somatic evolution of cancer genomes, where telomere loss or dysfunction is believed to play an initiating role (Meeker et al., 2004; Baird et al., 2006; Capper et al., 2007).
A second major subterminal transcript family consists of members that bear highest similarity to a ribosomal protein pseudogene, RPL23AP7. As with the WASH transcript family, transcripts with both interrupted and open reading frames co-exist within the genome, and the 3′ ends of most transcripts extend into subterminal duplicon blocks (in this case, related blocks A, B, and D; Ambrosini et al., 2007). No information is yet available as to whether proteins corresponding to the open reading frames from this transcript family are made. Curiously, transcription of both the RPL23AP7 and WASH transcript families is in the same orientation and their 3′ end positions coincide with the region where TERRA RNAs are expected to initiate, raising the question of whether these abundant transcript families are regulated in a coordinate fashion.
A collection of apparently less abundant transcripts is transcribed in the opposite orientation to the TERRA, WASH, and RPL23AP7-like families; these are also listed in Table 1. There seems to be an overrepresentation of testis transcripts in this group of RNAs, the 5′ ends of which originate within 1–4 kb of the start of the (TTAGGG)n tract. Little else is known regarding these RNAs.
Telomere epigenetics
Mouse studies have shown that mammalian telomere length and stability can be regulated by epigenetic factors (Blasco, 2007); heterochromatic epigenetic marks are present near mouse telomeres, a loss of some of those marks occurs upon reduction of telomere (TTAGGG)n tract lengths to a critically short length in the absence of telomerase, and de-regulation of telomere tract length and stability is evident in mouse knockouts affecting enzymes responsible for normally maintaining the heterochromatic marks. Loss of heterochromatic marks at mouse telomeres was found to be associated with recombination-associated instability.
Human subterminal DNA is hypomethylated in the germline, methylated de novo in somatic tissues, and may contain non-canonical DNA modifications (de Lange et al., 1990; Steinert et al., 2004). Hypomethylation of subterminal DNA in the human germline (and suppression of meiotic recombination in the most distal 2–4 kb of telomere-adjacent DNA) are characteristics of satellite heterochromatin, but the active transcription of genes within a few kb of some human (TTAGGG)n tracts and the small telomere-adjacent LD region suggest a very small telomeric heterochromatic region and perhaps a precisely regulated heterochromatic-euchromatic boundary region in human subterminal DNA. Human subterminal DNA sequences typically contain clusters of very CpG-rich sequences which are arranged in variably sized and organized stretches among separate telomeres (Riethman, 2008); these sequences may be among those differentially methylated in germline versus differentiated cells, and their methylation patterns may reflect the telomeric heterochromatin-euchromatin transition for specific telomere alleles. Epigenetic modification of subterminal sequences could play an important role in regulating telomere replication and in recombination-associated phenomena such as elevated levels of telomere-associated sister-chromatid exchange (Rudd et al., 2007). Indeed, one could speculate that allele-specific (TTAGGG)n tract length regulation might be mediated by differential positioning of the euchromatin/heterochromatin boundary within subterminal sequences.
Extended human subtelomeric DNA regions
Human subtelomeric segmental duplications (‘subtelomeric repeats’) comprise about 25% of the most distal 500 kb and 80% of the most distal 100 kb in human DNA (International Human Genome Sequencing Consortium, 2004; Riethman et al., 2004). The overall size, sequence content, and organization of subtelomeric segmental duplications relative to the terminal (TTAGGG)n repeat tracts and to subtelomeric single-copy DNA differ for each subtelomere. Subtelomeric regions of human chromosomes have long been known to contain mosaic patchworks of duplicons (e.g., see der-Sarkissian et al., 2002; Mefford and Trask, 2002; Riethman et al., 2004). Recent genome-wide analyses of these regions have revealed that this sequence organization appears to arise from translocations involving the tips of chromosomes, followed by transmission of unbalanced chromosomal complements to offspring (Linardopoulou et al., 2005). A recent analysis of sister chromatid exchanges (SCEs) revealed highly elevated rates near telomeres: 1,600× greater in the most distal 10 kb (including subterminal sequences and the telomere repeat itself) and 160× greater in the adjacent 100-kb subtelomere regions (Rudd et al., 2007). Together these studies indicate that subtelomeres are hotspots of DNA breakage and repair; these features are likely to be responsible for the generation of the complex interchromosomal duplication patterns and the rapid evolution of these genomic regions, as well as the prevalence of large CNVs near telomeres.
A detailed analysis of subtelomeric duplicons in the human reference sequence was completed recently (Ambrosini et al., 2007). Duplicon blocks were defined by sequence similarity between segments of subtelomeric DNA from single telomeres and the assembled human genome. Each duplicon module was defined by a set of pairwise alignments with the query subtelomere sequence, and a % nucleotide sequence identity for the non-masked parts of each chained pairwise alignment was derived from the BLAST alignments. This analysis is summarized for a sampling of telomeres in Fig. 2; the full set of modules for all telomeres, including the coordinates of their genomic alignments, is presented on our web site (http://www.wistar.org/Riethman).
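As a rough illustration of how a percent identity might be derived for one chained pairwise alignment, the sketch below pools identical bases and non-masked aligned columns over all blocks of a chain. The AlignmentBlock type, its field names, and the example numbers are hypothetical and are not taken from Ambrosini et al. (2007), which should be consulted for the actual procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignmentBlock:
    """One pairwise alignment block within a chain (hypothetical structure)."""
    aligned_unmasked: int  # aligned columns lying outside repeat-masked sequence
    identical: int         # how many of those columns carry identical bases


def chain_percent_identity(blocks: List[AlignmentBlock]) -> float:
    """Percent identity over the non-masked aligned columns of a whole chain."""
    total = sum(b.aligned_unmasked for b in blocks)
    if total == 0:
        raise ValueError("chain has no non-masked aligned columns")
    matches = sum(b.identical for b in blocks)
    return 100.0 * matches / total


# Illustrative values only: two blocks chained into one duplicon alignment.
chain = [AlignmentBlock(1000, 980), AlignmentBlock(500, 470)]
identity = chain_percent_identity(chain)
```

Pooling columns across blocks, rather than averaging per-block percentages, keeps short blocks from being over-weighted; this is one design choice among several that would be reasonable.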
Duplicon analysis for example subtelomeres. The most distal regions of the indicated reference sequences are shown oriented with the telomeric end on the left. Duplicated genomic segments are identified by chromosome (color) and whether they are subtelomeric (bounded rectangles), non-telomeric (unbounded rectangles), or intra-chromosomal (located above the subtelomere coordinates). The arrows on the coordinate line represent the positions of (TTAGGG)n and (TTAGGG)n-like internal sequences in each subtelomere. Each rectangle represents a separate duplicon, and each of the pairwise alignments generated by the duplicon definitions was tabulated and analyzed. The line segments ending in arrows above each subtelomere indicate the positions of named transcript families found in these subtelomere regions. Our web site (http://www.wistar.upenn.edu/ Riethman) contains analogous figures and underlying sequence alignments for the duplicon organization of each subtelomere region.
Several key features of duplicon organization were revealed by this analysis (Ambrosini et al., 2007). First, the internal (TTAGGG)n sequences are usually oriented towards the telomere and almost always co-localize to duplicon boundaries. In yeast, interstitial telomere-like sequence islands in subtelomeric DNA are capable of playing important roles in subtelomeric recombination and transcription, telomere maintenance via the recombination-based ALT (Alternative Lengthening of Telomeres) pathway, and telomere healing (Stavenhagen and Zakian, 1994; Lundblad, 2002). Several studies have suggested similar roles for internal human (TTAGGG)n-like sequence islands (e.g., Mondello et al., 2000; Azzalin et al., 2001; Ruiz-Herrera et al., 2002), including a hypothesized role as a boundary element for subtelomeric DNA compartments (Flint et al., 1997). (TTAGGG)n-like islands are enriched about 25-fold in human subtelomeric regions (Riethman et al., 2004). In addition, they tend to be both longer and more similar to perfect (TTAGGG)n tracts in subtelomeric DNA compared to elsewhere in the genome. From an evolutionary perspective, this might suggest that most subtelomeric interstitial (TTAGGG)n tracts have arisen more recently than those found elsewhere in the genome, have originated via a separate mechanism than (TTAGGG)n islands found elsewhere (e.g., see Azzalin et al., 2001), or are under some selective pressure to maintain similarity to (TTAGGG)n and a minimum tract length (Flint et al., 1997). Given the known functions of internal telomere sequences in yeast and the demonstrated ability of TRF1, TRF2, and hRAP1 (encoded by TERF1, TERF2, and TERF2IP, respectively) to interact with very short TTAGGGTT motifs (Deng et al., 2003; Zhou et al., 2005), it is important to investigate further the abundance, allelic variation and potential functions of internal (TTAGGG)n sequence islands in human subtelomere regions. 
It is not known whether TERRA-like molecules are transcribed near internal (TTAGGG)n-like sequences.
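A minimal sketch of how internal (TTAGGG)n islands could be located in a subtelomere sequence is given below. It is a simplification: only perfect tandem hexamer repeats are reported, whereas the (TTAGGG)n-like islands discussed above also include degenerate copies. The function name and the min_copies threshold are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_telomere_islands(seq: str, min_copies: int = 2) -> List[Tuple[int, int, str, int]]:
    """Return (start, end, strand, copies) for perfect tandem hexamer runs.

    '+' marks runs of TTAGGG and '-' marks runs of CCCTAA in the given sequence;
    coordinates are 0-based and half-open.
    """
    seq = seq.upper()
    islands = []
    for motif, strand in (("TTAGGG", "+"), ("CCCTAA", "-")):
        # Non-capturing group repeated at least min_copies times, back to back.
        pattern = re.compile("(?:%s){%d,}" % (motif, min_copies))
        for m in pattern.finditer(seq):
            copies = (m.end() - m.start()) // len(motif)
            islands.append((m.start(), m.end(), strand, copies))
    return sorted(islands)


# Toy sequence: a 3-copy TTAGGG run and a 2-copy CCCTAA run in unrelated context.
toy = "ACGT" + "TTAGGG" * 3 + "GATTACA" + "CCCTAA" * 2 + "TT"
islands = find_telomere_islands(toy)
```

Recording the strand of each run allows the orientation of an island relative to the telomere to be checked against the duplicon boundaries described in the text.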
Second, the depth of coverage in duplicons for a given subtelomere region (corresponding to duplicon copy number for specific subtelomere coordinates) varies widely across subtelomere regions (Ambrosini et al., 2007). For example, large segments of the 18p subtelomere are duplicated two to six times, whereas other segments are present in much higher copy numbers (20 to 25). Some blocks of duplicated DNA are localized exclusively to subtelomeres (e.g., coordinates 0–30 kb of the 17p subtelomere sequence; Fig. 2), whereas the majority of the duplicated subtelomere DNA contains non-subtelomere copies as well (e.g., coordinates 0–45 kb of the 18p subtelomere; Fig. 2).
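The notion of depth of coverage can be made concrete with a short sketch: given the query-coordinate intervals of the duplicon alignments for one subtelomere, a sweep over interval endpoints yields the number of overlapping duplicons for each stretch of sequence. The function name and the interval values below are hypothetical and do not correspond to any real subtelomere.

```python
from typing import Dict, List, Tuple


def coverage_depth(intervals: List[Tuple[int, int]], length: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, depth) segments covering [0, length).

    Intervals are half-open (start, end) in subtelomere coordinates.
    """
    events: Dict[int, int] = {}
    for start, end in intervals:
        start, end = max(0, start), min(length, end)
        if start >= end:
            continue  # interval lies outside the region of interest
        events[start] = events.get(start, 0) + 1  # one more duplicon begins
        events[end] = events.get(end, 0) - 1      # one duplicon ends

    segments: List[Tuple[int, int, int]] = []
    depth, prev = 0, 0
    for pos in sorted(events):
        if pos > prev:
            if segments and segments[-1][2] == depth:
                # Merge with the previous segment when depth is unchanged.
                segments[-1] = (segments[-1][0], pos, depth)
            else:
                segments.append((prev, pos, depth))
        depth += events[pos]
        prev = pos
    if prev < length:
        segments.append((prev, length, depth))
    return segments


# Hypothetical duplicon alignments (kb) on a 50-kb subtelomere.
depth_profile = coverage_depth([(0, 30), (10, 45), (10, 20)], 50)
```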
Third, duplicons defined in this manner that occupy subtelomeric sequences are generally both larger and more abundant than those occurring elsewhere in the genome, consistent with the notion that subtelomeric location in humans is permissive for and/or somehow promotes large duplication events (Ambrosini et al., 2007). Although smaller and fewer, non-subtelomeric copies of duplicons tended to cluster at relatively few pericentric and interstitial loci; these include 2q13→q14 (at the site where ancestral primate telomeres fused to form modern human chromosome 2), 1q42.11→q42.12, 1q42.13, 1q43→q44, 3p12.3, 3q29, 4q26, 7p13, 9q12→q13, and Yq11.23. These sites had also been documented previously in genome-wide analyses of segmental duplications (Bailey et al., 2002) and represent sites that were apparently susceptible to either donation or acceptance of these duplicated chromosome segments in recent evolutionary time.
Subtelomeric transcripts
In addition to the WASH (Linardopoulou et al., 2007) and the RPL23AP7-like gene families described for subterminal DNA sequences (Table 1) and the TERRA RNAs that are transcribed from the regions spanning the subterminal DNA/(TTAGGG)n tract region (Azzalin et al., 2007), many other subtelomeric gene families have been identified that are embedded in the duplicated DNA encompassing more extended subtelomere regions. These include the immunoglobulin heavy chain genes (found at 14q) (Cook et al., 1994) and olfactory receptor genes (1p, 6p, 8p, 11p, 15q, 19p, and 3q) (Trask et al., 1998). Figure 2 shows the positions of named transcript families on each of the selected subtelomeres. In most cases, pseudogene copies of the respective gene family co-exist amongst the duplicons. Thus, not all duplicon copies contributing to subtelomere CNVs are equivalent, and most current methods for detecting and quantitating CNVs are incapable of distinguishing duplicons carrying functional genes from those carrying nonfunctional genes. This remains a key analytical challenge for CNVs near telomeres. In addition, it is important to note that only a single reference sequence has been sampled so far, and given the abundant large-scale variation in these regions there are certain to be many additional members of most of these gene families yet to be discovered in the human population. A great deal of work clearly lies ahead in terms of deciphering the functions, evolution and population dynamics of individual members of subtelomeric gene families.
Subtelomeric structural and copy number variation and its functional consequences
Large variant regions of subtelomeric DNA detected at many human telomeres were among the first examples of structural variation and Copy Number Variation (CNV) (Wilkie et al., 1991; Ijdo et al., 1992; Cook et al., 1994; Macina et al., 1994, 1995; Martin-Gallardo et al., 1995; Monfouilloux et al., 1998; Trask et al., 1998; van Overveld et al., 2000), which were subsequently found genome-wide (Iafrate et al., 2004; Sebat et al., 2004). The ‘small’ subtelomere alleles sometimes differ from the ‘large’ alleles by hundreds of kilobases of additional subtelomeric DNA, leading to huge allelic disparities in subtelomeric DNA sequence content and consequent separation of single-copy genes from terminal (TTAGGG)n tracts. These large allelic disparities are postulated to have an impact both on the dosage (and hence expression levels) of transcripts embedded within the duplicated sequences and on the transcription of genes in adjacent single-copy regions. Although these polymorphisms are often due mainly to differential Srpt DNA content and organization on particular variant alleles, there are also cases of deletion/insertion polymorphisms of single-copy DNA adjacent to the Srpt regions. Figure 3 illustrates several types of polymorphic subtelomere structures characterized to date. Any of these large allelic disparities might be expected to significantly affect subtelomeric transcript identity or abundance, whether via copy number changes, by alteration of transcriptional regulation, or by rendering particular transcripts nonfunctional by insertion, deletion, or truncation.
Models of large-scale variation leading to length and copy number variations (CNVs) seen at individual human telomeres. The models are based on published mapping studies as well as our unpublished results. In A the alleles vary by addition/deletion of a large Srpt segment at an existing telomere, with the breakage/joining site correlating with an internal (TTAGGG)n-like island on the long allele. Variants at 16p, 11p, 8p, 1p, and 20q (Wilkie et al., 1991; Riethman et al., 2004; Riethman, unpublished) appear to have this structure. In B the alleles vary in both Srpt content and Srpt organization, without a simple relationship to each other in the variant region. Some of the 6p and 8p variants appear to follow this model, as does a 19p variant (Riethman et al., 2004; Riethman, unpublished). In C the two alleles diverge before the Srpt region begins, leading to a large insertion/deletion polymorphism in distal one-copy DNA. Evidence for this sort of variation exists at the 2q telomere (Macina et al., 1994; also see Ravnan et al., 2006, for similar examples of subtelomeric CNVs). In D the two alleles are similar except for an insertion/deletion polymorphism contained entirely within a single class of duplicated DNA. Evidence for this exists near the 14q telomere where duplicated IgG VH genes are located (Cook et al., 1994; Redon et al., 2006). A specialized version of the model in D is supported by studies of the 4q and the 10q subtelomere, where alleles vary by insertion/deletion polymorphism of tandemly repeated D4Z4 DNA tracts located within the Srpt region (van Overveld et al., 2000). Combinations of these five classes of alternative subtelomeric sequence organization are likely, as are additional types not yet characterized.
The existence of large-scale subtelomere polymorphisms detected in these earlier studies was recently confirmed and extended in the much larger and more comprehensive study of Redon et al. (2006). This group characterized the genomic DNA of 269 individuals (already genotyped for SNPs as part of the HapMap project) for copy number variation using two genome-scale array CGH platforms: the 500K Affymetrix SNP chip and a Whole Genome Tiling Path (WGTP) array of BACs. Copy number variation regions were detected at almost every subtelomere, and in most cases these CNV regions were adjacent to and/or co-extensive with the previously identified segmental duplication regions. Given our current understanding of the large and complex segmental duplication regions near telomeres, including duplicons occupying multiple telomeres as well as internal loci, this result is not surprising. The resolution of these and most current CNV detection platforms is generally >50 kb, and most subtelomere regions are under-represented on these platforms, so the level of detailed subtelomere CNV information one can obtain from currently available CNV databases is extremely limited. Fosmid end sequence mapping strategies (Tuzun et al., 2005) applied to subtelomere regions will provide a much more powerful method for exploring and characterizing this variation, and will yield cloned fragments corresponding to the sites of variation that can be analyzed at high resolution. For subtelomere alleles with large and complex duplications, existing and planned BAC libraries (Eichler et al., 2007) will be needed to fully characterize these variants; however, the fosmid libraries can be used to define and characterize the boundaries of the large variant regions.
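The logic behind fosmid end sequence mapping can be illustrated with a short sketch. It assumes that fosmid inserts have a narrow size distribution (40 kb is used here purely as an illustrative value), so that a mate pair whose mapped span on the reference is much larger or smaller than expected suggests a deletion or insertion in the source genome relative to the reference. The record layout, the function name `classify_pair`, and the thresholds are hypothetical and are not taken from Tuzun et al. (2005); the orientation check is also simplified to "same strand is discordant".

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Mapped positions of the two end reads of one fosmid clone (hypothetical layout)."""
    clone_id: str
    chrom_a: str
    pos_a: int
    strand_a: str
    chrom_b: str
    pos_b: int
    strand_b: str


def classify_pair(pair, expected=40_000, tolerance=8_000):
    """Classify a mate pair by comparing its mapped span to the expected insert size.

    `expected` and `tolerance` are illustrative assumptions, not published values.
    """
    if pair.chrom_a != pair.chrom_b:
        # Ends map to different chromosomes; in subtelomeres this may also
        # reflect duplicons shared among several chromosome ends.
        return "interchromosomal"
    if pair.strand_a == pair.strand_b:
        # Simplified check: properly paired ends map to opposite strands.
        return "orientation_discordant"
    span = abs(pair.pos_b - pair.pos_a)
    if span > expected + tolerance:
        # Ends lie too far apart on the reference: the sample lacks sequence.
        return "deletion_in_sample"
    if span < expected - tolerance:
        # Ends lie too close on the reference: the sample carries extra sequence.
        return "insertion_in_sample"
    return "concordant"


if __name__ == "__main__":
    example = MatePair("f2", "chr2", 1_000, "+", "chr2", 101_000, "-")
    print(example.clone_id, classify_pair(example))
```

In practice several independent clones supporting the same discordant site would be required before calling a variant, and pairs falling within highly similar duplicons would need additional care because their placement on the reference is ambiguous.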
Structural variation is abundant in subterminal DNA, and is likely to have important consequences for expression of subterminal genes and TERRA RNAs as well as for regulation of (TTAGGG)n tract lengths, especially individual-specific patterns of allele-specific (TTAGGG)n tract lengths that may be crucial for the biological functions of telomeres. Both copy number variation and alternative sequence organizations for DNA immediately adjacent to terminal (TTAGGG)n tracts are major components of this structural variation.
One important near-term goal is to better understand the extent and nature of subterminal structural variation. Recently described structural variation libraries will go a long way towards addressing this gap. The construction and end-sequencing of fosmid libraries from the DNA of unrelated HapMap individuals as part of the Human Structural Variation initiative (Eichler et al., 2007) has resulted in a remarkable resource for investigation of the highly variable subtelomeric regions of the human genome. Each Fosmid End Sequence (FES) library has roughly 12-fold clone coverage of the source genome. Importantly, because the libraries were prepared using sheared source DNA, terminal fosmids with one mate pair corresponding to part of the terminal (TTAGGG)n sequence tract can be identified. Preliminary studies in our lab have demonstrated the feasibility of isolating novel subterminal structural variants using this approach and have shown that, while distal terminal (TTAGGG)n tracts are truncated in the clones, subterminal DNA remains intact (Riethman, unpublished). Many (TTAGGG)n-adjacent sequences are still missing from the reference sequence itself, and a large number of structural variants carrying new common subterminal alleles remain to be discovered and characterized. Given the importance of subterminal sequences for telomere chromatin assembly and telomere length regulation, an effort to systematically identify and characterize subterminal sequences from common telomere alleles is warranted. Such an effort will fill a major gap in our knowledge of subtelomere structures in the human genome, and will permit development of new tools for studying telomere biology.
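As a rough illustration of how terminal fosmids might be flagged among end-sequenced clones, the sketch below scans both end reads of each clone for tandem runs of the telomeric hexamer and reports clones in which exactly one end looks telomeric; the other end is then a candidate subterminal sequence. The minimum repeat count, the function names, and the input layout are assumptions chosen for illustration and do not represent the procedure used in the unpublished work mentioned above. Both TTAGGG and its reverse complement CCCTAA are checked because the orientation of an end read relative to the telomere is not known in advance.

```python
import re


def is_telomeric(read, min_repeats=4):
    """Return True if the read contains at least `min_repeats` perfect tandem
    copies of TTAGGG or of its reverse complement CCCTAA.

    `min_repeats` is an illustrative threshold; variant repeats are ignored.
    """
    pattern = re.compile(
        r"(?:TTAGGG){%d,}|(?:CCCTAA){%d,}" % (min_repeats, min_repeats)
    )
    return pattern.search(read.upper()) is not None


def terminal_clones(mate_pairs, min_repeats=4):
    """Find clones with exactly one telomeric end read.

    mate_pairs: dict mapping clone_id -> (read1, read2) (hypothetical layout).
    Returns a list of (clone_id, n), where n (1 or 2) is the number of the
    non-telomeric read, i.e. the candidate subterminal end.
    """
    hits = []
    for clone_id, (read1, read2) in mate_pairs.items():
        t1 = is_telomeric(read1, min_repeats)
        t2 = is_telomeric(read2, min_repeats)
        if t1 != t2:  # exactly one end is telomeric
            hits.append((clone_id, 2 if t1 else 1))
    return hits


if __name__ == "__main__":
    pairs = {
        "cloneA": ("GG" + "TTAGGG" * 6, "ACGTACGTAGCTAGCTAGGATCCA"),
        "cloneB": ("ACGT" * 10, "TGCA" * 10),
    }
    print(terminal_clones(pairs))
```

A perfect-match search of this kind would miss reads dominated by variant repeats and could pick up internal (TTAGGG)n-like islands, so any hits would still need to be confirmed by mapping the non-telomeric mate to a subtelomeric region.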
Developing methods for tracking individual subtelomeric allele variants in the context of total genomic DNA is still a huge challenge. PCR-based approaches to distinguish similar but non-identical copies of duplicons and groups of adjacent duplicons have had limited success so far, mainly because the sequences and duplicon organization of most common subtelomere alleles in human populations are still unknown. Once more data are available, multiplex ligation-dependent probe amplification (MLPA)-like (Schouten et al., 2002) approaches for duplicon identification and quantitation in case-control cohorts can be envisioned. Similarly, STELA-based approaches (Britt-Compton et al., 2006) for analyzing single-telomere length and stability will likely be expanded as the sequences for a more complete set of subterminal alleles become available. Perhaps the most intriguing approach, however, may be to attempt specific fractionation of subtelomeric DNA from individual genomes using a microarray capture method (Albert et al., 2007; Hodges et al., 2007; Okou et al., 2007; Porreca et al., 2007), then use deep sequencing of the enriched subtelomere fraction and analysis of paired-end reads (Korbel et al., 2007; Campbell et al., 2008) to reconstruct subtelomeric scaffolds of the target genome. While high similarity subtelomere duplicons would certainly pose challenges for this method, the potential depth of coverage in paired-end reads and the judicious use of sequence landmarks such as adjacent single copy DNA regions, known associations of adjacent duplicons within subtelomeric segmental duplication regions (Linardopoulou et al., 2005; Ambrosini et al., 2007), and the unique positions of pure (TTAGGG)n tracts at chromosome termini (Fig. 1) might enable efficient re-sequencing of key parts of subtelomeric and subterminal regions. Such a method would provide unprecedented levels of comprehensive detail with respect to sequence organization and structural variation for this complex but critical region of the human genome.
Acknowledgement
I thank Sheila Paul for helping with the preparation of the manuscript. This work was supported by NIH grants HG00567 and HG2933, and by the Commonwealth Universal Research Enhancement Program, PA Dept. of Health.
References