Abstract
Genotype imputation is a process of estimating missing ge-notypes from the haplotype or genotype reference panel. It can effectively boost the power of detecting single nucleotide polymorphisms (SNPs) in genome-wide association studies, integrate multi-studies for meta-analysis, and be applied in fine-mapping studies. The performance of genotype imputation is affected by many factors, including software, reference selection, sample size, and SNP density/sequencing coverage. A systematical evaluation of the imputation performance of current popular software will benefit future studies. Here, we evaluate imputation performances of Beagle4.1, IMPUTE2, MACH+Minimac3, and SHAPEIT2+ IM-PUTE2 using test samples of East Asian ancestry and references of the 1000 Genomes Project. The result indicated the accuracy of IMPUTE2 (99.18%) is slightly higher than that of the others (Beagle4.1: 98.94%, MACH+Minimac3: 98.51%, and SHAPEIT2+IMPUTE2: 99.08%). To achieve good and stable imputation quality, the minimum requirement of SNP density needs to be > 200/Mb. The imputation accuracies of IMPUTE2 and Beagle4.1 were under the minor influence of the study sample size. The contribution extent of reference to genotype imputation performance relied on software selection. We assessed the imputation performance on SNPs generated by next-generation whole genome sequencing and found that SNP sets detected by sequencing with 15× depth could be mostly got by imputing from the haplotype reference panel of the 1000 Genomes Project based on SNP data detected by sequencing with 4× depth. All of the imputation software had a weaker performance in low minor allele frequency SNP regions because of the bias of reference or software. In the future, more comprehensive reference panels or new algorithm developments may rise up to this challenge.
Introduction
Next-generation sequencing and single nucleotide polymorphism (SNP) arrays are now two main methods to reveal the genotype information. The detection of more loci requires a larger sample size, larger sequencing depth for whole-genome sequencing, and a denser SNP array for microarray-based genotyping. Genotype imputation can be used to solve this dilemma by predicting untyped genotypes from the haplotype reference panel. It is now being widely used in genome-wide association studies (GWASs) to find novel risk alleles [1, 2], in fine-mapping to get a high-resolution view to increase the likelihood of identifying causal variants [3], and in integrating studies across different platforms for meta-analysis [4, 5]. Now, national large-scale whole-genome sequencing projects are being carried out in many countries, including the United Kingdom [6], Iceland [7], Japan [8], the Netherlands [9], and Singapore [10]. Phasing and imputation have been widely used in these projects to form and assess the haplotype reference panel. Furthermore, the haplotype reference panel generated from these projects can also be used in future imputation studies [11-14].
The performance of genotype imputation is affected by many factors, such as software, reference selection, SNP density (see respective section in “Methods”), sample size, and sequencing coverage. A series of assessments of genotype imputation performance has previously been performed. Huang et al. [16] found that the reference selection, all or part of the International HapMap Project haplotype reference panel, should consider the haplotype number of the reference and the population similarity between the reference and the study set when using MACH [15] for imputation. Nho et al. [18] showed that the effect of reference selection on imputation performance varies across software (MACH and IMPUTE2 [17]) for populations of the USA and Canada. Biernacka et al. [19] reported that imputation accuracy was proportional to the linkage of SNPs. Gao et al. [20] and Zheng et al.[21] found that imputation accuracy was lower in low minor allele frequency (MAF) regions for MACH and IMPUTE2. Zhang et al. [22] found that the sample untype rate of a study set was correlated to imputation accuracy. Here, we added the newly developed and widely used software Beagle4.1 and Minimac3 to our evaluation. Popular software such as Beagle4.1 [23], IMPUTE2, MACH, Minimac3 [24], and SHAPEIT2 [25] has been used in the 1000 Genomes Project [11] and many national large-scale whole-genome sequencing projects [6-10] (Table 1).
Comparisons of software can guide the future of imputation studies. Critical factors of reference selection are haplotype number in the reference panel and the similarity of populations between the reference and study [16]. We chose the haplotype reference of the 1000 Genomes Project as it has more populations and haplotype numbers than HapMap. The sample size and sequencing coverage/SNP density are also related to imputation accuracy. We estimated the performance of the current popular imputation software with all these factors: reference selection, sample size, SNP density, and sequencing coverage. As the low-frequency genetic variants in a population prefer to associate with disease [26], we also evaluated the imputation performance for low MAF SNPs compared to common ones. Furthermore, we analyzed the reasons behind the different performance.
Methods
Study Data
Study sets were selected from Chinese individuals in the International HapMap Project and the 1000 Genomes Project. Only chromosome 1 and 22 were chosen to evaluate the performance of genotype imputation with considering efficiency. 43 individuals were genotyped in all three phases of HapMap, with a denser SNP density (Table 2). 130 individuals were genotyped from Illumina Human1M and Affymetrix SNP 6.0 in HapMap phase III. The National Center for Biotechnology Information (NCBI) genome build of study sets was transformed from 36 to 37 (see below). 10 individuals were sequenced in the 1000 Genomes Project with an averaging coverage of 15× (12× to 16×). Raw sequencing reads were downloaded and aligned to the reference genome (GRCh37) using Burrows-Wheeler Aligner [27]. PCR duplications were removed by Picard [28]. Genome Analysis Toolkit [29] was used to generate high-quality variants. Study data of low-coverage sequencing were simulated by subsampling the raw reads of the 10 individuals’ alignment file using SAMtools [28]. Two alignment files of average sequencing coverage 4× (3× to 4×) and 7× (6× to 8×) were generated, followed by variant calling.
Quality Control
Quality control was performed on data using PLINK [30]. The sample filter criterion was: sample call rate < 97%. SNP filter criteria were as follows: (i) SNP call rate < 97%, (ii) Hardy-Weinberg p value < 10–6, and (iii) inconsistent SNP sites with the 1000 Genomes Project. The quality control was performed for all samples in the study set. All samples passed the filtration. After quality control step, 183,676 SNPs for chr1 and 31,796 SNPs for chr22 of the 43 samples and 103,797 SNPs for chr1 and 18,510 SNPs for chr22 of the 130 samples in the study set remained. For 10 sequencing samples, only the inconsistent SNPs and SNPs deviating from the Hardy-Weinberg equilibrium were removed.
Imputation Strategy
Several phasing and imputation software programs have been developed to achieve a high imputation accuracy and limited computational burden. Popular software such as Beagle4.1, IMPUTE2, MACH, Minimac3, and SHAPEIT2 has been used in the 1000 Genomes Project and several national large-scale whole-genome sequencing projects in the United Kingdom, Iceland, Japan, the Netherlands, and Singapore. The main algorithm of Beagle4.1 is a hidden Markov model (HMM), which uses a clustering graphical model on haplotypes. IMPUTE2 method is a two-step algorithm comprising phasing and imputation. First, it infers haplotype conditioning from information of the study sample, reference, and recombination rate using a Markov Chain Monte Carlo approach. Second, it uses an HMM to impute the missing genotypes based on the haplotypes of the study sample inferred from the phasing step. IMPUTE2 iterates these two steps to maximize the posterior probabilities of the missing alleles for imputation. Minimac3 uses state space reduction HMM to reduce the running time. MACH is a phasing method using an HMM model. SHAPEIT2 is a haplotype estimation method using an HMM model on a graph structure of haplotypes. These tools are the most commonly used in genotype imputation. As the previous studies suggested, MACH can be used with Minimac3 for pre-phasing [24], and IMPUTE2 can use SHAPEIT2 for pre-phasing [31]. Thus, we performed four imputation methods, which were Beagle4.1, IMPUTE2, SHAPEIT2+IMPUTE2, and MACH+Minimac3. All software was run with default parameters.
As for the reference, we should consider sample number and population similarity between the reference and the study set simultaneously. Compared to the International HapMap Project, the 1000 Genomes Project with next-generation sequencing data is now the primary source for reference panels with more variants, individuals, and various populations. The 1000 Genomes Project Phase 3 contains 84.4 million variants of 2,504 individuals from 26 populations. We employed the 1000 Genomes Project as a reference panel (1KG_ALL). Considering the similarity of the populations, we introduced 504 East Asian individuals from the 1000 Genomes Project as another reference panel (1KG_EAS) to evaluate the influence of reference selection on imputation performance.
In addition to software and reference selection, we also considered other factors that affected imputation performance. In order to evaluate the effect of SNP density in study sets, we subsampled SNPs by sorting SNPs based on position and extracted m SNPs (m = 1, 1.5, 2, 2.5, 3–6, 9, 12, 15, 18, 21, 24) for every 36 SNPs to generate 14 subsets with different SNP density levels. After extracting 24 SNPs for every 36 SNPs, the remaining SNPs were selected as test set. Considering software, reference selection, and SNP density, these three factors formed 112 different combinations; here, we evaluated their influence on imputation. After comprehending the correlation result of these three factors, we further estimated the effect of sample size on imputation using 6 levels (10, 20, 40, 80, 100, and 130). The study sets with different sample sizes were sampled from 130 HapMap phase III without replacement. The bigger sample size study sets were generated by adding new individuals to smaller one. As next-generation sequencing data is a mainstream method for large-scale population sequencing, we also evaluated the imputation performance with different sequencing coverage (Fig. 1).
Imputation Measurement
The accuracy was represented by the concordance rate, the percentage of correctly imputed genotypes of the test set. We also used r2 [squared Pearson correlation coefficient between imputed genotype dosages in (0–2) and masked sequence genotypes in (0, 1, 2)] to represent the imputation accuracy. In addition to the direct comparison of imputation accuracy, we also analyzed the discordance rate versus the missing rate under different thresholds of imputation genotype posterior probabilities. Imputation accuracy was assessed for different MAFs. We emphasized the assessment of the imputation performances between common (≥ 5%) and low (< 5%) frequency groups. In addition to comparing the imputation accuracy directly, we calculated the sensitivity and false-positive rate (FPR) of these two groups (see below). We ran all imputation strategies twice on the same Linux server, which has two Intel Xeon X5650 processors (running at 2.66 GHz, with a 12 MB cache, and using a 64-bit architecture) and used the average value to assess the computational burdens, including running time and computational memory.
SNP Density
SNP density is the percentage of SNP number per megabase (MB).
Converting HapMap Build36 Data to Build37
First, Genome Analysis Toolkit was used to convert the HapMap txt file into a vcf file. Then, liftover was used to convert the build36 vcf file into a build37 vcf file.
Imputation Quality Measurement
The discordance rate versus the missing rate on different threshold of imputation genotype posterior probabilities is one of the measurements to evaluate the performance. The discordance rate is the percentages of discordance between imputed genotypes and masked genotypes. The missing rate is the percentage of no calls made. Sensitivity and FPR are formulated in a 2 × 2 contingency table (Table 3).
Results
Imputation Performance Affected by Software, Reference Selection, and SNP Density
Imputations were performed on 43 individuals using Beagle4.1, IMPUTE2, MACH+Minimac3, or SHAPE-IT2+IMPUTE2 under different reference selection and SNP density conditions (Fig. 1, Part 1). The imputation accuracy generated by each piece of software was higher than 85% on the test sets, chr1, and chr22 (Fig. 2a, b). The results from chr1 and chr22 are consistent; therefore, only the results for chr1 are shown below. IMPUTE2 performed slightly better than the others. The advantages of IMPUTE2 are reflected in using the reference information both in phasing and imputation compared to SHAPEIT2+IMPUTE2, directly applying an HMM on haplotypes compared to Beagle4.1, and choosing the closest subset of haplotypes with the study set compared to MACH.
These are probable causes for IMPUTE2’s better performance. Different software programs preferred different reference panels: 1KG_EAS was better suited to MACH+ Minimac3 and SHAPEIT2+IMPUTE2, and 1KG_ALL was better suited to Beagle4.1 (Fig. 2c, d). IMPUTE2 performed stably on both 1KG_EAS and 1KG_ALL. The results suggested that software with a pre-phasing method, MACH+ Minimac3 and SHAPEIT2+IMPUTE2, performed better with higher population similarity between the study and the reference. As SNP density increased, the imputation accuracy increased gradually in all software for microarray-based genotyping. The imputation result from chr1 and chr22 claimed that when SNP density reached 200/Mb, the accuracy tended to saturate.
As for computational burden, IMPUTE2 was the most time-consuming, while SHAPEIT2+IMPUTE2 was the fastest (Fig. 3a). Except for Beagle4.1, the running times of the other software programs were influenced only by SNP number and sample number of the reference, instead of SNP density in the study set (Fig. 3b). This limitation is because the SNP number and sample number of the reference are much larger than those of the study set. The required computer memory for the software was not affected by SNP density and was sensitive to the reference (Fig. 3a, c).
Effect of Sample Size
After a general assessment of the software, we chose better-performing software, IMPUTE2, Beagle4.1, and SHAPEIT2+IMPUTE2 with an SNP density 278/Mb, to evaluate the influence of sample size. IMPUTE2 and Beagle4.1 were generally not affected by sample size as compared to SHAPEIT2+IMPUTE2 (Fig. 4a). With expanded sample size, IMPUTE2 was still the best performing software and was stable on both references (Fig. 4b). As for the running time of the software, Beagle4.1 and SHAPEIT2+IMPUTE2 were much more insensitive to sample size (Fig. 4c).
Imputation Accuracy with MAF
In GWASs, low-allele frequency SNPs tend to be disease-associated [32, 33]. Thus, the imputation of low-frequency alleles is more important. As the MAF decreased, the accuracies of all software also decreased (Fig. 5a). IMPUTE2 and Beagle4.1 worked better with low-frequency SNPs. Usually, software utilizes linkage disequilibrium of SNPs to impute missing genotypes, and imputation accuracy is proportional to SNP linkages [34]. Here, we divided SNPs into two groups by their MAF: < 5% and ≥5%. Comparing the linkage disequilibrium rate between the two groups, we found the average linkage disequilibrium rate (the mean of the correlation coefficients r2 between SNPs calculated by PLINK) of the < 5% group was not lower than that of the ≥5% group (0.691 vs. 0.670). Therefore, in our study, low imputation accuracy of low MAF SNP was not caused by the linkage disequilibrium rate.
The study set or reference is actually a subset of all populations. Sampling error can cause discordance between study set and reference. In mathematics, sampling error is proportion to the variability of the data set. As rare SNPs have more variability, the discordance between study set and reference of rare SNPs is bigger than that between study set and reference of common SNPs. So, this might be the reason why imputation accuracies of rare SNPs are poorer, the bias of reference rather than the bias of the linkage disequilibrium rate. The sensitivity and FPR in the < 5% group were both lower than in the ≥5% group, which suggested that the software preferred to impute the low MAF alleles to be the same allele as the human genome reference, no matter what the genotypes really were (Fig. 5b and Table 4). Therefore, software bias could also be the reason for the low accuracy of low MAF SNPs.
Effect of Sequencing Coverage
Imputation is an essential part of genotype detection of low-coverage next-generation sequencing projects. Therefore, we assessed the imputation performance with different sequencing depths in a low-coverage sequencing project. The identified SNPs of 10 individuals with 4× and 7× coverage were mostly consistent with 15× (Fig. 6). Compared with 4× or 7×, most of the 15× specified SNP could be imputed from 1KG_ALL. In theory, the SNP data set with 15× could be inferred based on data of 4× or 7×. Imputation was performed on SNP data of 4× and 7× from 1KG_ALL using IMPUTE2 because of its better performance on 10 individuals. The SNP density of 4× and 7× is 1,384/Mb and 1,825/Mb, respectively. The average imputation accuracies of IMPUTE2 at 4× and 7× coverage were 90.30 and 90.56%, respectively. With a study set SNP density over 200/Mb, imputation accuracy with 4× or 7× coverage in next-generation sequencing was lower than the microarray study. The reasons might be the lower sample typed rate (78.95% of 4× and 94.94% of 7×) [22] and the influence of inconsistently called SNP loci (1.56% between 4× and 15×, and 1.44% between 7× and 15×).
Discussion
In this study, we systematically estimated the effect of software, reference selection, sample size, SNP density/sequencing coverage, and MAF on imputation. When SNP density reached 200/Mb for microarray-based genotyping, imputation accuracy tended to be stable and Beagle4.1, IMPUTE2, MACH+Minimac3, and SHAPEIT2+ IMPUTE2 all achieved high imputation accuracy. IMPUTE2 provided the best accuracy but was also the most time-consuming and used the most memory, since it adopts as much biological information as possible in the phasing and imputation steps. SHAPEIT2+IMPUTE2 was the fastest and most memory-saving imputation method with little sacrifices to accuracy compared with IMPUTE2, which might make it more practical for large data sets generated by next-generation sequencing [31]. As for reference selection, a reference that has a high population similarity with the study set was more suitable for MACH+Minimac3 and SHAPEIT2+IMPUTE2. Beagle4.1 performed better with a reference that has a large sample number. IMPUTE2 was relatively stable regardless of reference selection. The performances of IMPUTE2 and Beagle4.1 were stable regardless of sample size.
We found that all tested software worked poorly for the imputation of low MAF SNPs. The cause was not the linkage disequilibrium rate but the bias of references and software. For low MAF SNPs, discordance information provided by the reference duo to the bias could cause imputation accuracy to decrease. According to sensitivity and FPR, the software preferred to impute the missing genotypes to be the same as the human genome reference at low MAF SNPs. The problem with the reference can be eased by increasing the sample number of the reference, especially the population-specific samples [16]. Fortunately, population-specific haplotype reference panels have been constructed in several countries. For the software, the problem of improving the imputation accuracy at low MAF SNPs is challenging and warrants considerable attention. A new algorithm needs to be developed to improve the accuracy on low MAF SNPs.
As next-generation sequencing is now the main way for GWASs, imputation on sequencing-provided data needs extra attention [13]. The imputation on low-coverage sequencing is challenging mostly because of the low sample typed rate and inconsistently called SNP loci. The inconsistently called SNPs may be filtered by strict quality control standards. For the low sample typed rate of a study set, there are two ways to solve the problem: deeper sequencing or higher quality references. Increased sequencing depth comes with more sequencing costs, and the existing references are limited. Therefore, improvement of imputation software and some newly developed algorithm could be a direct solution to this problem.
Conclusions
The effects of software, reference selection, SNP density, and sample size on imputation performance were evaluated to give a general and practical guidance for future imputation studies. We found that IMPUTE2 was the most accuracy and stable method. SHAPEIT2+IMPUTE2 was the fastest method with some accuracy sacrifice. To get a reliable imputation result, SNP density should be > 200/Mb for microarray-based genotyping. Imputation performance on SNPs generated by next-generation whole genome sequencing was assessed, we found that SNP sets detected by higher sequencing depth (15×) could be mostly available by imputing from the haplotype reference panel of the 1000 Genomes Project based on SNP data of lower sequencing depth (4×). All of the imputation software had a weaker performance in low MAF SNP regions because of the bias of reference or software, so more comprehensive reference panels or new algorithm developments may rise up to this challenge.
Acknowledgements
The authors thank Dr. Jun Yu and Dr. Songnian Hu for their valuable discussions on this work.
Statement of Ethics
The work used open access data of the 1000 Genomes Project and the International HapMap Project.
Disclosure Statement
The authors declare that they have no competing interests.
Funding Sources
This work was supported by the National Key Research Program of China (2016YFB0201702; 2017YFC0907503 to J.X. and 2016YFC0901903 to Z.D.); the National Natural Science Foundation of China (31471248 to J.X.); the Key Program of the Chinese Academy of Sciences (KJZD-EW-L14 to J.X.); the Youth Innovation Promotion Association CAS (grant to J.W.); and funding for open access charge: the National High-tech R and D Program (2015AA020101 to Z.D.).