Introduction: Ideally, evaluating next-generation sequencing performance requires a gold standard; in its absence, concordance between replicates is often used as a substitute standard. However, the appropriateness of this concordance-discordance criterion has rarely been evaluated. This study analyses the relationship between the probability of discordance and the probability of error under different conditions. Methods: The study used a conditional probability approach, under conditional independence and then under conditional dependence between two sequencing results, and compared the probabilities of discordance and error under different theoretical conditions of sensitivity, specificity, and correlation between replicates, then on real sequencing results of the NA12878 genome. The study also examined covariate effects on discordance and error using generalized additive models with smooth functions. Results: With 99% sensitivity and 99.9% specificity under conditional independence, the probability of error for a positive concordant pair of calls is 0.1%. With the additional hypotheses of 0.1% prevalence and 0.9 correlation between replicates, the probability of error for a positive concordant pair is 47.4%. With real data, the estimated sensitivity, specificity, and correlation between tests for variants were around 98.98%, 99.996%, and 93%, respectively, and the error rate for positive concordant calls approximated 2.5%. In covariate effect analyses, the functional forms of the effects were close between discordance and error models, though the parts of deviance explained by the covariates differed between the two models. Conclusion: Under conditional independence of two sequencing results, the concordance-discordance criterion seems acceptable as a substitute standard. However, with high correlation, the criterion becomes questionable because a high percentage of false concordant results appears among concordant results.

Next-generation sequencing (NGS) costs have decreased dramatically in recent years, allowing wider use. However, how to evaluate and interpret NGS results (e.g., their accuracy) is still debated.

One NGS objective is identifying differences between an individual's genome and the reference human genome; i.e., identifying non-reference base pairs (bps), or variants. Generally, in whole-genome sequencing (WGS), the expected proportion of variants is 1–2 per kb of DNA, whereas the error rate has been estimated at 1 in 10–15 kb in raw genotype calls and 1 in 100–200 kb in post-filtered calls [1, 2], which are rather low rates. However, given that WGS explores nearly 3 × 10⁹ bps per individual, those error rates still lead to non-negligible tens of thousands of errors.

Evaluating WGS overall performance (sequencing and pipeline analysis) requires estimating performance metrics: sensitivity (Se), specificity (Sp), positive predictive value (PPV), etc. One approach compares sequencing results against an established reference, or "gold standard." NA12878, the genome of a healthy female donor of European ancestry (the daughter in a father-mother-child "trio"), is the foremost human genome reference standard. One of the most widely used "gold standards" for the NA12878 genome is the high-confidence variant set established by the Genome in a Bottle Consortium (GIAB) [3, 4]. However, with real individual data, sequencing result validation is complicated by the lack of a gold standard: evaluating Se requires external data (e.g., genotype array data) as a source of a complete variant set [5], and evaluating Sp or PPV requires either experimental validation via Sanger sequencing (few sites) or "target enrichment sequencing" (numerous sites).

Evaluating performance may also use platform or pipeline result comparisons. Finding discordant variants implies errors, whereas concordant variants suggest a low error probability. Reumers et al. [6] defined each discordance between sequences of monozygotic twins as "error" and each concordance as "truth," and carried out experiments to confirm those assumptions; selected shared single nucleotide variants (SNVs) were all confirmed by genotyping, indicating a very low error rate among concordant SNVs. Lam et al. [7] compared the results of two platforms on biological replicates and found 88.1% concordance among all variants detected by at least one platform. Later, Ratan et al. [8] validated 92.7% of concordant SNVs but only 60% of platform-specific variants, providing higher reliability for concordant results; then, on three platforms, they showed a 64.7% concordance rate across platforms. Nevertheless, to confirm concordance, biological or technical replicates were sometimes used [7, 9] and, often, the number of difficult-to-define true negatives is unavailable, so Sp cannot be estimated [10].

One difficulty is that accepted validations with Sanger sequencing or GIAB benchmarking have limitations; e.g., the former's Se is still questioned [11], and the established GIAB reference sets used to evaluate pipeline performance are mostly based on high-confidence regions, which might overestimate variant calling accuracy. For example, only 74.6% of exonic bases in ClinVar and OMIM genes belonged to high-confidence regions [12]. This led to recommendations to evaluate performance by comparing samples from a parent-offspring trio or from the same individual [13, 14]. Other methods that use external datasets would also yield biased Se and Sp estimates, especially in easy-to-sequence genome regions [15].

Factors associated with the error rate have also been explored. Genome region complexity and heterogeneity make NGS performance inconsistent, and erroneous realignment in low-complexity repetitive DNA regions was identified as a main source of errors [13, 16]. Empirical models have also examined the relationship between read depth and Se or error rate estimates [5, 8, 9]. Other error rate-related factors are low genotype quality [17], high guanine-cytosine content, and proximity to other SNVs [8, 18], to the centromere, or to the telomeres. Other studies modelled the relationships between several covariates and NGS performance: (i) a generalized linear model with 23 parameters was used to separate true positive (TP) from false positive (FP) calls in [19]; (ii) variant-free simulated reads explored the relationship between the number of FP calls and seven covariates in [20]; (iii) the effects of twelve factors on the error rate were combined to better filter errors in [6].

Thus, evaluating DNA sequencing performance with real persons' data suffers from the absence of a gold standard. Discordance between technical (experimental) replicates, biological (specimen) replicates, or results from several pipelines or platforms is used to reduce the error rate. However, whether the discordance rate (discordant calls over all assessed calls) corresponds to the error rate remains unclear.

This study examined the relationships between error and discordance in various situations. After deriving some theoretical relationships, the error rate and the discordance rate are calculated with data on three NA12878 genome replicates. Finally, differences between estimates of covariate effects associated with error and discordance are examined.

Data Source

The data are calling results from three NA12878 replicates sequenced on the Illumina NovaSeq 6000 platform and aligned with the Burrows-Wheeler Aligner (BWA-MEM) [21] against the GRCh37 version of the human reference genome. The three replicates were joint-genotyped with GATK [22]. Variant calling adopted the GATK Best Practices recommendations [23, 24]. The latest version (v4.2.1) of the GIAB variant calling benchmark regions was used as "gold standard"; this version has higher coverage of the reference genome and includes more difficult-to-map regions than previous versions [25, 26].

Error and Discordance

Here, an error was either a variant in the gold standard set called as non-variant by the query results (i.e., a false negative, FN) or a non-variant called as variant (i.e., a false positive, FP). This follows the "local match" criterion, for which "any site in the query with a nearby truth variant is counted as a TP, even if alleles and genotypes differ" [10]. For a given individual, at a given DNA bp position, whatever the genotype or allele, two calling results were considered "concordant" whenever both were identified as variants or both as non-variants, and "discordant" whenever one was identified as variant and the other as non-variant.

An agreement between two samples from the same individual was considered a substitute for the gold standard. The agreement/disagreement status was analysed as response variable Y (0 for agreement, 1 for disagreement).

Probabilities of Error and Discordance

Let A and B be two sequencing processes with sensitivities SeA and SeB and specificities SpA and SpB, and let ηv.ab denote the conditional probability of results A = a (a = 0 or 1) and B = b (b = 0 or 1) given the true status V = v, with V = 1 for variants and V = 0 for non-variants.

For variants, the error rate comes down to the FN rate; i.e., (1 − Se) = the number of FNs/the number of variants. For non-variants, the error rate comes down to the FP rate; i.e., (1 − Sp) = the number of FPs/the number of non-variants (online suppl. Table S1; for all online suppl. material, see https://doi.org/10.1159/000538401).

For any variant, the probability of correct classification by both A and B is η1.11, that of misclassification by both A and B is η1.00, and that of misclassification by either A or B (i.e., discordance between A and B) is η1.10 + η1.01 (online suppl. Table S2). For any non-variant, those probabilities are η0.00, η0.11, and η0.10 + η0.01, respectively.

Therefore, for all bps in the genome, π being the prevalence of a variant [π = P (V = 1)], the expected probability Pab may be written:
Pab = P(A = a, B = b) = π × η1.ab + (1 − π) × η0.ab    (1)

Consequently, considering all bp results (variants and non-variants), the probability of concordance is P11 + P00 and that of discordance P10 + P01. A discordant pair of results means one result is an error, whereas a concordant pair means both results are correct or both are errors.
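As a minimal sketch of equation (1), the probabilities of concordance and discordance can be computed from the prevalence π and the conditional probabilities ηv.ab; the η values below are illustrative only (they correspond to two conditionally independent tests with 99% Se and 99.9% Sp):

```python
# Equation (1): P_ab = pi * eta[1][(a, b)] + (1 - pi) * eta[0][(a, b)]
def pair_probability(pi, eta, a, b):
    """pi: variant prevalence; eta[v][(a, b)] = P(A = a, B = b | V = v)."""
    return pi * eta[1][(a, b)] + (1 - pi) * eta[0][(a, b)]

# Illustrative conditional probabilities (assumption, not real data).
eta = {
    1: {(1, 1): 0.9801, (1, 0): 0.0099, (0, 1): 0.0099, (0, 0): 0.0001},
    0: {(1, 1): 0.000001, (1, 0): 0.000999, (0, 1): 0.000999, (0, 0): 0.998001},
}
pi = 0.001  # 0.1% variant prevalence

# Concordance: P11 + P00; discordance: P10 + P01.
p_concordant = pair_probability(pi, eta, 1, 1) + pair_probability(pi, eta, 0, 0)
p_discordant = pair_probability(pi, eta, 1, 0) + pair_probability(pi, eta, 0, 1)
```

Since each conditional distribution sums to 1, the four pair probabilities also sum to 1, so concordance and discordance are complementary.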

Error and Discordance under Conditional Independence

With the same notations and under an assumption of conditional independence given the true status of each bp (variant or non-variant), the conditional probabilities for any variant are η1.11 = SeA × SeB, η1.00 = (1 − SeA) × (1 − SeB), and η1.10 + η1.01 = SeA × (1 − SeB) + (1 − SeA) × SeB. For any non-variant, the equations are the same but with SpA and SpB instead of SeA and SeB.

Therefore, for all bps (online suppl. Table S2), the expected probabilities for P11, P00, P10, and P01 can be calculated using equation (1), for example:
P11 = π × SeA × SeB + (1 − π) × (1 − SpA) × (1 − SpB)
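Under this assumption, all four pair probabilities, and the resulting predictive value of a positive concordant pair, can be sketched as follows (using the Se, Sp, and π values of Table 1, column 1):

```python
def pair_probs_independent(pi, se_a, se_b, sp_a, sp_b):
    """P(A = a, B = b) for each pair of results, assuming conditional
    independence given the true status V (equation 1)."""
    p = {}
    for a in (0, 1):
        for b in (0, 1):
            # P(A=a, B=b | V=1): product of per-test Se terms.
            eta1 = (se_a if a == 1 else 1 - se_a) * (se_b if b == 1 else 1 - se_b)
            # P(A=a, B=b | V=0): product of per-test Sp terms.
            eta0 = ((1 - sp_a) if a == 1 else sp_a) * ((1 - sp_b) if b == 1 else sp_b)
            p[(a, b)] = pi * eta1 + (1 - pi) * eta0
    return p

# Se = 99%, Sp = 99.9%, prevalence 0.1% (as in Table 1, column 1).
p = pair_probs_independent(pi=0.001, se_a=0.99, se_b=0.99, sp_a=0.999, sp_b=0.999)

# PPV of a positive concordant pair: P(V=1 | A=1, B=1), about 99.9% here.
ppv_concordant_pair = 0.001 * 0.99 * 0.99 / p[(1, 1)]
```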

Error and Discordance Accounting for Correlation

The conditional independence assumption does not correspond to real situations: at a given bp position, because of covariates that influence NGS accuracy, the error rates of two different tests that use the same pipeline may be correlated.

In case of non-independence, η1.ab and η0.ab can be formulated with additional parameters that represent that correlation. Here, one parameter was assigned to variants and another to non-variants.

Conditional Dependence between Two Sequencing Processes A and B with Covariance

Let Cov1 and Cov2 denote the covariances of the two sequencing tests for V = 1 and V = 0, respectively. For variants, the conditional probabilities of the combinations of the two test results are the following (for non-variants, replace Se by Sp and Cov1 by Cov2):

η1.11 = SeA × SeB + Cov1
η1.00 = (1 − SeA) × (1 − SeB) + Cov1
η1.10 = SeA × (1 − SeB) − Cov1
η1.01 = (1 − SeA) × SeB − Cov1

The final probabilities for all bps can be calculated using equation (1). The relationships between correlation, covariance, and Se or Sp for variants and non-variants are then:

Cor1 = Cov1/√[SeA × (1 − SeA) × SeB × (1 − SeB)] and Cor2 = Cov2/√[SpA × (1 − SpA) × SpB × (1 − SpB)]
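A sketch of these formulas for variants (Se = 99% for both tests, 90% correlation, as in Table 2, column 1): the covariance is recovered from the correlation, then added to the two concordant cells and subtracted from the two discordant cells.

```python
import math

def cov_from_cor(cor, p_a, p_b):
    """Covariance from correlation: Cov = Cor * sqrt(pA(1-pA) * pB(1-pB))."""
    return cor * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

def etas_dependent(p_a, p_b, cov):
    """eta[(a, b)] given each test's probability of a positive call
    (Se for variants, 1 - Sp for non-variants) and their covariance."""
    eta = {}
    for a in (0, 1):
        for b in (0, 1):
            base = (p_a if a == 1 else 1 - p_a) * (p_b if b == 1 else 1 - p_b)
            # Covariance is added to concordant cells, subtracted from discordant ones.
            eta[(a, b)] = base + (cov if a == b else -cov)
    return eta

# Variants: SeA = SeB = 99%, correlation Cor1 = 90%.
cov1 = cov_from_cor(0.9, 0.99, 0.99)
eta1 = etas_dependent(0.99, 0.99, cov1)
```

With these values, η1.11 ≈ 98.9% and the discordance η1.10 + η1.01 ≈ 0.2%, matching Table 2.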

Illustration with Usual NGS Performance Indicators

Estimated performance values from the literature were used for the above-mentioned probabilities. The Se of detecting variants is estimated at 90–99% (95.4% in [12], 98.66% in [27], 99.3% in [7]) and Sp at 99.9–99.99% [6, 13]. The whole-genome bp number was set at 3 × 10⁹.

Under conditional independence, Se was set at 90% or 99% and Sp at 99% or 99.9%. The expected prevalence of any variant was set at 0.1%; i.e., the expected number of variant bps was 3 × 10⁶. The PPVs and negative predictive values (NPVs) were calculated for each Se/Sp combination.

Under conditional dependence, adding two dependence parameters led to seven parameters in the probability model: SeA, SeB, SpA, SpB, Cor1, Cor2, and π. Here, Se was set at 90% or 99%, Sp at 99% or 99.9%, and the expected prevalence of variants at 0.1% or 0.2%. Three correlation levels between replicates were considered: 30%, 50%, and 90% (low, intermediate, and high).
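A sketch of this seven-parameter model; with the column 1 settings of Table 3, it approximately reproduces the reported predictive values (≈52.6% probability of a true variant for a positive concordant pair, small rounding differences aside):

```python
import math

def predictive_values(pi, se_a, se_b, sp_a, sp_b, cor1, cor2):
    """P(V=1 | A=a, B=b) for each pair of results under conditional
    dependence, with covariances derived from the two correlations."""
    cov1 = cor1 * math.sqrt(se_a * (1 - se_a) * se_b * (1 - se_b))
    cov2 = cor2 * math.sqrt(sp_a * (1 - sp_a) * sp_b * (1 - sp_b))
    out = {}
    for a in (0, 1):
        for b in (0, 1):
            eta1 = ((se_a if a == 1 else 1 - se_a)
                    * (se_b if b == 1 else 1 - se_b)
                    + (cov1 if a == b else -cov1))
            eta0 = (((1 - sp_a) if a == 1 else sp_a)
                    * ((1 - sp_b) if b == 1 else sp_b)
                    + (cov2 if a == b else -cov2))
            p1 = pi * eta1          # P(V=1, A=a, B=b)
            p0 = (1 - pi) * eta0    # P(V=0, A=a, B=b)
            out[(a, b)] = p1 / (p1 + p0)
    return out

# Table 3, column 1: Se = 99%, Sp = 99.9%, Cor1 = Cor2 = 90%, pi = 0.1%.
pv = predictive_values(pi=0.001, se_a=0.99, se_b=0.99,
                       sp_a=0.999, sp_b=0.999, cor1=0.9, cor2=0.9)
```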

Analysis of Real Data

Analyses of real WGS data considered only bps from the GIAB benchmark region, each bp position being a "statistical unit" and each GIAB benchmark result the true status of that bp. On these bps, two analyses were performed: (i) a performance analysis comparing the results of each replicate with the truth set (online suppl. Table S1); and (ii) a concordance analysis comparing results between any two replicates using a 2 × 2 contingency table (online suppl. Table S2). Only performance for SNVs was analysed.

Performance evaluation used the above definitions of TPs, FPs, FNs, and true negatives. No-calls in the VCF file were considered non-variants. N being the number of homozygous reference bps in the GIAB benchmark call set, Se was calculated as TPs/(TPs + FNs), Sp as 1 − FPs/N, and the PPV as TPs/(TPs + FPs).
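As a minimal sketch, these metrics can be computed directly from call counts; the counts below are hypothetical, chosen only to be of the same order of magnitude as the real replicates:

```python
def performance(tp, fp, fn, n_ref):
    """Se, Sp, and PPV from call counts; n_ref is the number of
    homozygous reference bps in the benchmark call set."""
    se = tp / (tp + fn)
    sp = 1 - fp / n_ref          # equivalently (n_ref - fp) / n_ref
    ppv = tp / (tp + fp)
    return se, sp, ppv

# Hypothetical counts for illustration (not the study's actual counts).
se, sp, ppv = performance(tp=3_205_000, fp=102_000, fn=33_600,
                          n_ref=2_502_460_587)
```

Note that with N in the billions, even ~10⁵ FPs leave Sp above 99.99%, which is why PPV is the more discriminating metric for variant calls.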

Concordance analysis used “local match.” Concordance rate and correlations were calculated for real variant positions, real non-variant positions, and all positions.
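The correlation between two replicates can be quantified, for instance, with the phi coefficient of the 2 × 2 contingency table (the Pearson correlation of two binary variables); a sketch with a hypothetical toy table:

```python
import math

def concordance_and_phi(n11, n10, n01, n00):
    """Concordance rate and phi correlation from a 2x2 table, where
    n_ab counts bps called a by one replicate and b by the other."""
    n = n11 + n10 + n01 + n00
    concordance = (n11 + n00) / n
    # Phi coefficient: Pearson correlation between two binary calls.
    phi = ((n11 * n00 - n10 * n01)
           / math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    return concordance, phi

# Hypothetical toy table for illustration only.
concordance, phi = concordance_and_phi(n11=90, n10=5, n01=5, n00=900)
```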

Analysis of Covariate Effects

Using the GIAB benchmark set as gold standard, generalized additive models were built with a logistic link to estimate covariate effects on the probabilities of discordance and error. Mathematically, this model may be written:
Logit(P) = β0 + Σm=1..M fm(Xijm) + εij    (2)

where i indexes the replicate, j the bp position, Xm a covariate (m = 1, …, M), fm a smooth function, β0 the intercept, εij the Gaussian error, and P = P(Y = 1), with Y = 1 for discordance or error and Y = 0 for concordance or correct call. Here, the concordance status for replicates 1, 2, and 3 was defined, respectively, as the concordance between replicates 1 and 2, replicates 2 and 3, and replicates 3 and 1.

The modelling used the "bam" function of the mgcv R package, which is adapted to large dataset analyses. Smoothing used a natural cubic regression spline, and the smoothing parameter was estimated (by default) by fast restricted maximum likelihood (REML). The "bam" function uses penalized iteratively re-weighted least squares (PIRLS) with a single iteration per fitting step, a method similar to "performance-oriented iteration" [28, 29].
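The study fitted these models with mgcv's "bam" in R. As an illustrative analog only — unpenalized, with a crude cubic polynomial basis in place of penalized regression splines, and in Python rather than R — the iteratively re-weighted least squares core of such a fit can be sketched as:

```python
import numpy as np

def design(x):
    """Cubic polynomial basis: a crude stand-in for mgcv's penalized
    regression splines (illustration only, no penalty)."""
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def fit_logistic_cubic(x, y, n_iter=25):
    """Fit logit P(Y=1) = f(x) by iteratively re-weighted least squares,
    the core iteration underlying (P)IRLS schemes such as bam's."""
    X = design(x)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))          # inverse logit
        w = np.maximum(mu * (1.0 - mu), 1e-12)   # IRLS weights
        z = eta + (y - mu) / w                   # working response
        # Weighted least squares step on the working response.
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
    return beta

def predict(beta, x):
    return 1.0 / (1.0 + np.exp(-(design(x) @ beta)))

# Synthetic example: true logit is -2 + 4x (monotone increasing risk).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 5000)
y = (rng.uniform(size=5000) < 1.0 / (1.0 + np.exp(2.0 - 4.0 * x))).astype(float)
beta = fit_logistic_cubic(x, y)
```

Unlike this sketch, mgcv adds a smoothness penalty and selects its weight by (fast) REML.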

The covariates included were:

a. The depth of coverage (DP); i.e., the number of informatic reads covering a given bp. Sites with DP >100 were excluded due to high probabilities of mapping artefacts [13].

b. The allele balance (AB); i.e., the number of reads supporting the alternative allele divided by the number of all informatic reads at a specific site.

c. The genotype quality (GQ); i.e., the Phred-scaled confidence in the called genotype (0–99).

d. The QualByDepth (QD); i.e., the site-level Phred-scaled confidence in the existence of a variant divided by the number of reads supporting the alternative allele in variant samples.

e. The mapping quality (MQ); i.e., the root mean square of the MQ of reads across all samples.

Covariates DP, AB, and GQ were obtained from the VCF file for each bp in each sample (here, replicate). MQ and QD were obtained from the VCF file for each bp and had the same values across the three samples.

Univariate and multivariate analyses were conducted. In the latter, the optimal model was selected using the Akaike Information Criterion. The percentages of deviance explained by the models were also estimated.

Performance Comparisons under Conditional Independence

Here, as an example, for any variant, with 99% Se, the probability of discordance (P10 + P01) would be 1.98% and that of false concordance (P00, both calls FNs) 0.01%. For any non-variant, with 99.9% Sp, P10 + P01 would be 0.1998% and the probability of false concordance (P11, both calls FPs) 0.0001% (i.e., 10⁻⁶) (online suppl. Table S3).

For all bps, P10 + P01 would be nearly 0.1998%, with nearly 6 × 10⁴ discordant variants and 6 × 10⁶ discordant non-variants. The number of bps called as concordant variants would be 2.94 × 10⁶, of which 3 × 10³ bps would actually be non-variants; i.e., the PPV for a positive concordant pair = 99.9%. The probability of error for a negative concordant pair would be about 10⁻⁷; i.e., NPV = 99.99999% (Table 1, column 1).

Table 1.

Predictive values of test results according to the true base pair statuses in various Se and Sp values

Test results and true bp status | 99.0% Se, 99.9% Sp | 99.0% Se, 99.0% Sp | 90.0% Se, 99.9% Sp | 90.0% Se, 99.0% Sp
A = 0 and B = 0
 V = 0 | 99.99999% | 99.999% | 99.99999% | 99.999%
 V = 1 | 0.00001% | 0.001% | 0.00001% | 0.001%
A = 1 and B = 1
 V = 0 | 0.1% | 9.4% | 0.1% | 11.1%
 V = 1 | 99.9% | 90.6% | 99.9% | 88.9%
A = 1 and B = 0
 V = 0 | 99% | 99.9% | 91.7% | 99.1%
 V = 1 | 1% | 0.1% | 8.3% | 0.9%
A = 0 and B = 1
 V = 0 | 99% | 99.9% | 91.7% | 99.1%
 V = 1 | 1% | 0.1% | 8.3% | 0.9%

Reading example: A = 0 means test A negative and V = 0 means non-variants. In all four conditions, prevalence π = 0.1%.

Performance Comparisons under Conditional Dependence

Table 2 shows the probabilities of calling results conditional on variant/non-variant status with various combinations of Se and Sp values and various conditional correlations (30%, 50%, or 90%).

Table 2.

Theoretical probabilities of test results for variants and non-variants under different conditions of Se, Sp, and degree of correlation

Conditions | Correlation 90% | Correlation 50% | Correlation 30%
Variants (V = 1)
 Se 99%
  P11 | 98.9% | 98.5% | 98.3%
  P01 + P10 | 0.2% | 1.0% | 1.4%
  P00 | 0.9% | 0.5% | 0.3%
 Se 90%
  P11 | 89% | 86% | 83%
  P01 + P10 | 2% | 8% | 14%
  P00 | 9% | 6% | 3%
Non-variants (V = 0)
 Sp 99.9%
  P11 | 0.09% | 0.05% | 0.03%
  P01 + P10 | 0.02% | 0.1% | 0.14%
  P00 | 99.89% | 99.85% | 99.83%
 Sp 99%
  P11 | 0.91% | 0.51% | 0.3%
  P01 + P10 | 0.18% | 0.98% | 1.4%
  P00 | 99.01% | 98.5% | 98.3%

P11, probability of positive concordance between two tests; P00, probability of negative concordance; P01 + P10, probability of discordance.

For example, for any variant, with 99% Se and 90% correlation, P10 + P01 would be 0.2% and P00 0.9%; for any non-variant, with 99.9% Sp and 90% correlation, P10 + P01 would be 0.02% and P11 0.09%. In this example, the probability of false concordance (P00 for variants, P11 for non-variants) is much higher than that of discordance, which means that most errors are common to both calling results. With the same Se or Sp value, higher correlations resulted in higher proportions of concordant results (both concordant correct results and concordant errors) and a lower proportion of discordant results.

With the above-shown Se, Sp, and correlation values and π = 0.1%, P01 + P10 would be 0.02%; i.e., 5.9 × 10⁵ discordant bps, among which 5.9 × 10³ would be variants and the others non-variants. The number of positive concordant bps would be 5.7 × 10⁶, of which 2.7 × 10⁶ would be non-variants; thus, for a positive concordant pair, PPV = 52.6%. The number of negative concordant bps would be 3 × 10⁹, of which 2.7 × 10⁴ would be variants (Table 3, column 1). These predictive values are influenced by the correlation level: with the same performance indicators, higher correlation values result in a lower PPV for concordant positive bps and a lower NPV for concordant negative bps (comparing columns 1, 2, and 3). The true status probability is not much influenced by the performance indicators or correlation levels in this case, but a higher prevalence of variants does result in a higher variant probability for discordant bps (comparing column 4 vs. column 2).

Table 3.

Predictive values of test results for variants and non-variants under different conditions of Se, Sp, correlation, and variant prevalence

Test results and true bp status | Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Se | 99% | 99% | 99% | 99% | 90%
Sp | 99.9% | 99.9% | 99.9% | 99.9% | 99%
Cor1 | 90% | 50% | 30% | 50% | 90%
Cor2 | 90% | 50% | 30% | 50% | 90%
π | 0.1% | 0.1% | 0.1% | 0.2% | 0.1%
A = 0 and B = 0
 V = 0 | 99.999% | 99.9995% | 99.9997% | 99.999% | 99.999%
 V = 1 | 0.001% | 0.0005% | 0.0003% | 0.001% | 0.001%
A = 1 and B = 1
 V = 0 | 47.4% | 34% | 23.1% | 20.3% | 90%
 V = 1 | 52.6% | 66% | 76.9% | 79.7% | 10%
A = 1 and B = 0
 V = 0 | 99.1% | 99% | 99% | 98% | 98.9%
 V = 1 | 0.9% | 1% | 1% | 2% | 1.1%
A = 0 and B = 1
 V = 0 | 99.1% | 99% | 99% | 98% | 98.9%
 V = 1 | 0.9% | 1% | 1% | 2% | 1.1%

Cor1, correlation between test A and test B for variants; Cor2, correlation between test A and test B for non-variants; π, variant prevalence.

Performance in Analyses of Real Data

The number of bps in the GIAB benchmark region is around 2.5 × 10⁹, of which 3,238,599 are variants in the gold standard set. The number of called variants within the same region in the joint VCF file is 3,351,415 (precisely 3,311,321; 3,308,075; and 3,305,948 for the three replicates, respectively).

Within that region, the estimated Se for the three replicates ranged from 98.97 to 98.99% and Sp from 99.9958 to 99.9960%. For called variants, the PPV for the three replicates ranged from 96.82 to 96.95% (online suppl. Table S4).

In the concordance analysis, for replicates 1 and 2, the percentage of concordant bps across all positions in the VCF file (called as variant in at least one replicate) was 98.38%, and the percentages for variants and non-variants in the benchmark set were 99.85% and 99.9980%, respectively (Table 4).

Table 4.

Analyses of concordance between replicates

Replicate comparisons | Concordance rate | Correlation | False concordance | PPV for positive concordance (11) | PPV for discordance (01 or 10)
For variants (N = 3,238,599)
 Rep 1 versus Rep 2 | 99.85% | 93.28% | 0.94% | |
 Rep 1 versus Rep 3 | 99.85% | 92.87% | 0.94% | |
 Rep 2 versus Rep 3 | 99.83% | 92.37% | 0.94% | |
For non-variants (N = 2,502,460,587)
 Rep 1 versus Rep 2 | 99.9980% | 76.04% | 0.0032% | |
 Rep 1 versus Rep 3 | 99.9980% | 75.14% | 0.0031% | |
 Rep 2 versus Rep 3 | 99.9980% | 74.62% | 0.0031% | |
For all positions in the VCF (N = 3,351,415)
 Rep 1 versus Rep 2 | 98.38% | | | 97.57% | 8.89%
 Rep 1 versus Rep 3 | 98.34% | | | 97.62% | 8.87%
 Rep 2 versus Rep 3 | 98.33% | | | 97.66% | 9.57%

Table 4 also shows the estimated correlation coefficients and the rates of false concordant bps. For variant and non-variant positions, respectively, the correlation coefficients between any two replicates ranged from 92.37% to 93.28% (with a 0.94% rate of false concordant bps) and from 74.62% to 76.04% (with a 0.0031–0.0032% rate of false concordant bps). Regarding all observed results, the PPV ranged from 97.57% to 97.66% for positive concordant results between any two replicates and from 8.87% to 9.57% for discordant results.

Covariate Effects with Real Data

The functional forms that describe each covariate effect were quite close between the discordance and error models (shown in Fig. 1). In each model, each covariate's effect was significantly different from 0 (p < 2 × 10⁻¹⁶). As expected, overall, GQ, MQ, and QD had decreasing trends; i.e., the higher the score, the lower the probability of discordance or error. DP had a V-shaped effect; the lowest probability of error or discordance corresponded to the DP value with the highest density (shown in Fig. 2).

Fig. 1.

Univariate regression models with smoothing. The first row is for the relationship between probability of discordance and each covariate and the second for the relationships between probability of error and each covariate. Each curve shows the estimated function for each covariate. The top right corner of each graph shows the percentage of deviance explained by the covariate.

Fig. 2.

Density functions of the covariates. The first row presents density functions of the covariates for concordant and discordant calling results, the second row presents density functions for correct and error calling results.


In the error model, AB showed an M-shaped effect; the three lowest probabilities of error corresponded to AB values close to 0, 0.5, and 1. In the discordance model, the main difference was that the probability of discordance did not show a minimum at AB values close to 0. This difference was also found in the density graphs, where the density of discordance increases as AB approaches 0. However, the contribution of each covariate to the deviance differed between the discordance and error models (e.g., DP explained 11.6% of the deviance in the discordance model vs. 22.2% in the error model). GQ, AB, and QD were more correlated with discordance, whereas DP and MQ were more correlated with error.

In the multivariate analyses, the optimal regression models for discordance and error were the models that included all five covariates. In each model, each covariate's effect was significantly different from 0 (p < 2 × 10⁻¹⁶). The deviance explained by the model was slightly higher with the error than with the discordance model (69.7% vs. 67.7%) (shown in Fig. 3). The shapes of the estimated functions of the adjusted covariate effects often differed between multivariate and univariate models (Fig. 3 vs. Fig. 1) and sometimes between discordance and error models; e.g., for DP, AB, and MQ (shown in Fig. 3). DP had a V-shape in the multivariate and univariate models of error; however, in the multivariate model of discordance, it had a quasi-monotonically decreasing trend. MQ had little effect in the multivariate discordance model but a similar form in the univariate and multivariate error models. GQ had little effect in the multivariate models for discordance or error, despite clear decreasing trends in both univariate models. AB had similar functional forms in the univariate and multivariate discordance models but a smaller effect in the multivariate model. However, in the multivariate error model, after the first peak at around 0.15, the function did not show the second peak of the M-shape seen in the univariate model; instead, the estimated error probability decreased as AB increased. QD had similar functional forms in the univariate and both multivariate models.

Fig. 3.

Multivariate regression models with smoothing. The first row uses a model for the relationships between probability of discordance and each covariate and the second another model for the relationships between probability of error and each covariate. Each curve shows the estimated function for each covariate when the other covariates in the model are set to their median values. The percentages of deviance explained by those models are 67.7% and 69.7%, respectively.


In WGS studies on real data, concordance between replicates compensates for the lack of a gold standard. However, its appropriateness has rarely been questioned. This study clarified the relationship between discordance and error in real data analyses.

With conditional independence between two sequencing results, the overall probability of error for concordant results was almost negligible; the concordance-discordance method may then stand in for a gold standard. However, with high correlations between sequencing results, a high percentage of false concordant results was found, and the method becomes questionable. With the sequencing of the NA12878 genome, the probability of being a variant in case of discordant bps between replicates ranged from 8.87% to 9.57%; thus, in pairs of discordant calls, the positive call has a high probability of being a FP. However, the PPV for a concordant pair of positive calls ranged from 97.57% to 97.66%, which is not much higher than the PPV for a single positive call (96.82–96.95%); thus, the error rate for concordant positive calls (around 2.4%) is still non-negligible.

The univariate modelling analyses showed similar shapes for estimated functions of most covariates but the percentages of model-explained deviance differed between discordance and error models. The multivariate analyses showed different shapes for estimated functions of some covariates between error and discordance models recommending a cautious use of the estimates of the determinants of discordance.

This study used sequencing results from the same sequencer and the same variant caller, a usual practice in replicate studies. This approach often generates high correlations between two sets of results [14, 30]. The more correlated are two sequencing tests, the higher is the probability of concordant results. Here, replicates were jointly genotyped by GATK [31], which is expected to generate even higher correlations. Higher level of correlation decreases the number of discordant bps and increases the number of false concordant bps in variants and non-variants; this increased number of false concordant bps would result in lower PPV for concordant positive bps and lower NPV for concordant negative bps, given the same Se and Sp. The observed correlation between replicates was about 93% among variants in the GIAB benchmark region and about 75% among non-variants in the same region (Table 4). These correlation values are close to the high correlation scenario among the three theoretical scenarios (30%, 50%, 90%). In the theoretical high correlation scenario, the correlation yielded much more concordant FNs than single FNs (i.e., discordances) for variants and much more concordant FPs than single FPs (i.e., discordances) for non-variants (Table 2). In the real data, similar tendencies were also observed. For true variants, the percentage of FNs for each replicate, i.e., 1 Se, ranged from 1.01 to 1.03% (online suppl. Table S4) and the percentage of concordant FNs among all FNs was 0.94%; thus, nearly 92% of FNs in one replicate are shared with another replicate. Similarly, for non-variants, the percentage of FPs in each sequencing was 0.0041% and the percentage of concordant FPs among all FPs was 0.0031%; thus, nearly 76% of FPs would be concordant with another replicate. 
This observation demonstrates the limitation of the concordance criterion: adding another technical replicate did not bring as much information as expected, since more than 90% of the errors were shared.
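The shared-error fractions above follow directly from the per-replicate and concordant error rates quoted in the text; the small sketch below reproduces the arithmetic (variable names are illustrative, values taken from the real-data results):

```python
# Shared-error fractions from per-replicate and concordant error rates
# (values quoted in the text for the NA12878 real-data analysis).
fn_per_replicate = 0.0102   # 1 - Se, midpoint of the 1.01-1.03% range
fn_concordant = 0.0094      # concordant FNs among true variants
fp_per_replicate = 0.000041  # 1 - Sp, FPs among non-variants
fp_concordant = 0.000031     # concordant FPs among non-variants

# Fraction of one replicate's errors shared with the other replicate:
shared_fn = fn_concordant / fn_per_replicate   # approx. 0.92
shared_fp = fp_concordant / fp_per_replicate   # approx. 0.76
print(f"shared FNs: {shared_fn:.0%}, shared FPs: {shared_fp:.0%}")
```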

Using “local match” instead of genotype match led to high estimates of the performance metrics. Thus, the Sp estimate with the real data exceeded that used with the theoretical data (99.996 vs. 99.9%), and the Se estimate reached the higher of the two theoretical values (99%). These performance metrics recall those of other studies: (i) 97% Se and 98.6% PPV with GATK4 for SNVs in the NA12878 genome in [32]; (ii) 98.66% Se and 99.15% PPV for whole-exome regions in [27]; and (iii) 97.17% Se and 99.999% Sp with BWA-MEM and GATK UnifiedGenotyper in [4].

Here, discordance as an indicator of error (more specifically, of false positivity) seemed acceptable for variants and non-variants; the PPV for the positive call in a discordant pair was 9% and the PPV for a concordant positive pair was 97.3%. Still, interpreting discordant or concordant results requires caution. In fact, the PPV of an individual positive call was around 96.6%; thus, because of the strong correlation, there was no great difference between the PPV of a single positive call and that of a concordant positive pair.
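The contrast between the conditional-independence and conditional-dependence cases can be sketched with the theoretical scenario from the abstract (Se = 99%, Sp = 99.9%, prevalence = 0.1%, correlation 0.9). The parameterization below, with joint sensitivity Se² + ρ·Se(1 − Se), is one common way to model conditional dependence and is an assumption, so the dependent-case figure only approximates the 47.4% quoted in the abstract:

```python
# PPV of a concordant positive pair under conditional dependence,
# using an assumed covariance parameterization rho*Se*(1-Se).
se, sp, prev = 0.99, 0.999, 0.001

def ppv_concordant(se, sp, prev, rho=0.0):
    """P(true variant | both replicates call positive)."""
    p_pos_pos_var = se * se + rho * se * (1 - se)           # both positive | variant
    p_pos_pos_nonvar = (1 - sp) ** 2 + rho * sp * (1 - sp)  # both positive | non-variant
    num = prev * p_pos_pos_var
    return num / (num + (1 - prev) * p_pos_pos_nonvar)

err_indep = 1 - ppv_concordant(se, sp, prev, rho=0.0)  # approx. 0.1%
err_dep = 1 - ppv_concordant(se, sp, prev, rho=0.9)    # approx. 47%
print(f"error | concordant positive: indep {err_indep:.2%}, dep {err_dep:.1%}")
```

Under independence, two concordant positives drive the error probability down to about 0.1%; with 0.9 correlation, nearly half of concordant positive pairs are false, which is the core argument of the study.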

In this study, the concordance rate between replicates was 98.3–98.4% among all called variants and around 99.9980% among all positions in the benchmark region. These rates recall previously reported rates in whole-genome sequencing: (i) 98.69% concordance among called SNVs in WGS [33]; (ii) 99.49% average pairwise concordance between replicates sequenced in different centres and called by the GATK pipeline [30]; and (iii) 99.998% concordance among all callable positions across the whole genome [9].

The NA12878 genome was chosen for this study because the analysis required a well-established gold standard, allowing a better understanding of the results and comparisons with other studies. To assess generalizability across samples, simulations were conducted under general hypotheses expected to hold across different genomes of European ancestry. A summary of these simulations is provided as supplementary material.

Regarding covariate effects, the shapes of the estimated functions for discordance and error were often comparable, but the percentages of model-explained deviance differed: GQ, AB, and QD were more strongly associated with discordance (e.g., for GQ, 36.9% for discordance vs. 9.49% for error), whereas DP and MQ were more strongly associated with error (e.g., for DP, 11.6% for discordance vs. 22.2% for error). Thus, using discordance instead of error may lead to different model fits. In terms of functional forms, the V-shape for DP and the M-shape for AB in the error model are consistent with previous findings [13, 34]. For AB, the shapes diverged near 0: the estimated probability of discordance increased whereas the probability of error decreased for values of AB closest to 0. This tendency for discordance can largely be explained by the characteristics of bps in a VCF file. When a given position in a given replicate has an AB value very close or equal to 0 (meaning that no alternative allele was observed), the call for this bp is expected to be non-variant. However, every bp present in a VCF file is called positive by at least one replicate; thus, at least one other replicate can be expected to be positive, resulting in a discordance between the replicates. Therefore, the discordance rate is high for positions with AB values approaching 0. However, the error rate for these positions is actually very low, as shown in the error model, in accordance with the observation that, among bps with discordant results, around 9% were true variants and the remaining 91% non-variants (the tendency would differ if all bps in the genome with AB values close to 0 were included in the model).

Comparing univariate and multivariate models, the estimated functional forms of QD were similar within the discordance model and within the error model. In contrast, the functional forms of AB showed substantial differences, especially in the error model, where the second peak of the M-shape was strongly attenuated in the multivariate model. The contributions of GQ in both multivariate models and of MQ in the discordance model were small (almost flat shapes). These differences in the estimated functions between univariate and multivariate analyses are probably due to different correlation levels between covariates: high for AB and QD (0.9) and moderate for GQ and MQ (0.4). DP had very low correlations with the other covariates, which explains the similar functional form observed for DP in the multivariate model.

One limitation of the study was the use of “local match” instead of the more accurate “genotype match” or “allele match.” Though further analyses should check whether the conclusions hold with the latter matches, the knowledge, methods, and applications in this study remain valid. This work is also a step towards predictive models with multiple covariates able to generate estimated error rates for individual bp positions, which could be envisaged in future work.

In summary, the concordance-discordance model is widely used in research involving NGS as a measure of performance. It is one of the few indicators available in the absence of a gold standard, which is often the case when sequencing real individuals in laboratories. This study showed that, by considering only concordant positive results as true, the output set has a higher precision than the call set before applying this criterion; however, a significant number of shared errors remain undetected, and the number of undetected errors can exceed that of detected errors. The more correlated the replicates, the higher the proportion of shared errors. From a statistical point of view, combining results from replicates reduces random errors, but not the systematic errors shared by replicates due to the characteristics of NGS technology. Therefore, the remaining errors should be kept in mind when interpreting performance indicators obtained from the concordance criterion.

An ethics statement was not required for this study type; no human or animal subjects or materials were used.

The authors have no conflicts of interest to declare.

The first author was supported by a China Scholarship Council grant (Grant No. 201906230310). The authors acknowledge the support by project SIRIC (LYRICAN, Grant No. INCa-DGOS-INSERM _12563) and by AURAGEN platform (France Médecine Génomique 2025 National Plan).

Y.Z. designed the study, performed the statistical analysis, and drafted the manuscript. C.B. contributed to data extraction and did the alignment. M.V. did the variant calling. J.I. helped writing, commenting on, and editing the manuscript. P.R. designed the study and reviewed the manuscript.

The data that support the findings of this study are not publicly available due to data sharing policy of the sequencing centre but are available from the corresponding author upon reasonable request. Further enquiries can be directed to the corresponding author.

1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
2. Wall JD, Tang LF, Zerbe B, Kvale MN, Kwok PY, Schaefer C, et al. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res. 2014;24(11):1734–9.
3. Cornish A, Guda C. A comparison of variant calling pipelines using genome in a Bottle as a reference. BioMed Res Int. 2015;2015:1–11.
4. Highnam G, Wang JJ, Kusler D, Zook J, Vijayan V, Leibovich N, et al. An analytical framework for optimizing variant discovery from personal genomes. Nat Commun. 2015;6(1):6275.
5. Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS. Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinf. 2013;14(1):195.
6. Reumers J, De Rijk P, Zhao H, Liekens A, Smeets D, Cleary J, et al. Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Nat Biotechnol. 2011;30(1):61–8.
7. Lam HYK, Clark MJ, Chen R, Chen R, Natsoulis G, O’Huallachain M, et al. Performance comparison of whole-genome sequencing platforms. Nat Biotechnol. 2011;30(1):78–82.
8. Ratan A, Miller W, Guillory J, Stinson J, Seshagiri S, Schuster SC. Comparison of sequencing platforms for single nucleotide variant calls in a human sample. Yu Z, editor. PLoS One. 2013;8(2):e55089.
9. Ajay SS, Parker SCJ, Abaan HO, Fajardo KVF, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011;21(9):1498–505.
10. Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019;37(5):555–60.
11. Beck TF, Mullikin JC, Biesecker LG; NISC Comparative Sequencing Program. Systematic evaluation of Sanger validation of next-generation sequencing variants. Clin Chem. 2016;62(4):647–54.
12. Goldfeder RL, Priest JR, Zook JM, Grove ME, Waggott D, Wheeler MT, et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 2016;8(1):24.
13. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30(20):2843–51.
14. Robasky K, Lewis NE, Church GM. The role of replicates for error mitigation in next-generation sequencing. Nat Rev Genet. 2014;15(1):56–62.
15. Li H, Bloom JM, Farjoun Y, Fleharty M, Gauthier L, Neale B, et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods. 2018;15(8):595–7.
16. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2011;13(1):36–46.
17. Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinf. 2019;20(1):342.
18. Hofmann AL, Behr J, Singer J, Kuipers J, Beisel C, Schraml P, et al. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinf. 2017;18(1):8.
19. Sandmann S, de Graaf AO, Karimi M, van der Reijden BA, Hellström-Lindberg E, Jansen JH, et al. Evaluating variant calling tools for non-matched next-generation sequencing data. Sci Rep. 2017;7(1):43169.
20. Ribeiro A, Golicz A, Hackett CA, Milne I, Stephen G, Marshall D, et al. An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome. BMC Bioinf. 2015;16(1):382.
21. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio]. 2013 [cited 2022 Apr 7]. Available from: http://arxiv.org/abs/1303.3997.
22. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.
23. van der Auwera G, O’Connor BD. Genomics in the cloud: using Docker, GATK, and WDL in Terra. 1st ed. Sebastopol, CA: O’Reilly Media; 2020. p. 467.
24. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8.
25. Wagner J, Olson ND, Harris L, Khan Z, Farek J, Mahmoud M, et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2022;2(5):100128.
26. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3(1):160025.
27. Krishnan V, Utiramerur S, Ng Z, Datta S, Snyder MP, Ashley EA. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinf. 2021;22(1):85.
28. Wood SN, Li Z, Shaddick G, Augustin NH. Generalized additive models for gigadata: modeling the U.K. black smoke network daily data. J Am Stat Assoc. 2017;112(519):1199–210.
29. Wood SN, Goude Y, Shaw S. Generalized additive models for large data sets. J R Stat Soc Ser C Appl Stat. 2015;64(1):139–55.
30. Naj AC, Lin H, Vardarajan BN, White S, Lancour D, Ma Y, et al. Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer’s disease sequencing project. Genomics. 2019;111(4):808–18.
31. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017 [cited 2022 Feb 16]. Available from: http://biorxiv.org/lookup/doi/10.1101/201178.
32. Supernat A, Vidarsson OV, Steen VM, Stokowy T. Comparison of three variant callers for human whole genome sequencing. Sci Rep. 2018;8(1):17851.
33. Adelson RP, Renton AE, Li W, Barzilai N, Atzmon G, Goate AM, et al. Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance. Sci Rep. 2019;9(1):16156.
34. Muyas F, Bosio M, Puig A, Susak H, Domènech L, Escaramis G, et al. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum Mutat. 2019;40(1):115–26.