Introduction: The case-mother-control-mother design allows to study fetal and maternal genetic factors together with environmental exposures on early life outcomes. Mendelian constraints and conditional independence between child genotype and environmental factors enabled semiparametric likelihood methods to estimate logistic models with greater efficiency than standard logistic regression. Difficulties in child genotype collection require methods handling missing child genotype. Methods: We review a stratified retrospective likelihood and two semiparametric likelihood approaches: a prospective one and a modified retrospective one, the latter either modeling the maternal genotype as a function of covariates or leaving their joint distribution unspecified (robust version). We also review software implementing these modeling alternatives, compare their statistical properties in a simulation study, and illustrate their application, focusing on gene-environment interactions and partially missing child genotype. Results: The robust retrospective likelihood provides generally unbiased estimates, with standard errors only slightly larger than when modeling maternal genotype based on exposure. The prospective likelihood encounters maximization problems. In the application to the association of small-for-gestational-age babies with CYP2E1 and drinking water disinfection by-products, the retrospective likelihood allowed a full array of covariates, while the prospective likelihood was limited to few covariates. Conclusion: We recommend the robust version of the modified retrospective likelihood.

Obstetric and early life outcomes are influenced by the genotype of both mother and fetus, as well as factors in the mother’s environment [1]. The study design where newborns with an adverse outcome and a control sample of newborns without the same adverse outcome together with their mothers are recruited is commonly used to study these diverse influences. We call this a case-mother-control-mother design following Shi et al. [2] and it is also known as a case-control mother-child pair design [3]. An early important methodological recommendation for the analysis of data from such design was to simultaneously include the newborn and the maternal genotypes in the same regression model to mutually adjust for the confounding effect of one on the other [2]. While logistic regression can be applied as for the analysis of standard case-control data ([4], which we refer to as standard logistic regression), efficiency gains can be obtained by exploiting constraints and reasonable assumptions on the relationships between genotypic and environmental variables arising from the mother-child relationship. First, Mendelian inheritance results in dependence between the mother and child genotypes which can be leveraged through a log-linear model [2], an implementation of which allowing for stratification on an environmental factor was described by Gjerdevik et al. [5]. Second, while environmental factors acting on the pregnant mother can depend on her genotype, it is reasonable to assume these factors do not depend on the father genotype, and thus the child genotype can be modeled as conditionally independent of environmental factors given his or her mother genotype (since the child genotype is a random draw of one allele from the mother and one from the father [3]). This led to the development of semiparametric likelihood methods to estimate logistic models of genotypic and environmental predictors taking advantage of all these constraints and assumptions [3, 6].

The child-mother-control-mother design requires genotyping of both the mother and her newborn child. Obtaining child genotypes is hampered by the reluctance of parents to consent to collect biological samples (blood or saliva) from their children and the lack of cooperation from young children in providing such biological samples. This led Makao and Bureau [7] to extend the likelihood of Chen et al. [3] to handle missing genotypes for a subset of the children. The recent extension [8] of the likelihood of Zhang et al. [6] also deals with missing genotypes on a subset of children as a special case. Missing maternal genotypes are most often due to technical failures (as are some missing child genotypes). Careful biological sample collection and analysis limits such missing genotypes to a small fraction of study subjects, such that their exclusion has negligible impact on bias and precision of the estimates. Thus, we focus our method evaluation on treatment of missing child genotypes, simply noting which approach can handle missing maternal genotypes.

The objectives of this article are (1) to review methods available to analyze genetic and environmental factors in the case-mother-control-mother design allowing for partially missing-at-random child genotype, including explicitly developing the adaptation of the likelihood of [8] to partially missing child genotype data, (2) to review software implenting these methods, to evaluate their performance and illustrate their application with a focus on gene-environment interaction effects, maternal genotype-environmental exposure relationship, and missing child genotype on simulated data and on actual data from a study of the modifying effect of a CYP2E1 gene polymorphism on the association between drinking water disinfection by-products and small-for-gestational-age babies [9], and 3) to formulate recommendations on the choice of a method.

Disease Risk Model and Data

Let Y denote the binary disease status (1 = case, 0 = non-case) with f = Pr(Y = 1) being the disease prevalence in the population, GM and GC the respective maternal and offspring genotype, and X the vector of environmental variables collected from the mother. We assume GM and GC represent the number of minor alleles of a biallelic polymorphism, and thus take values in G=0,1,2. The observed values of X are indexed by j. Unless noted otherwise, X contains an arbitrary number p of discrete and/or continuous variables, such that the number of observed values of X grows to infinity with the sample size. Assuming X1 represents an exposure of interest whose effect on disease risk can be modified by either or both of GM and GC, and X2, , Xp additional covariates (e.g., confounders), a typical logistic regression model for P1jmc(ß) = Pr(Y = 1|X = xj, GM = m, GC = c) would be:
(1)

Note that the product terms for the maternal and child genotypes could involve different covariates.

Data on (Y, X, GM, GC) are collected from n1 case mother pairs and n0 control mother pairs, which are sampled from N1 (N1 > n1) case pairs and N0 (N0 > n0) non-case pairs in a population. Let nijmc denote the total number of pairs in the case-control sample with Y = i, X = xj, GM = m, and GC = c.

Prospective Likelihood

Chen et al. [3] considered a prospective likelihood, expressing the conditional independence assumption between child genotype and covariates as Pr(X = xj | GM = m, GC = c) = Pr(X = xj | GM = m) = djm. We let Pm(?) = Pr(GM = m; ?) represent the mother genotype distribution, Pc|m(?) = Pr(GC = c | GM = m; ?) the conditional offspring genotype distribution given maternal genotype, and ? the log odds of the minor allele frequency of the polymorphism which suffices to specify both genotype distributions under random mating, Hardy Weinberg equilibrium (HWE) and Mendelian inheritance. The likelihood is then:
(2)

Note that Pr(Y) involves a summation over child genotype, mother genotype, and covariate distribution, which can be infinite-dimensional, and that the numbers of cases and non-cases in the population are required. When these numbers are unavailable, an external estimate of the disease prevalence f can be used to specify large values respecting N1=f1-fN0[7]. A profile likelihood is obtained by maximizing (2) over {djm} under the constraint ?jdjm=1?m?G using a one-step estimation method [3].

The fact that the child genotype is not involved in the covariate distribution facilitated the adaptation of this likelihood to missing child genotype on a subset of the data [7]. Denoting by n~ijmc the counts on the mother-child pairs with complete data, n¯ijm the counts on the mother-child pairs with missing offspring genotype data and nijm=?cnijmc over the entire case-control sample then, under the missing-at-random assumption, we can write the likelihood of this partial data Lp:
(3)

Despite the profiling out of {djm} from the likelihood, experience has shown that maximization of the likelihood is successful only for low dimensions of X, due to saddle points in the likelihood surface. The prospective likelihood is implemented in the R package SPmlficmcm [7].

Stratified Retrospective Likelihood

Shi et al. [2] considered the likelihood of the mother and child genotypes (GM, GC) given the disease status Y. Stratification on a discrete covariate X leads to the stratified retrospective likelihood
with Pmc|ij(?, µ) = Pr(GM = m, GC = c|Y = i, X = j), a Poisson distribution. This likelihood does not depend on numbers of cases and non-cases in the population but assumes the disease is rare, such that the control sample is representative of the total population in each stratum. Here, we use µ to represent mating type frequency parameters and ? to designate regression parameters under a log-linear model since the obligation to obtain stratum-specific regression coefficients does not generally allow to specify the disease risk model as a function of the ß coefficients of model (1). Enforcing Mendelian inheritance constraints lead to the distributions in Tables 1 and 2 of Shi et al. [2] for controls and cases, respectively. Adaptation to missing child genotype is straightforward by summation of Pmc|ij (?, µ) over the child genotype:
Table 1.

Summary of the data characteristics and requirements of the compared approaches

MethodMissing genotypeaCovariates in modelContinuous exposure and covariatesExternal prevalence
childmother
Prospective likelihood  Xb Xc 
Stratified retrospective likelihood    
Retrospective likelihood, unspecified distribution of GM, X  
Retrospective likelihood, double-additive model for Pr(GM |X) 
Logistic regression    
MethodMissing genotypeaCovariates in modelContinuous exposure and covariatesExternal prevalence
childmother
Prospective likelihood  Xb Xc 
Stratified retrospective likelihood    
Retrospective likelihood, unspecified distribution of GM, X  
Retrospective likelihood, double-additive model for Pr(GM |X) 
Logistic regression    

aTreatment of missing genotypes to extract information from other variables.

bIn practice, leads to convergence problems due to large number of distinct values.

cNumbers of cases and non-cases in population are required but can be approximated from external prevalence.

Table 2.

Statistical properties of estimates of regression coefficients in model (1) and linear combinations of these coefficients with ßGM = log(1.8), ßGC = log(1.5), ßX = log(1.8), ßX×GM = log(1.2) and ßX×GM = log(1.2)

EffectsIndependenceaDependenceb
biasSEcempSEdCPebiasSEempSECP
Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) 
ßGM 0.006 0.086 0.091 0.966 0.005 0.098 0.099 0.942 
ßGC -0.008 0.085 0.083 0.948 -0.003 0.086 0.088 0.956 
ßX 0.001 0.064 0.065 0.962 0.005 0.067 0.067 0.952 
ßX×GM 0.002 0.086 0.086 0.946 -0.004 0.088 0.086 0.940 
ßX×GC -0.002 0.060 0.061 0.956 0.003 0.061 0.060 0.954 
Retrospective likelihood, double-additive model for Pr(GM| X) (function CCMO.dep) 
ßGM 0.002 0.070 0.07 0.940 -0.046 0.077 0.077 0.918 
ßGC -0.007 0.073 0.072 0.950 0.022 0.079 0.078 0.940 
ßX 0.001 0.063 0.064 0.968 0.013 0.066 0.065 0.942 
ßX×GM 0.002 0.081 0.082 0.956 -0.007 0.082 0.080 0.930 
ßX×GC -0.002 0.060 0.060 0.952 0.005 0.062 0.060 0.954 
Prospective likelihood (function Spmlficmcm)f 
ßGM 0.009 0.071 0.070 0.947 0.131 2.569 0.080 0.834 
ßGC -0.010 0.076 0.071 0.942 -0.059 0.839 0.076 0.934 
ßX 0.009 0.069 0.064 0.945 0.012 0.192 0.065 0.904 
ßX×GM -0.051 0.117 0.079 0.781 -0.107 1.028 0.075 0.681 
ßX×GC -0.006 0.06 0.060 0.950 0.028 0.411 0.06 0.936 
Logistic regression (function glm) 
ßGM 0.009 0.09 0.095 0.964 0.006 0.103 0.103 0.950 
ßGC -0.010 0.096 0.095 0.952 -0.004 0.098 0.100 0.954 
ßX 0.001 0.070 0.069 0.952 0.006 0.070 0.071 0.942 
ßX×GM 0.002 0.099 0.0950 0.942 -0.005 0.095 0.095 0.952 
ßX×GC -0.001 0.099 0.0960 0.942 0.004 0.092 0.096 0.964 
Stratified retrospective likelihood (function haplinStrat) 
ßGM-2ßX×GM -0.019 0.433 0.406 0.956 -0.552 0.571 0.549 0.874 
ßGC-2ßX×GC -0.019 0.459 0.421 0.942 0.344 0.536 0.505 0.852 
ßGM-ßX×GM -0.011 0.162 0.168 0.970 -0.251 0.207 0.201 0.778 
ßGC-ßX×GC 0.000 0.175 0.172 0.946 0.172 0.205 0.200 0.848 
ßGM 0.004 0.107 0.108 0.952 -0.044 0.114 0.117 0.938 
ßGC -0.011 0.106 0.110 0.962 0.020 0.121 0.119 0.938 
ßGM+ßX×GM -0.001 0.112 0.113 0.962 0.126 0.101 0.109 0.788 
ßGC+ßX×GC -0.012 0.116 0.114 0.954 -0.135 0.110 0.111 0.778 
ßGM+2ßX×GM -0.015 0.195 0.182 0.948 0.339 0.155 0.155 0.400 
ßGC+2ßX×GC -0.019 0.184 0.183 0.954 -0.318 0.159 0.157 0.502 
EffectsIndependenceaDependenceb
biasSEcempSEdCPebiasSEempSECP
Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) 
ßGM 0.006 0.086 0.091 0.966 0.005 0.098 0.099 0.942 
ßGC -0.008 0.085 0.083 0.948 -0.003 0.086 0.088 0.956 
ßX 0.001 0.064 0.065 0.962 0.005 0.067 0.067 0.952 
ßX×GM 0.002 0.086 0.086 0.946 -0.004 0.088 0.086 0.940 
ßX×GC -0.002 0.060 0.061 0.956 0.003 0.061 0.060 0.954 
Retrospective likelihood, double-additive model for Pr(GM| X) (function CCMO.dep) 
ßGM 0.002 0.070 0.07 0.940 -0.046 0.077 0.077 0.918 
ßGC -0.007 0.073 0.072 0.950 0.022 0.079 0.078 0.940 
ßX 0.001 0.063 0.064 0.968 0.013 0.066 0.065 0.942 
ßX×GM 0.002 0.081 0.082 0.956 -0.007 0.082 0.080 0.930 
ßX×GC -0.002 0.060 0.060 0.952 0.005 0.062 0.060 0.954 
Prospective likelihood (function Spmlficmcm)f 
ßGM 0.009 0.071 0.070 0.947 0.131 2.569 0.080 0.834 
ßGC -0.010 0.076 0.071 0.942 -0.059 0.839 0.076 0.934 
ßX 0.009 0.069 0.064 0.945 0.012 0.192 0.065 0.904 
ßX×GM -0.051 0.117 0.079 0.781 -0.107 1.028 0.075 0.681 
ßX×GC -0.006 0.06 0.060 0.950 0.028 0.411 0.06 0.936 
Logistic regression (function glm) 
ßGM 0.009 0.09 0.095 0.964 0.006 0.103 0.103 0.950 
ßGC -0.010 0.096 0.095 0.952 -0.004 0.098 0.100 0.954 
ßX 0.001 0.070 0.069 0.952 0.006 0.070 0.071 0.942 
ßX×GM 0.002 0.099 0.0950 0.942 -0.005 0.095 0.095 0.952 
ßX×GC -0.001 0.099 0.0960 0.942 0.004 0.092 0.096 0.964 
Stratified retrospective likelihood (function haplinStrat) 
ßGM-2ßX×GM -0.019 0.433 0.406 0.956 -0.552 0.571 0.549 0.874 
ßGC-2ßX×GC -0.019 0.459 0.421 0.942 0.344 0.536 0.505 0.852 
ßGM-ßX×GM -0.011 0.162 0.168 0.970 -0.251 0.207 0.201 0.778 
ßGC-ßX×GC 0.000 0.175 0.172 0.946 0.172 0.205 0.200 0.848 
ßGM 0.004 0.107 0.108 0.952 -0.044 0.114 0.117 0.938 
ßGC -0.011 0.106 0.110 0.962 0.020 0.121 0.119 0.938 
ßGM+ßX×GM -0.001 0.112 0.113 0.962 0.126 0.101 0.109 0.788 
ßGC+ßX×GC -0.012 0.116 0.114 0.954 -0.135 0.110 0.111 0.778 
ßGM+2ßX×GM -0.015 0.195 0.182 0.948 0.339 0.155 0.155 0.400 
ßGC+2ßX×GC -0.019 0.184 0.183 0.954 -0.318 0.159 0.157 0.502 

aIndependence between environmental exposure X1 and maternal genotype GM.

bLinear dependence between environmental exposure X1 and maternal genotype GM (see text).

cEmpirical standard error.

dMean estimated standard error.

e95% confidence interval coverage.

fLikelihood maximization failed for 17.2% of replicates under dependence and 20.6% under independence.

g?1G = ßG- 2ßX×G, ?2G = ßG- ßX×G, ?3G = ßG, ?4G = ßG + ßX×G and ?5G = ßG + 2ßX×G where G can be either GM or GC.

The R package Haplin [5] offers a readily usable implementation of the stratified retrospective likelihood. In this package, mating type frequencies are parameterized with two µ parameters per stratum representing stratum-specific log-ratios over a reference mating type frequency captured by stratum-specific intercepts in ?. Alternative software packages implement this retrospective likelihood with a single stratum: EMIM [10] offers a numerically efficient implementation and a variety of mating type frequency parameterizations with between one and nine parameters per stratum, while the LEM package [11] presents difficulties to run in current computing environments [12]. EMIM and LEM are not as convenient as Haplin for stratifying on a categorical covariate, as scripts to perform the stratification and a test of trend of the estimates across strata would have to be written. We did not undertake this task.

Retrospective Modified Likelihood

To avoid the difficulties encountered in maximizing the prospective likelihood (2), Zhang et al. [6] proposed instead a retrospective likelihood where the conditional independence assumption between child genotype and covariates is expressed as Pr(X = xjGM = m, GC = c) = Pr(GC = c|GM = m)?Pr(GM = m| X = xj)?Pr(X = xj). A logistic model was introduced for the conditional distribution of the maternal genotype given the covariates called “double-additive” because it is additive in the number of maternal minor alleles GM and covariates X[6]:
(4)
where ?m(?) represents the frequency of genotype m and ? is a vector of regression parameters.
Tian et al. [8] developed a more general framework where Pr(GM, X) = Pr(GM| X)?Pr(X) is left unspecified (thereafter called unspecified distribution of GM, X), the child genotype can be missing, a distinction is made between the paternal and maternal origin of the alleles forming the child genotype, and the genotype of other polymorphisms in linkage disequilibrium with the target polymorphism can be included to improve inference of parental origin and missing genotypes. We write the retrospective likelihood assuming the possibly infinite-dimensional distribution of GM, X: pjm = Pr(GM = m, X = xj), is left unspecified:

An alternative would be to use the double-additive model for Pr(GM| X) and leave Pr(X) unspecified. In either case, HWE can be assumed to simplify the distribution of GM to allele frequencies.

When the number of cases and non-cases in the population is known, f can be estimated as f^=N1N0+N1, or f can be obtained from external prevalence studies, which allows implementing the approach in retrospective sampling designs where the numbers of cases and controls are unknown. Although the uncertainty in the estimation of f should be accounted for in the variance of the regression coefficient estimates, simulations studies reported by [6] have shown that variance estimates neglecting this uncertainty are accurate when the population size N0 + N1 is large. Zhang et al. [6] profiled out pj = Pr(X = xj) while [8] profiled out pjm from the likelihood, and modified the resulting profile likelihood by replacing by its limiting value the Lagrange multiplier enforcing the constraint that the disease prevalence equals f. They then showed that maximizing the modified profile likelihood avoided the saddle point problem and led to stable estimates with the same efficiency as the maximum likelihood estimates.

When the child genotype is missing in a subset of the mother-child pairs, the adaptation of the likelihood parallels that of the prospective likelihood:
(5)

Tian et al.’s [8] expectation-maximization algorithm maximizes (5) as a special case of their more general likelihood. The R functions CCMO.na and CCMO.dep implement the retrospective modified likelihood with unspecified distribution of GM, X and double-additive model for Pr(GM| X), respectively. The data characteristics and requirements of the compared approaches are summarized in Table 1 and the software packages and functions implementing them are described in the online supplementary material (for all online suppl. material, see https://doi.org/10.1159/000529559), with a focus on the options required for analyzing case-mother-control-mother data.

Evaluation on Simulated Data

We compared the performance of the functions Spmlficmcm, CCMO.dep, CCMO.na, and haplinStrat on simulated data with missing offspring genotype for a subset of mother-child pairs based on bias and variability of the estimates, coverage probabilities of confidence intervals (CIs) for model (1) coefficients and power and type I error rate of hypothesis tests that interaction coefficients equal 0. We also compared the above approaches to logistic regression implemented in the R glm function, with deletion of observations where child genotype was missing. The simulation scenarios cover dependence and independence of the maternal genotype and environmental exposure and include different proportions of missing child genotype data and missing data mechanisms. The framework is similar to the one in [6]: the phenotype Y was generated from model (1) and the environmental exposure X1 was related to the maternal genotype GM through the linear model
where ? = log(1.5) for the dependence case and ? = 0 for the independence case, and e has a standard normal distribution. We selected a model with a categorical environmental exposure X1 and no additional covariate (i.e., p = 1) so that all reviewed methods could be applied. To obtain a categorical exposure with 5 levels, the values of X1 were rounded to the nearest integer and values were floored to -2 and capped to 2. This leads to odds ratios (ORs) between consecutive values of X1 and GM = 1 versus GM = 0 approximately equal to 1.5, similar to what we observe in the case study. The effects of the model terms are set to realistic values presented in Tables 2 and 3.
Table 3.

Power (%) of the tests of interaction effectsa

EffectsIndependencebDependencec
0log(1.2)0log(1.2)
Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) 
X1× GM 6.0 56.2 4.8 56.0 
X1× GC 5.2 85.0 6.0 84.6 
Retrospective likelihood, double-additive model for Pr(GM | X) (function CCMO.dep) 
X1× GM 6.2 61.2 5.2 61.2 
X1× GC 4.8 84.8 6.0 85.0 
Prospective likelihood (function Spmlficmcm)e 
X1× GM 6.2 40.8 7.0 38.8 
X1× GC 5.4 84.1 6.0 82.4 
Logistic regression (function glm) 
X1× GM 6.2 49.0 6.0 44.8 
X1× GC 5.0 49.2 4.4 48.8 
Stratified retrospective likelihood (function gxe from Haplin) 
X1× GM 3.6 52.6 71.8 -f 
X1× GC 5.0 49.4 53.2 
EffectsIndependencebDependencec
0log(1.2)0log(1.2)
Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) 
X1× GM 6.0 56.2 4.8 56.0 
X1× GC 5.2 85.0 6.0 84.6 
Retrospective likelihood, double-additive model for Pr(GM | X) (function CCMO.dep) 
X1× GM 6.2 61.2 5.2 61.2 
X1× GC 4.8 84.8 6.0 85.0 
Prospective likelihood (function Spmlficmcm)e 
X1× GM 6.2 40.8 7.0 38.8 
X1× GC 5.4 84.1 6.0 82.4 
Logistic regression (function glm) 
X1× GM 6.2 49.0 6.0 44.8 
X1× GC 5.0 49.2 4.4 48.8 
Stratified retrospective likelihood (function gxe from Haplin) 
X1× GM 3.6 52.6 71.8 -f 
X1× GC 5.0 49.4 53.2 

aWald tests of the interaction terms except for the stratified retrospective likelihood where a trend test is performed. Other model coefficients were set to ß1 = log(1.8), ß2 = log(1.5) and ß3 = log(1.8). Type I error level was set to a = 0.05.

5% of the offspring genotypes were set to missing completely at random.

bIndependence between environmental exposure X1 and maternal genotype GM.

cLinear dependence between environmental exposure X1 and maternal genotype GM.

dPercentage of child genotypes missing completely at random.

eProportions of replicates where likelihood maximization failed were 3.2% and 17.2% for the two columns under dependance and 3.4% and 20.6% for the two columns under independance.

fPower not displayed because of inflated Type I error.

Other simulation parameters were set as follows. We assumed a single homogeneous ancestral population: we set the minor allele frequency of the biallelic polymorphism to be 0.2 and adjusted the intercept parameter in the disease risk model so that the population prevalence f was equal to 0.01. At each of 500 replicates, a population of 200,000 mother-offspring pairs was simulated, and a sample of 1,000 case-mother and 1,000 control-mother pairs was drawn with replacement from the population. The correct value f = 0.01 was used for the retrospective likelihood and the actual number of cases and non-cases in the population N0 and N1 were used for the prospective likelihood. We set to missing the genotype of 5% or 20% of the children either chosen randomly or with a probability depending on the child genotype. Such a scenario where the child genotype is missing not at random could occur if the genotype impacted the health status of the child which in turn could impact willingness to provide a biological sample.

Results of the simulation with 5% of child genotypes missing completely at random (MCAR) are reported in Table 2 and Figures 1 and 2. For Haplin, we report results for linear combinations of the ß coefficient which are estimated in each stratum (? coefficients in Pmc|ij [?, µ]). For the independence case, bias was minimal for all methods. For the dependence case, the retrospective likelihood with the double-additive logistic model produced slightly biased estimates of the maternal genotype effects. The prospective likelihood also produced biased estimates of the maternal genotype main and interaction effects (possibly due to maximization problems, see below). The bias of stratum-specific estimates from the stratified retrospective likelihood was large in most strata. The other estimates were nearly unbiased. In both cases, the retrospective likelihood produced slightly lower standard errors with the double-additive model than with the unspecified distribution of GM, X, both of which were lower than standard errors from the prospective likelihood and logistic regression. Maximization of the retrospective likelihood always succeeded, while maximization of the prospective likelihood failed for a substantial proportion of replicates and outlying coefficient estimates were obtained in the scenario under dependence for a small number of other replicates, which led to extremely large empirical standard errors. For the replicates where it succeeded, the standard error of the estimates tended to be underestimated, resulting in CI undercoverage. For the retrospective likelihood, CI coverage was close to nominal value both under dependence and independence with the unspecified distribution of GM, X and under independence only with the double-additive logistic model. Undercoverage is noticeable for the maternal genotypic effect under dependence with the double-additive logistic model, and for most estimates from the stratified retrospective likelihood, particularly under dependence due to bias. Increasing the proportion of MCAR child genotypes to 20% did not produce a noticeable increase in empirical standard error (Fig. 2), but slight biases appeared or were increased for the maternal and child genotype effects under the modified retrospective likelihood (Fig. 1). Biases were more substantial and empirical standard errors larger when the child genotype was missing not at random under the extreme scenario we considered. The bias of stratum-specific estimates from the stratified retrospective likelihood was worse under the scenarios with 20% missing child genotypes (results not shown).

Fig. 1.

Bias of the estimates of the model terms under linear dependence between environmental exposure X1 and maternal genotype GM. The models applied are identified in the legend by the R function used to fit them. The percentage of missing child genotype data and the missing data mechanism are indicated below the figure panels; MCAR: missing completely at random; MNAR: missing not at random, under the scenario where GC is missing with 14% probability when GC = 0, 28% when GC = 1 and 52% when GC = 2.

Fig. 1.

Bias of the estimates of the model terms under linear dependence between environmental exposure X1 and maternal genotype GM. The models applied are identified in the legend by the R function used to fit them. The percentage of missing child genotype data and the missing data mechanism are indicated below the figure panels; MCAR: missing completely at random; MNAR: missing not at random, under the scenario where GC is missing with 14% probability when GC = 0, 28% when GC = 1 and 52% when GC = 2.

Close modal
Fig. 2.

Empirical standard errors of the estimates of the model terms under linear dependence between environmental exposure X1 and maternal genotype GM. The standard errors of Spmlficmcm exceed 0.25 due to a few outlying values. See Figure 1 for the legend.

Fig. 2.

Empirical standard errors of the estimates of the model terms under linear dependence between environmental exposure X1 and maternal genotype GM. The standard errors of Spmlficmcm exceed 0.25 due to a few outlying values. See Figure 1 for the legend.

Close modal

We also examined the type I error and power of tests of maternal and child genotypes effect modification by X1 (Table 3). Type I error was generally well controlled; only the trend test under stratified retrospective likelihood exhibited important type I error inflation under the dependence scenario only. The modified retrospective likelihood Wald test achieved the highest power among all methods, with the double-additive logistic model having a slight advantage over the unspecified distribution of GM, X.

We investigated whether the bias of the stratified retrospective likelihood in the dependence scenario could be due to small sample within each stratum by multiplying the sample size by 10, and observed bias remained substantial despite empirical estimates of relative risk close to the expected value in the GC × GM × Y table in each stratum defined by X1 (results not shown). The stratified retrospective likelihood should yield unbiased estimates with correctly specified mating type frequencies, and the problem may be due to misspecification of the mating type frequencies in Haplin since mating symmetry is not satisfied under our simulation scenario with dependence.

In summary, as expected the modified retrospective likelihood with unspecified distribution of GM, X was robust to dependence of the environmental exposure on the maternal genotype but led to slightly less precise estimates than the double-additive logistic model when the environmental exposure was independent of the maternal genotype. Maximization problems with the prospective likelihood led to poor performance on several metrics, and logistic regression was expectedly inefficient. The implementation of the stratified retrospective likelihood in Haplin led to biased estimates in this particular simulation setup.

Case Study: CYP2E1, Drinking Water Disinfection By-Products and Small-For-Gestational-Age Babies Revisited

Trihalomethanes (THMs) are major by-products of drinking water disinfection by chlorination. The gene CYP2E1 encodes part of the cytochrome P-450 which plays an important role in phase-1 biological activation of xenobiotics and particularly chloroform [13, 14] which is the main THM. It may also metabolize other THMs such as bromodichloromethane [15]. Infante-Rivard [16] first reported that the CYP2E1 promoter variant C allele of the rs3813867 (G1259C) single nucleotide polymorphism in the genotype of newborns significantly increases the positive association of high levels of total THMs to which their mothers were exposed in drinking water during pregnancy with the risk that the newborn be small for gestational age (the modifying effect of the mother C allele was in the same direction but not statistically significant). They performed standard logistic regression of a case-mother-control-mother study of singleton births (430 cases and 412 controls), adjusted for potential confounding factors of the association between total THMs and risk of small-for-gestational-age babies but did not take into account the genotypes of the mother and newborn within the same regression model as was done in model (1).

Levallois et al. [9] attempted to replicate this modifying effect of the rs3813867 C allele in another case-mother-control-mother study. In their primary analysis, Levallois et al. [9] also performed logistic regression adjusting for potential confounding factors (fully adjusted model), but this time including both child and mother genotype in the same model for 1,427 mother-child pairs. That analysis provided no evidence of modifying effect of either the mother or child C allele. The C allele had low frequency in the study population (empirical frequency of 3% in the mothers of controls, similar to the population estimates from the prospective and retrospective likelihood analyses reported below), so that homozygous CC genotypes were absent in controls and their mothers, compatible with HWE expectations (exact test p = 1.0 in the controls, 0.64 in the mothers). The results are reproduced in Table 4, including the interaction OR and its 95% CI not reported previously. To overcome the inefficiency of logistic regression and exploit the information from mother-child pairs where child genotype was missing, the prospective likelihood (3) was applied, using a partially adjusted model containing only two major confounders (body mass index and parity) due to the limitation on the dimension of X for this approach. This partially adjusted analysis on an extended sample of 1,535 mother-child pairs after reintroducing pairs where the child genotype and other covariates were missing provided weak evidence of a modifying effect where high total THMs reduces the risk of being a case in the presence of a maternal C allele (interaction p = 0.09) while the null hypothesis that the child C allele had no modifying effect could not be rejected, contrary to the report by Infante-Rivard [16].

Table 4.

Adjusted odds ratios (95% confidence intervals) for exposure to total THM in the 4th quartile, in comparison to the first three quartiles, for newborn and maternal genotype of rs3813867 (G1259C) in CYP2E1

Method (N)rs3813867 GenotypePartial adjustmentaFull adjustmentb
Str ORc (95% CI)Inter ORd (95% CI)Inter PeStr OR (95% CI)Inter OR (95% CI)P Inter
Logistic regressionf Wild type 1.3 (1.0–1.8)   1.2 (0.8–1.8)   
(1,427) Newborn variant 1.8 (0.4–7.6) 1.3 (0.3–5.6) 0.66 3.7 (0.7–19.1) 2.0 (0.5–8.7) 0.36 
 Mother variant 0.5 (0.1–2.2) 0.4 (0.1–1.8) 0.21 0.4 (0.1–7.1) 0.4 (0.1–1.9) 0.23 
Prospective Wild type 1.3 (0.9–1.7)   NA   
likelihood Newborn variant 1.9 (0.4–7.7) 1.5 (0.4–6.2) 0.59 NA NA NA 
(1,528)g Mother variant 0.4 (0.1–1.5) 0.3 (0.1–1.2) 0.09 NA NA NA 
Retrospective Wild type 1.3 (0.9–1.7)   1.3 (0.9–1.7)   
likelihood - DAh Newborn variant 2.0 (0.4–9.8) 1.6 (0.4–6.2) 0.49 2.4 (0.5–12.3) 1.9 (0.5–7.5) 0.37 
(1,528) Mother variant 0.4 (0.1–1.4) 0.3 (0.1–1.2) 0.08 0.4 (0.1–1.5) 0.3 (0.1–1.1) 0.07 
Retrospective Wild type 1.3 (0.9–1.7)   1.3 (0.9–1.8)   
likelihood - UNi Newborn variant 2.0 (0.4–10.1) 1.6 (0.4–6.3) 0.49 2.4 (0.5–12.6) 1.9 (0.5–7.7) 0.37 
(1,528) Mother variant 0.4 (0.1–1.5) 0.3 (0.1–1.2) 0.08 0.4 (0.1–1.5) 0.3 (0.1–1.3) 0.10 
Method (N)rs3813867 GenotypePartial adjustmentaFull adjustmentb
Str ORc (95% CI)Inter ORd (95% CI)Inter PeStr OR (95% CI)Inter OR (95% CI)P Inter
Logistic regressionf Wild type 1.3 (1.0–1.8)   1.2 (0.8–1.8)   
(1,427) Newborn variant 1.8 (0.4–7.6) 1.3 (0.3–5.6) 0.66 3.7 (0.7–19.1) 2.0 (0.5–8.7) 0.36 
 Mother variant 0.5 (0.1–2.2) 0.4 (0.1–1.8) 0.21 0.4 (0.1–7.1) 0.4 (0.1–1.9) 0.23 
Prospective Wild type 1.3 (0.9–1.7)   NA   
likelihood Newborn variant 1.9 (0.4–7.7) 1.5 (0.4–6.2) 0.59 NA NA NA 
(1,528)g Mother variant 0.4 (0.1–1.5) 0.3 (0.1–1.2) 0.09 NA NA NA 
Retrospective Wild type 1.3 (0.9–1.7)   1.3 (0.9–1.7)   
likelihood - DAh Newborn variant 2.0 (0.4–9.8) 1.6 (0.4–6.2) 0.49 2.4 (0.5–12.3) 1.9 (0.5–7.5) 0.37 
(1,528) Mother variant 0.4 (0.1–1.4) 0.3 (0.1–1.2) 0.08 0.4 (0.1–1.5) 0.3 (0.1–1.1) 0.07 
Retrospective Wild type 1.3 (0.9–1.7)   1.3 (0.9–1.8)   
likelihood - UNi Newborn variant 2.0 (0.4–10.1) 1.6 (0.4–6.3) 0.49 2.4 (0.5–12.6) 1.9 (0.5–7.7) 0.37 
(1,528) Mother variant 0.4 (0.1–1.5) 0.3 (0.1–1.2) 0.08 0.4 (0.1–1.5) 0.3 (0.1–1.3) 0.10 

aOdds ratios adjusted for prepregnancy body mass index (BMI) and parity.

bOdds ratios adjusted for maternal age, maternal education, annual household income, prepregnancy BMI, parity, history of chronic disease, preeclampsia, active smoking during the third trimester, passive smoking throughout pregnancy, coffee and alcohol consumption during pregnancy and month of selection.

cStratum-specific odds ratio and 95% confidence interval.

dInteraction odds ratio and 95% confidence interval.

eInteraction p-value.

fFrom [9].

g7 mother-child pairs were removed compared to [9] due to missing values of covariates in the fully adjusted model

hDA: double-additive model of the maternal C allele with total THM dichotomized at the 4th quartile as predictor.

iUN, unspecified distribution for GM and X.

Now that the modified retrospective likelihood allowing for partially missing genotypes (5) is available, we estimated the partially adjusted and the fully adjusted model on a common dataset of 1,528 mother-child pairs with all covariates available (321 cases and 1,207 controls), comprising the 1,427 mother-child pairs with observed child genotype and the 101 with missing child genotype, using the CCMO.na and CCMO.dep functions (we did not reintroduce 7 mother-child pairs where the mother genotype was missing because such pairs could only be used under the double-additive model of the maternal genotype implemented in CCMO.dep). We prespecified total THMs as a predictor of the maternal C allele in the double-additive model, and there was indeed some evidence of association between high total THMs and the maternal C allele detectable in the control sample (OR = 1.7, exact 95% CI: [1.0, 2.8], Fisher’s exact test p = 0.04) as well as through a comparison of the prespecified model to the model assuming independence of the maternal C allele from all covariates (likelihood ratio ?2 with 1 d.f.: 5.40, p = 0.02). Analyses of a simulated dataset with the methods applied here under the partially adjusted model are illustrated in the online supplementary material.

For the partially adjusted model, there is essentially no difference between the results from the prospective and retrospective likelihoods (Table 4). Although the total THM-maternal genotype interaction OR estimate remained 0.3 at first decimal place across all retrospective likelihood analyses reported in Table 4, the coefficient estimate and its estimated standard error varied between analyses. Under the double-additive logistic model, the coefficient estimate was slightly lower for the fully adjusted model than for the partially adjusted model (-1.30 vs. -1.25) while the estimated standard error remained similar (0.73 vs. 0.72), resulting in a slightly smaller interaction p value for the fully adjusted model. Under the unspecified distribution of GM, X the same fully-to-partially adjusted model comparison revealed similar coefficients estimates (-1.26 vs. -1.25) but a larger estimated standard error (0.76 vs. 0.73) resulting in a slightly larger interaction p value for the fully adjusted model. More importantly, the fully adjusted logistic regression model produced an estimate of this coefficient (-0.87) much closer to 0 than the retrospective likelihood, with an estimated standard error (0.76) as large as under the retrospective likelihood with unspecified distribution of GM, X, so that the interaction p value was the largest among all reported tests of the null hypothesis that this coefficient equals 0. The similarity of coefficient estimates from the double-additive logistic model with total THMs as covariate with those obtained when the distribution of GM, X is left unspecified suggests this model is adequate. The interaction p value at 0.07 provides some evidence that the C allele in the mother leads to a negative association between total THMs and risk of small-for-gestational-age babies, in the opposite direction from the statistically nonsignificant increase for this association previously reported by [16]. Subsequent studies of the modifying effect of CYP2E1 polymorphisms on the association between disinfection by-products and small-for-gestational-age babies reported results that are congruent with either an increase [17] or a reduction [18] of risk by high chlorination by-product exposure in presence of the maternal C allele, but important methodological differences (SNP genotyped, exposure measured, participant characteristics, outcome definition) prevent combining estimates from all studies to reach a conclusion.

Over the last decade, semiparametric estimation methods have been developed for more efficient analysis of data from case-mother-control-mother designs. With the availability of software to fit them, they should be preferred to the inefficient standard logistic regression. We evaluated the performance of readily usable functions to fit these models in the R statistical software environment on simulated data, focusing on gene-environment interaction effects, maternal genotype-environmental exposure relationship, and missing child genotype.

The limitation of the stratified retrospective likelihood to a categorical exposure variable without adjustment for covariates leads us to discourage its application to case-mother-control-mother data. The modified retrospective likelihood implemented in the CCMO.na and CCMO.dep R functions is particularly appealing, as it can handle arbitrary covariate vectors in terms of number of variables and number of levels of the variables, as highlighted in the case study, and partially missing offspring genotypes, although large proportions of missing genotypes can induce bias, which can be substantial if child genotypes are missing not at random. The unspecified distribution of GM, X implemented in CCMO.na is recommended due to its robustness. However, when the double-additive logistic model for Pr(GM | X) holds (including the independence case), CCMO.dep provides a slight gain in efficiency. The prospective likelihood allowing for partially missing child genotype remains an option for low-dimensional covariate vectors and is implemented in the R package SPmlficmcm offering the familiar R model specification.

Although we used a single regression model, the effect sizes in this model were realistic, multiplicative genotype-exposure interactions were either present or absent, and the limited complexity was designed to suit all approaches reviewed. Further evaluation of the modified retrospective likelihood could examine more complex models including multiple continuous covariates and model misspecification, e.g., missing interaction terms, as the flexibility of this approach allows to entertain complex models.

We would like to thank co-investigators, trainees, and research assistants from Centre de recherche du CHU de Québec and Université Laval involved in the small-for-gestational-age case-mother-control-mother study of drinking water disinfection by-products and genetic factors: Molière Nguile-Makao, Manuel Rodriguez, Céline Campagna, Robert Tardif, Nathalie Bernard, Sylvie Desjardins, Catherine Gonthier, Christelle Legay, and Sylvie Marcoux.

The access to the birth certificates for the selection of cases and controls was authorized by the Commission d’accès à l’information of Québec. In the first study stage, participants provided written informed consent for collection of exposure and sociodemographic variables and, in the second study stage, reconsented to provide their own blood or saliva sample as well as cord blood or saliva sample from their child for genetic analyses. This study was approved by the CHU de Québec Ethics Review Board (project 2012-554) and conforms to the Declaration of Helsinki.

The authors have no conflicts of interest to declare.

This work was supported by the National Natural Science Foundation of China [Grant No. 12171451, 72091212, 11771096], the National Institutes of Health of the USA [Grant No. R01ES016626], and the Canadian Institutes for Health Research [Grant No. MOP 102708]. The funders were not involved in study design; collection, analysis, and interpretation of data; writing of the report; and imposed no restrictions regarding the submission of the report for publication.

Alexandre Bureau planned the study, conducted the simulations and the case study, and drafted the manuscript. Yuang Tian programmed the CCMO.dep and CCMO.na functions, wrote part of the simulation code, and formulated suggestions integrated in the manuscript text. Patrick Levallois and Yves Giguère conceived and oversaw the original study of small-for-gestational-age babies on which the case study is based. Jinbo Chen and Hong Zhang formulated suggestions integrated in the manuscript text. All authors revised the manuscript and approved the final version.

The R packages SPmlficmcm and Haplin are available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=SPmlficmcm and http://CRAN.R-project.org/package=Haplin. The functions CCMO.na and CCMO.dep are available on GitHub at https://github.com/yatian20/CCMO.na and https://github.com/yatian20/CCMO.dep, respectively. Scripts in R code to reproduce the simulation study reported in Tables 2 and 3 and the empirical distribution of a subset of variables from the case study data along with code to simulate a dataset similar to the actual one are available at https://github.com/abureau/CMCMsim. The data of the case study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.

1.
Infante-Rivard
C
.
Studying genetic predisposition among small-for-gestational-age newborns
.
Semin Perinatol
.
2007
;
31
(
4
):
213
8
.
2.
Shi
M
,
Umbach
DM
,
Vermeulen
SH
,
Weinberg
CR
.
Making the most of case-mother/control-mother studies
.
Am J Epidemiol
.
2008
;
168
(
5
):
541
7
.
3.
Chen
J
,
Lin
D
,
Hochner
H
.
Semiparametric maximum likelihood methods for analyzing genetic and environmental effects with case-control mother-child pair data
.
Biometrics
.
2012
;
68
(
3
):
869
77
.
4.
Prentice
R
,
Pyke
R
.
Logistic disease incidence models and case-control studies
.
Biometrika
.
1979
;
66
(
3
):
403
11
.
5.
Gjerdevik
M
,
Haaland
ØA
,
Romanowska
J
,
Lie
RT
,
Jugessur
A
,
Gjessing
HK
.
Parent-of-origin-environment interactions in case-parent triads with or without independent controls
.
Ann Hum Genet
.
2018
;
82
(
2
):
60
73
.
6.
Zhang
H
,
Mukherjee
B
,
Arthur
V
,
Hu
G
,
Hochner
H
,
Chen
J
.
An efficient and computationally robust statistical method for analyzing case-control mother-offspring pair genetic association studies
.
Ann Appl Stat
.
2020
;
14
(
2
):
560
84
.
7.
Nguile-Makao
M
,
Bureau
A
.
Semi-parametric maximum likelihood method for interaction in case-mother control-mother designs: package SPmlficmcm
.
J Stat Softw
.
2015
;
68
(
10
):
1
17
.
8.
Tian
Y
,
Zhang
H
,
Bureau
A
,
Hochner
H
,
Chen
J
Efficient inference of parental origin effects using case-control mother-child geneotype data
arXiv
2022
. p.
2208
.
9.
Levallois
P
,
Giguère
Y
,
Nguile-Makao
M
,
Rodriguez
M
,
Campagna
C
,
Tardif
R
.
Disinfection by-products exposure and intra-uterine growth restriction: do genetic polymorphisms of CYP2E1or deletion of GSTM1 or GSTT1 modify the association
.
Environ Int
.
2016
92-93
220
31
.
10.
Howey
R
,
Cordell
HJ
.
PREMIM and EMIM: tools for estimation of maternal, imprinting and interaction effects using multinomial modelling
.
BMC Bioinf
.
2012
;
13
:
149
.
11.
van Den Oord
EJ
,
Vermunt
JK
.
Testing for linkage disequilibrium, maternal effects, and imprinting with (In)complete case-parent triads, by use of the computer program LEM
.
Am J Hum Genet
.
2000
;
66
(
1
):
335
8
.
12.
Ray
D
,
Vergara
C
,
Taub
MA
,
Wojcik
G
,
Ladd-Acosta
C
,
Beaty
TH
.
Benchmarking statistical methods for analyzing parent-child dyads in genetic association studies
.
Genet Epidemiol
.
2022
46
5–6
266
84
.
13.
Meek
ME
,
Beauchamp
R
,
Long
G
,
Moir
D
,
Turner
L
,
Walker
M
.
Chloroform: exposure estimation, hazard characterization, and exposure-response analysis
.
J Toxicol Environ Health B Crit Rev
.
2002
;
5
(
3
):
283
334
.
14.
Gemma
S
,
Vittozzi
L
,
Testai
E
.
Metabolism of chloroform in the human liver and identification of the competent P450s
.
Drug Metab Dispos
.
2003
;
31
(
3
):
266
74
.
15.
Riederer
AM
,
Dhingra
R
,
Blount
BC
,
Steenland
K
.
Predictors of blood trihalomethane concentrations in NHANES 1999-2006
.
Environ Health Perspect
.
2014
;
122
(
7
):
695
702
.
16.
Infante-Rivard
C
.
Drinking water contaminants, gene polymorphisms, and fetal growth
.
Environ Health Perspect
.
2004
;
112
(
11
):
1213
6
.
17.
Yang
P
,
Cao
WC
,
Zhou
B
,
Zheng
TZ
,
Deng
YL
,
Luo
Q
.
Urinary biomarker of prenatal exposure to disinfection byproducts, maternal genetic polymorphisms in CYP2E1 and GSTZ1, and birth outcomes
.
Environ Sci Technol
.
2019
;
53
(
20
):
12026
34
.
18.
Kogevinas
M
,
Bustamante
M
,
Gracia-Lavedán
E
,
Ballester
F
,
Cordier
S
,
Costet
N
.
Drinking water disinfection by-products, genetic polymorphisms, and birth outcomes in a European mother-child cohort study
.
Epidemiology
.
2016
;
27
(
6
):
903
11
.