## Abstract

** Introduction:** The case-mother-control-mother design allows to study fetal and maternal genetic factors together with environmental exposures on early life outcomes. Mendelian constraints and conditional independence between child genotype and environmental factors enabled semiparametric likelihood methods to estimate logistic models with greater efficiency than standard logistic regression. Difficulties in child genotype collection require methods handling missing child genotype.

**We review a stratified retrospective likelihood and two semiparametric likelihood approaches: a prospective one and a modified retrospective one, the latter either modeling the maternal genotype as a function of covariates or leaving their joint distribution unspecified (robust version). We also review software implementing these modeling alternatives, compare their statistical properties in a simulation study, and illustrate their application, focusing on gene-environment interactions and partially missing child genotype.**

*Methods:***The robust retrospective likelihood provides generally unbiased estimates, with standard errors only slightly larger than when modeling maternal genotype based on exposure. The prospective likelihood encounters maximization problems. In the application to the association of small-for-gestational-age babies with**

*Results:**C*YP2E1 and drinking water disinfection by-products, the retrospective likelihood allowed a full array of covariates, while the prospective likelihood was limited to few covariates.

**We recommend the robust version of the modified retrospective likelihood.**

*Conclusion:*## Introduction

Obstetric and early life outcomes are influenced by the genotype of both mother and fetus, as well as factors in the mother’s environment [1]. The study design where newborns with an adverse outcome and a control sample of newborns without the same adverse outcome together with their mothers are recruited is commonly used to study these diverse influences. We call this a case-mother-control-mother design following Shi et al. [2] and it is also known as a case-control mother-child pair design [3]. An early important methodological recommendation for the analysis of data from such design was to simultaneously include the newborn and the maternal genotypes in the same regression model to mutually adjust for the confounding effect of one on the other [2]. While logistic regression can be applied as for the analysis of standard case-control data ([4], which we refer to as standard logistic regression), efficiency gains can be obtained by exploiting constraints and reasonable assumptions on the relationships between genotypic and environmental variables arising from the mother-child relationship. First, Mendelian inheritance results in dependence between the mother and child genotypes which can be leveraged through a log-linear model [2], an implementation of which allowing for stratification on an environmental factor was described by Gjerdevik et al. [5]. Second, while environmental factors acting on the pregnant mother can depend on her genotype, it is reasonable to assume these factors do not depend on the father genotype, and thus the child genotype can be modeled as conditionally independent of environmental factors given his or her mother genotype (since the child genotype is a random draw of one allele from the mother and one from the father [3]). This led to the development of semiparametric likelihood methods to estimate logistic models of genotypic and environmental predictors taking advantage of all these constraints and assumptions [3, 6].

The child-mother-control-mother design requires genotyping of both the mother and her newborn child. Obtaining child genotypes is hampered by the reluctance of parents to consent to collect biological samples (blood or saliva) from their children and the lack of cooperation from young children in providing such biological samples. This led Makao and Bureau [7] to extend the likelihood of Chen et al. [3] to handle missing genotypes for a subset of the children. The recent extension [8] of the likelihood of Zhang et al. [6] also deals with missing genotypes on a subset of children as a special case. Missing maternal genotypes are most often due to technical failures (as are some missing child genotypes). Careful biological sample collection and analysis limits such missing genotypes to a small fraction of study subjects, such that their exclusion has negligible impact on bias and precision of the estimates. Thus, we focus our method evaluation on treatment of missing child genotypes, simply noting which approach can handle missing maternal genotypes.

The objectives of this article are (1) to review methods available to analyze genetic and environmental factors in the case-mother-control-mother design allowing for partially missing-at-random child genotype, including explicitly developing the adaptation of the likelihood of [8] to partially missing child genotype data, (2) to review software implenting these methods, to evaluate their performance and illustrate their application with a focus on gene-environment interaction effects, maternal genotype-environmental exposure relationship, and missing child genotype on simulated data and on actual data from a study of the modifying effect of a CYP2E1 gene polymorphism on the association between drinking water disinfection by-products and small-for-gestational-age babies [9], and 3) to formulate recommendations on the choice of a method.

## Models and Methods

### Disease Risk Model and Data

*f*= Pr(

*Y*= 1) being the disease prevalence in the population,

*G*

^{M}and

*G*

^{C}the respective maternal and offspring genotype, and

*X*the vector of environmental variables collected from the mother. We assume

*G*

^{M}and

*G*

^{C}represent the number of minor alleles of a biallelic polymorphism, and thus take values in $G=0,1,2$. The observed values of

*X*are indexed by

*j*. Unless noted otherwise,

*X*contains an arbitrary number

*p*of discrete and/or continuous variables, such that the number of observed values of

*X*grows to infinity with the sample size. Assuming

*X*

_{1}represents an exposure of interest whose effect on disease risk can be modified by either or both of

*G*

^{M}and

*G*

^{C}, and

*X*

_{2},

*…*,

*X*

_{p}additional covariates (e.g., confounders), a typical logistic regression model for

*P*

_{1jmc}(ß) = Pr(

*Y*= 1|

*X*=

*x*

_{j},

*G*

^{M}=

*m*,

*G*

^{C}=

*c*) would be:

Note that the product terms for the maternal and child genotypes could involve different covariates.

Data on (*Y*, *X*, *G*^{M}, *G*^{C}) are collected from *n*_{1} case mother pairs and *n*_{0} control mother pairs, which are sampled from *N*_{1} (*N*_{1} > *n*_{1}) case pairs and *N*_{0} (*N*_{0} > *n*_{0}) non-case pairs in a population. Let *n*_{ijmc} denote the total number of pairs in the case-control sample with *Y* = *i*, *X* = *x*_{j}, *G*^{M} = *m*, and *G*^{C} = *c*.

### Prospective Likelihood

*X*=

*x*

_{j}|

*G*

^{M}=

*m*,

*G*

^{C}=

*c*) = Pr(

*X*=

*x*

_{j}|

*G*

^{M}=

*m*) =

*d*

_{jm}. We let

*P*

_{m}(

*?*) = Pr(

*G*

^{M}=

*m*;

*?*) represent the mother genotype distribution,

*P*

_{c|m}(

*?*) = Pr(

*G*

^{C}=

*c*|

*G*

^{M}=

*m*;

*?*) the conditional offspring genotype distribution given maternal genotype, and

*?*the log odds of the minor allele frequency of the polymorphism which suffices to specify both genotype distributions under random mating, Hardy Weinberg equilibrium (HWE) and Mendelian inheritance. The likelihood is then:

Note that Pr(*Y*) involves a summation over child genotype, mother genotype, and covariate distribution, which can be infinite-dimensional, and that the numbers of cases and non-cases in the population are required. When these numbers are unavailable, an external estimate of the disease prevalence *f* can be used to specify large values respecting $N1=f1-fN0$[7]. A profile likelihood is obtained by maximizing (2) over {*d*_{jm}} under the constraint $?jdjm=1?m?G$ using a one-step estimation method [3].

*L*

^{p}:

Despite the profiling out of {*d*_{jm}} from the likelihood, experience has shown that maximization of the likelihood is successful only for low dimensions of *X*, due to saddle points in the likelihood surface. The prospective likelihood is implemented in the R package SPmlficmcm [7].

### Stratified Retrospective Likelihood

*G*

^{M},

*G*

^{C}) given the disease status

*Y*. Stratification on a discrete covariate

*X*leads to the stratified retrospective likelihood

*P*

_{mc|ij}(?, µ) = Pr(

*G*

^{M}=

*m*,

*G*

^{C}=

*c*|

*Y*=

*i*,

*X*=

*j*), a Poisson distribution. This likelihood does not depend on numbers of cases and non-cases in the population but assumes the disease is rare, such that the control sample is representative of the total population in each stratum. Here, we use

*µ*to represent mating type frequency parameters and ? to designate regression parameters under a log-linear model since the obligation to obtain stratum-specific regression coefficients does not generally allow to specify the disease risk model as a function of the ß coefficients of model (1). Enforcing Mendelian inheritance constraints lead to the distributions in Tables 1 and 2 of Shi et al. [2] for controls and cases, respectively. Adaptation to missing child genotype is straightforward by summation of

*P*

_{mc|ij}(

*?*,

*µ*) over the child genotype:

Method . | Missing genotype^{a}
. | Covariates in model . | Continuous exposure and covariates . | External prevalence . | |
---|---|---|---|---|---|

child . | mother . | ||||

Prospective likelihood | X | X | X^{b} | X^{c} | |

Stratified retrospective likelihood | X | X | |||

Retrospective likelihood, unspecified distribution of G^{M}, X | X | X | X | X | |

Retrospective likelihood, double-additive model for Pr(G^{M} |X) | X | X | X | X | X |

Logistic regression | X | X |

Method . | Missing genotype^{a}
. | Covariates in model . | Continuous exposure and covariates . | External prevalence . | |
---|---|---|---|---|---|

child . | mother . | ||||

Prospective likelihood | X | X | X^{b} | X^{c} | |

Stratified retrospective likelihood | X | X | |||

Retrospective likelihood, unspecified distribution of G^{M}, X | X | X | X | X | |

Retrospective likelihood, double-additive model for Pr(G^{M} |X) | X | X | X | X | X |

Logistic regression | X | X |

^{a}Treatment of missing genotypes to extract information from other variables.

^{b}In practice, leads to convergence problems due to large number of distinct values.

^{c}Numbers of cases and non-cases in population are required but can be approximated from external prevalence.

Effects . | Independence^{a}
. | Dependence^{b}
. | ||||||
---|---|---|---|---|---|---|---|---|

bias . | SE^{c}_{emp}
. | SE^{d}
. | CP^{e}
. | bias . | SE_{emp}
. | SE . | CP . | |

Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) | ||||||||

$\xdfGM$ | 0.006 | 0.086 | 0.091 | 0.966 | 0.005 | 0.098 | 0.099 | 0.942 |

$\xdfGC$ | -0.008 | 0.085 | 0.083 | 0.948 | -0.003 | 0.086 | 0.088 | 0.956 |

$\xdfX$ | 0.001 | 0.064 | 0.065 | 0.962 | 0.005 | 0.067 | 0.067 | 0.952 |

$\xdfX\xd7GM$ | 0.002 | 0.086 | 0.086 | 0.946 | -0.004 | 0.088 | 0.086 | 0.940 |

$\xdfX\xd7GC$ | -0.002 | 0.060 | 0.061 | 0.956 | 0.003 | 0.061 | 0.060 | 0.954 |

Retrospective likelihood, double-additive model for Pr(G^{M}| X) (function CCMO.dep) | ||||||||

$\xdfGM$ | 0.002 | 0.070 | 0.07 | 0.940 | -0.046 | 0.077 | 0.077 | 0.918 |

$\xdfGC$ | -0.007 | 0.073 | 0.072 | 0.950 | 0.022 | 0.079 | 0.078 | 0.940 |

$\xdfX$ | 0.001 | 0.063 | 0.064 | 0.968 | 0.013 | 0.066 | 0.065 | 0.942 |

$\xdfX\xd7GM$ | 0.002 | 0.081 | 0.082 | 0.956 | -0.007 | 0.082 | 0.080 | 0.930 |

$\xdfX\xd7GC$ | -0.002 | 0.060 | 0.060 | 0.952 | 0.005 | 0.062 | 0.060 | 0.954 |

Prospective likelihood (function Spmlficmcm)^{f} | ||||||||

$\xdfGM$ | 0.009 | 0.071 | 0.070 | 0.947 | 0.131 | 2.569 | 0.080 | 0.834 |

$\xdfGC$ | -0.010 | 0.076 | 0.071 | 0.942 | -0.059 | 0.839 | 0.076 | 0.934 |

$\xdfX$ | 0.009 | 0.069 | 0.064 | 0.945 | 0.012 | 0.192 | 0.065 | 0.904 |

$\xdfX\xd7GM$ | -0.051 | 0.117 | 0.079 | 0.781 | -0.107 | 1.028 | 0.075 | 0.681 |

$\xdfX\xd7GC$ | -0.006 | 0.06 | 0.060 | 0.950 | 0.028 | 0.411 | 0.06 | 0.936 |

Logistic regression (function glm) | ||||||||

$\xdfGM$ | 0.009 | 0.09 | 0.095 | 0.964 | 0.006 | 0.103 | 0.103 | 0.950 |

$\xdfGC$ | -0.010 | 0.096 | 0.095 | 0.952 | -0.004 | 0.098 | 0.100 | 0.954 |

$\xdfX$ | 0.001 | 0.070 | 0.069 | 0.952 | 0.006 | 0.070 | 0.071 | 0.942 |

$\xdfX\xd7GM$ | 0.002 | 0.099 | 0.0950 | 0.942 | -0.005 | 0.095 | 0.095 | 0.952 |

$\xdfX\xd7GC$ | -0.001 | 0.099 | 0.0960 | 0.942 | 0.004 | 0.092 | 0.096 | 0.964 |

Stratified retrospective likelihood (function haplinStrat) | ||||||||

$\xdfGM-2\xdfX\xd7GM$ | -0.019 | 0.433 | 0.406 | 0.956 | -0.552 | 0.571 | 0.549 | 0.874 |

$\xdfGC-2\xdfX\xd7GC$ | -0.019 | 0.459 | 0.421 | 0.942 | 0.344 | 0.536 | 0.505 | 0.852 |

$\xdfGM-\xdfX\xd7GM$ | -0.011 | 0.162 | 0.168 | 0.970 | -0.251 | 0.207 | 0.201 | 0.778 |

$\xdfGC-\xdfX\xd7GC$ | 0.000 | 0.175 | 0.172 | 0.946 | 0.172 | 0.205 | 0.200 | 0.848 |

$\xdfGM$ | 0.004 | 0.107 | 0.108 | 0.952 | -0.044 | 0.114 | 0.117 | 0.938 |

$\xdfGC$ | -0.011 | 0.106 | 0.110 | 0.962 | 0.020 | 0.121 | 0.119 | 0.938 |

$\xdfGM+\xdfX\xd7GM$ | -0.001 | 0.112 | 0.113 | 0.962 | 0.126 | 0.101 | 0.109 | 0.788 |

$\xdfGC+\xdfX\xd7GC$ | -0.012 | 0.116 | 0.114 | 0.954 | -0.135 | 0.110 | 0.111 | 0.778 |

$\xdfGM+2\xdfX\xd7GM$ | -0.015 | 0.195 | 0.182 | 0.948 | 0.339 | 0.155 | 0.155 | 0.400 |

$\xdfGC+2\xdfX\xd7GC$ | -0.019 | 0.184 | 0.183 | 0.954 | -0.318 | 0.159 | 0.157 | 0.502 |

Effects . | Independence^{a}
. | Dependence^{b}
. | ||||||
---|---|---|---|---|---|---|---|---|

bias . | SE^{c}_{emp}
. | SE^{d}
. | CP^{e}
. | bias . | SE_{emp}
. | SE . | CP . | |

Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) | ||||||||

$\xdfGM$ | 0.006 | 0.086 | 0.091 | 0.966 | 0.005 | 0.098 | 0.099 | 0.942 |

$\xdfGC$ | -0.008 | 0.085 | 0.083 | 0.948 | -0.003 | 0.086 | 0.088 | 0.956 |

$\xdfX$ | 0.001 | 0.064 | 0.065 | 0.962 | 0.005 | 0.067 | 0.067 | 0.952 |

$\xdfX\xd7GM$ | 0.002 | 0.086 | 0.086 | 0.946 | -0.004 | 0.088 | 0.086 | 0.940 |

$\xdfX\xd7GC$ | -0.002 | 0.060 | 0.061 | 0.956 | 0.003 | 0.061 | 0.060 | 0.954 |

Retrospective likelihood, double-additive model for Pr(G^{M}| X) (function CCMO.dep) | ||||||||

$\xdfGM$ | 0.002 | 0.070 | 0.07 | 0.940 | -0.046 | 0.077 | 0.077 | 0.918 |

$\xdfGC$ | -0.007 | 0.073 | 0.072 | 0.950 | 0.022 | 0.079 | 0.078 | 0.940 |

$\xdfX$ | 0.001 | 0.063 | 0.064 | 0.968 | 0.013 | 0.066 | 0.065 | 0.942 |

$\xdfX\xd7GM$ | 0.002 | 0.081 | 0.082 | 0.956 | -0.007 | 0.082 | 0.080 | 0.930 |

$\xdfX\xd7GC$ | -0.002 | 0.060 | 0.060 | 0.952 | 0.005 | 0.062 | 0.060 | 0.954 |

Prospective likelihood (function Spmlficmcm)^{f} | ||||||||

$\xdfGM$ | 0.009 | 0.071 | 0.070 | 0.947 | 0.131 | 2.569 | 0.080 | 0.834 |

$\xdfGC$ | -0.010 | 0.076 | 0.071 | 0.942 | -0.059 | 0.839 | 0.076 | 0.934 |

$\xdfX$ | 0.009 | 0.069 | 0.064 | 0.945 | 0.012 | 0.192 | 0.065 | 0.904 |

$\xdfX\xd7GM$ | -0.051 | 0.117 | 0.079 | 0.781 | -0.107 | 1.028 | 0.075 | 0.681 |

$\xdfX\xd7GC$ | -0.006 | 0.06 | 0.060 | 0.950 | 0.028 | 0.411 | 0.06 | 0.936 |

Logistic regression (function glm) | ||||||||

$\xdfGM$ | 0.009 | 0.09 | 0.095 | 0.964 | 0.006 | 0.103 | 0.103 | 0.950 |

$\xdfGC$ | -0.010 | 0.096 | 0.095 | 0.952 | -0.004 | 0.098 | 0.100 | 0.954 |

$\xdfX$ | 0.001 | 0.070 | 0.069 | 0.952 | 0.006 | 0.070 | 0.071 | 0.942 |

$\xdfX\xd7GM$ | 0.002 | 0.099 | 0.0950 | 0.942 | -0.005 | 0.095 | 0.095 | 0.952 |

$\xdfX\xd7GC$ | -0.001 | 0.099 | 0.0960 | 0.942 | 0.004 | 0.092 | 0.096 | 0.964 |

Stratified retrospective likelihood (function haplinStrat) | ||||||||

$\xdfGM-2\xdfX\xd7GM$ | -0.019 | 0.433 | 0.406 | 0.956 | -0.552 | 0.571 | 0.549 | 0.874 |

$\xdfGC-2\xdfX\xd7GC$ | -0.019 | 0.459 | 0.421 | 0.942 | 0.344 | 0.536 | 0.505 | 0.852 |

$\xdfGM-\xdfX\xd7GM$ | -0.011 | 0.162 | 0.168 | 0.970 | -0.251 | 0.207 | 0.201 | 0.778 |

$\xdfGC-\xdfX\xd7GC$ | 0.000 | 0.175 | 0.172 | 0.946 | 0.172 | 0.205 | 0.200 | 0.848 |

$\xdfGM$ | 0.004 | 0.107 | 0.108 | 0.952 | -0.044 | 0.114 | 0.117 | 0.938 |

$\xdfGC$ | -0.011 | 0.106 | 0.110 | 0.962 | 0.020 | 0.121 | 0.119 | 0.938 |

$\xdfGM+\xdfX\xd7GM$ | -0.001 | 0.112 | 0.113 | 0.962 | 0.126 | 0.101 | 0.109 | 0.788 |

$\xdfGC+\xdfX\xd7GC$ | -0.012 | 0.116 | 0.114 | 0.954 | -0.135 | 0.110 | 0.111 | 0.778 |

$\xdfGM+2\xdfX\xd7GM$ | -0.015 | 0.195 | 0.182 | 0.948 | 0.339 | 0.155 | 0.155 | 0.400 |

$\xdfGC+2\xdfX\xd7GC$ | -0.019 | 0.184 | 0.183 | 0.954 | -0.318 | 0.159 | 0.157 | 0.502 |

^{a}Independence between environmental exposure *X*_{1} and maternal genotype *G*^{M}.

^{b}Linear dependence between environmental exposure *X*_{1} and maternal genotype *G*^{M} (see text).

^{c}Empirical standard error.

^{d}Mean estimated standard error.

^{e}95% confidence interval coverage.

^{f}Likelihood maximization failed for 17.2% of replicates under dependence and 20.6% under independence.

^{g}?_{1G} = ß_{G}*-* 2*ß*_{X×G}, ?_{2G} = ß_{G}*-* ß_{X×G}, ?_{3G} = ß_{G}, ?_{4G} = ß_{G} + ß_{X×G} and ?_{5G} = ß_{G} + 2ß_{X×G} where *G* can be either *G*^{M} or *G*^{C}.

The R package Haplin [5] offers a readily usable implementation of the stratified retrospective likelihood. In this package, mating type frequencies are parameterized with two µ parameters per stratum representing stratum-specific log-ratios over a reference mating type frequency captured by stratum-specific intercepts in ?. Alternative software packages implement this retrospective likelihood with a single stratum: EMIM [10] offers a numerically efficient implementation and a variety of mating type frequency parameterizations with between one and nine parameters per stratum, while the LEM package [11] presents difficulties to run in current computing environments [12]. EMIM and LEM are not as convenient as Haplin for stratifying on a categorical covariate, as scripts to perform the stratification and a test of trend of the estimates across strata would have to be written. We did not undertake this task.

### Retrospective Modified Likelihood

*X*=

*x*

_{j}

*G*

^{M}=

*m*,

*G*

^{C}=

*c*) = Pr(

*G*

^{C}=

*c*|

*G*

^{M}=

*m*)?Pr(

*G*

^{M}=

*m*|

*X*=

*x*

_{j})?Pr(

*X*=

*x*

_{j}). A logistic model was introduced for the conditional distribution of the maternal genotype given the covariates called “double-additive” because it is additive in the number of maternal minor alleles

*G*

^{M}and covariates

*X*[6]:

_{m}(?) represents the frequency of genotype

*m*and

*?*is a vector of regression parameters.

*G*

^{M},

*X*) = Pr(

*G*

^{M}|

*X*)?Pr(

*X*) is left unspecified (thereafter called unspecified distribution of

*G*

^{M},

*X*), the child genotype can be missing, a distinction is made between the paternal and maternal origin of the alleles forming the child genotype, and the genotype of other polymorphisms in linkage disequilibrium with the target polymorphism can be included to improve inference of parental origin and missing genotypes. We write the retrospective likelihood assuming the possibly infinite-dimensional distribution of

*G*

^{M},

*X*: p

_{jm}= Pr(

*G*

^{M}=

*m*,

*X*=

*x*

_{j}), is left unspecified:

An alternative would be to use the double-additive model for Pr(*G*^{M}| *X*) and leave Pr(*X*) unspecified. In either case, HWE can be assumed to simplify the distribution of *G*^{M} to allele frequencies.

When the number of cases and non-cases in the population is known, *f* can be estimated as $f^=N1N0+N1$, or *f* can be obtained from external prevalence studies, which allows implementing the approach in retrospective sampling designs where the numbers of cases and controls are unknown. Although the uncertainty in the estimation of *f* should be accounted for in the variance of the regression coefficient estimates, simulations studies reported by [6] have shown that variance estimates neglecting this uncertainty are accurate when the population size *N*_{0} + *N*_{1} is large. Zhang et al. [6] profiled out p_{j} = Pr(*X* = *x*_{j}) while [8] profiled out p_{jm} from the likelihood, and modified the resulting profile likelihood by replacing by its limiting value the Lagrange multiplier enforcing the constraint that the disease prevalence equals *f*. They then showed that maximizing the modified profile likelihood avoided the saddle point problem and led to stable estimates with the same efficiency as the maximum likelihood estimates.

Tian et al.’s [8] expectation-maximization algorithm maximizes (5) as a special case of their more general likelihood. The R functions CCMO.na and CCMO.dep implement the retrospective modified likelihood with unspecified distribution of *G*^{M}, *X* and double-additive model for Pr(*G*^{M}| *X*), respectively. The data characteristics and requirements of the compared approaches are summarized in Table 1 and the software packages and functions implementing them are described in the online supplementary material (for all online suppl. material, see https://doi.org/10.1159/000529559), with a focus on the options required for analyzing case-mother-control-mother data.

## Results

### Evaluation on Simulated Data

*Y*was generated from model (1) and the environmental exposure

*X*

_{1}was related to the maternal genotype

*G*

^{M}through the linear model

*?*= log(1.5) for the dependence case and

*?*= 0 for the independence case, and

*e*has a standard normal distribution. We selected a model with a categorical environmental exposure

*X*

_{1}and no additional covariate (i.e.,

*p*= 1) so that all reviewed methods could be applied. To obtain a categorical exposure with 5 levels, the values of

*X*

_{1}were rounded to the nearest integer and values were floored to -2 and capped to 2. This leads to odds ratios (ORs) between consecutive values of

*X*

_{1}and

*G*

^{M}= 1 versus

*G*

^{M}= 0 approximately equal to 1.5, similar to what we observe in the case study. The effects of the model terms are set to realistic values presented in Tables 2 and 3.

Effects . | Independence^{b}
. | Dependence^{c}
. | ||||
---|---|---|---|---|---|---|

0 . | log(1.2)
. | 0 . | log(1.2) . | |||

Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) | ||||||

X_{1}× G^{M} | 6.0 | 56.2 | 4.8 | 56.0 | ||

X_{1}× G^{C} | 5.2 | 85.0 | 6.0 | 84.6 | ||

Retrospective likelihood, double-additive model for Pr(GM | X) (function CCMO.dep) | ||||||

X_{1}× G^{M} | 6.2 | 61.2 | 5.2 | 61.2 | ||

X_{1}× G^{C} | 4.8 | 84.8 | 6.0 | 85.0 | ||

Prospective likelihood (function Spmlficmcm)^{e} | ||||||

X_{1}× G^{M} | 6.2 | 40.8 | 7.0 | 38.8 | ||

X_{1}× G^{C} | 5.4 | 84.1 | 6.0 | 82.4 | ||

Logistic regression (function glm) | ||||||

X_{1}× G^{M} | 6.2 | 49.0 | 6.0 | 44.8 | ||

X_{1}× G^{C} | 5.0 | 49.2 | 4.4 | 48.8 | ||

Stratified retrospective likelihood (function gxe from Haplin) | ||||||

X_{1}× G^{M} | 3.6 | 52.6 | 71.8 | –^{f} | ||

X_{1}× G^{C} | 5.0 | 49.4 | 53.2 | – |

Effects . | Independence^{b}
. | Dependence^{c}
. | ||||
---|---|---|---|---|---|---|

0 . | log(1.2)
. | 0 . | log(1.2) . | |||

Retrospective likelihood, unspecified distribution of GM, X (function CCMO.na) | ||||||

X_{1}× G^{M} | 6.0 | 56.2 | 4.8 | 56.0 | ||

X_{1}× G^{C} | 5.2 | 85.0 | 6.0 | 84.6 | ||

Retrospective likelihood, double-additive model for Pr(GM | X) (function CCMO.dep) | ||||||

X_{1}× G^{M} | 6.2 | 61.2 | 5.2 | 61.2 | ||

X_{1}× G^{C} | 4.8 | 84.8 | 6.0 | 85.0 | ||

Prospective likelihood (function Spmlficmcm)^{e} | ||||||

X_{1}× G^{M} | 6.2 | 40.8 | 7.0 | 38.8 | ||

X_{1}× G^{C} | 5.4 | 84.1 | 6.0 | 82.4 | ||

Logistic regression (function glm) | ||||||

X_{1}× G^{M} | 6.2 | 49.0 | 6.0 | 44.8 | ||

X_{1}× G^{C} | 5.0 | 49.2 | 4.4 | 48.8 | ||

Stratified retrospective likelihood (function gxe from Haplin) | ||||||

X_{1}× G^{M} | 3.6 | 52.6 | 71.8 | –^{f} | ||

X_{1}× G^{C} | 5.0 | 49.4 | 53.2 | – |

^{a}Wald tests of the interaction terms except for the stratified retrospective likelihood where a trend test is performed. Other model coefficients were set to ß_{1} = log(1.8), ß_{2} = log(1.5) and ß_{3} = log(1.8). Type I error level was set to *a* = 0.05.

5% of the offspring genotypes were set to missing completely at random.

^{b}Independence between environmental exposure *X*_{1} and maternal genotype *G*^{M}.

^{c}Linear dependence between environmental exposure *X*_{1} and maternal genotype *G*^{M}.

^{d}Percentage of child genotypes missing completely at random.

^{e}Proportions of replicates where likelihood maximization failed were 3.2% and 17.2% for the two columns under dependance and 3*.*4% and 20*.*6% for the two columns under independance.

^{f}Power not displayed because of inflated Type I error.

Other simulation parameters were set as follows. We assumed a single homogeneous ancestral population: we set the minor allele frequency of the biallelic polymorphism to be 0.2 and adjusted the intercept parameter in the disease risk model so that the population prevalence *f* was equal to 0.01. At each of 500 replicates, a population of 200,000 mother-offspring pairs was simulated, and a sample of 1,000 case-mother and 1,000 control-mother pairs was drawn with replacement from the population. The correct value *f* = 0.01 was used for the retrospective likelihood and the actual number of cases and non-cases in the population *N*_{0} and *N*_{1} were used for the prospective likelihood. We set to missing the genotype of 5% or 20% of the children either chosen randomly or with a probability depending on the child genotype. Such a scenario where the child genotype is missing not at random could occur if the genotype impacted the health status of the child which in turn could impact willingness to provide a biological sample.

Results of the simulation with 5% of child genotypes missing completely at random (MCAR) are reported in Table 2 and Figures 1 and 2. For Haplin, we report results for linear combinations of the ß coefficient which are estimated in each stratum (? coefficients in *P*_{mc|ij} [?, µ]). For the independence case, bias was minimal for all methods. For the dependence case, the retrospective likelihood with the double-additive logistic model produced slightly biased estimates of the maternal genotype effects. The prospective likelihood also produced biased estimates of the maternal genotype main and interaction effects (possibly due to maximization problems, see below). The bias of stratum-specific estimates from the stratified retrospective likelihood was large in most strata. The other estimates were nearly unbiased. In both cases, the retrospective likelihood produced slightly lower standard errors with the double-additive model than with the unspecified distribution of *G*^{M}, *X*, both of which were lower than standard errors from the prospective likelihood and logistic regression. Maximization of the retrospective likelihood always succeeded, while maximization of the prospective likelihood failed for a substantial proportion of replicates and outlying coefficient estimates were obtained in the scenario under dependence for a small number of other replicates, which led to extremely large empirical standard errors. For the replicates where it succeeded, the standard error of the estimates tended to be underestimated, resulting in CI undercoverage. For the retrospective likelihood, CI coverage was close to nominal value both under dependence and independence with the unspecified distribution of *G*^{M}, *X* and under independence only with the double-additive logistic model. Undercoverage is noticeable for the maternal genotypic effect under dependence with the double-additive logistic model, and for most estimates from the stratified retrospective likelihood, particularly under dependence due to bias. Increasing the proportion of MCAR child genotypes to 20% did not produce a noticeable increase in empirical standard error (Fig. 2), but slight biases appeared or were increased for the maternal and child genotype effects under the modified retrospective likelihood (Fig. 1). Biases were more substantial and empirical standard errors larger when the child genotype was missing not at random under the extreme scenario we considered. The bias of stratum-specific estimates from the stratified retrospective likelihood was worse under the scenarios with 20% missing child genotypes (results not shown).

We also examined the type I error and power of tests of maternal and child genotypes effect modification by *X*_{1} (Table 3). Type I error was generally well controlled; only the trend test under stratified retrospective likelihood exhibited important type I error inflation under the dependence scenario only. The modified retrospective likelihood Wald test achieved the highest power among all methods, with the double-additive logistic model having a slight advantage over the unspecified distribution of *G*^{M}, *X*.

We investigated whether the bias of the stratified retrospective likelihood in the dependence scenario could be due to small sample within each stratum by multiplying the sample size by 10, and observed bias remained substantial despite empirical estimates of relative risk close to the expected value in the *G*^{C} × *G*^{M} × *Y* table in each stratum defined by *X*_{1} (results not shown). The stratified retrospective likelihood should yield unbiased estimates with correctly specified mating type frequencies, and the problem may be due to misspecification of the mating type frequencies in Haplin since mating symmetry is not satisfied under our simulation scenario with dependence.

In summary, as expected the modified retrospective likelihood with unspecified distribution of *G*^{M}, *X* was robust to dependence of the environmental exposure on the maternal genotype but led to slightly less precise estimates than the double-additive logistic model when the environmental exposure was independent of the maternal genotype. Maximization problems with the prospective likelihood led to poor performance on several metrics, and logistic regression was expectedly inefficient. The implementation of the stratified retrospective likelihood in Haplin led to biased estimates in this particular simulation setup.

### Case Study: CYP2E1, Drinking Water Disinfection By-Products and Small-For-Gestational-Age Babies Revisited

Trihalomethanes (THMs) are major by-products of drinking water disinfection by chlorination. The gene *CYP2E1* encodes part of the cytochrome P-450 which plays an important role in phase-1 biological activation of xenobiotics and particularly chloroform [13, 14] which is the main THM. It may also metabolize other THMs such as bromodichloromethane [15]. Infante-Rivard [16] first reported that the *CYP2E1* promoter variant C allele of the rs3813867 (G1259C) single nucleotide polymorphism in the genotype of newborns significantly increases the positive association of high levels of total THMs to which their mothers were exposed in drinking water during pregnancy with the risk that the newborn be small for gestational age (the modifying effect of the mother C allele was in the same direction but not statistically significant). They performed standard logistic regression of a case-mother-control-mother study of singleton births (430 cases and 412 controls), adjusted for potential confounding factors of the association between total THMs and risk of small-for-gestational-age babies but did not take into account the genotypes of the mother and newborn within the same regression model as was done in model (1).

Levallois et al. [9] attempted to replicate this modifying effect of the rs3813867 C allele in another case-mother-control-mother study. In their primary analysis, Levallois et al. [9] also performed logistic regression adjusting for potential confounding factors (fully adjusted model), but this time including both child and mother genotype in the same model for 1,427 mother-child pairs. That analysis provided no evidence of modifying effect of either the mother or child C allele. The C allele had low frequency in the study population (empirical frequency of 3% in the mothers of controls, similar to the population estimates from the prospective and retrospective likelihood analyses reported below), so that homozygous CC genotypes were absent in controls and their mothers, compatible with HWE expectations (exact test *p* = 1.0 in the controls, 0.64 in the mothers). The results are reproduced in Table 4, including the interaction OR and its 95% CI not reported previously. To overcome the inefficiency of logistic regression and exploit the information from mother-child pairs where child genotype was missing, the prospective likelihood (3) was applied, using a partially adjusted model containing only two major confounders (body mass index and parity) due to the limitation on the dimension of *X* for this approach. This partially adjusted analysis on an extended sample of 1,535 mother-child pairs after reintroducing pairs where the child genotype and other covariates were missing provided weak evidence of a modifying effect where high total THMs reduces the risk of being a case in the presence of a maternal C allele (interaction *p* = 0.09) while the null hypothesis that the child C allele had no modifying effect could not be rejected, contrary to the report by Infante-Rivard [16].

Method (N)
. | rs3813867 Genotype . | Partial adjustment^{a}
. | Full adjustment^{b}
. | ||||
---|---|---|---|---|---|---|---|

Str OR^{c} (95% CI)
. | Inter OR^{d} (95% CI)
. | Inter P^{e}
. | Str OR (95% CI) . | Inter OR (95% CI) . | P Inter . | ||

Logistic regression^{f} | Wild type | 1.3 (1.0–1.8) | 1.2 (0.8–1.8) | ||||

(1,427) | Newborn variant | 1.8 (0.4–7.6) | 1.3 (0.3–5.6) | 0.66 | 3.7 (0.7–19.1) | 2.0 (0.5–8.7) | 0.36 |

Mother variant | 0.5 (0.1–2.2) | 0.4 (0.1–1.8) | 0.21 | 0.4 (0.1–7.1) | 0.4 (0.1–1.9) | 0.23 | |

Prospective | Wild type | 1.3 (0.9–1.7) | NA | ||||

likelihood | Newborn variant | 1.9 (0.4–7.7) | 1.5 (0.4–6.2) | 0.59 | NA | NA | NA |

(1,528)^{g} | Mother variant | 0.4 (0.1–1.5) | 0.3 (0.1–1.2) | 0.09 | NA | NA | NA |

Retrospective | Wild type | 1.3 (0.9–1.7) | 1.3 (0.9–1.7) | ||||

likelihood - DA^{h} | Newborn variant | 2.0 (0.4–9.8) | 1.6 (0.4–6.2) | 0.49 | 2.4 (0.5–12.3) | 1.9 (0.5–7.5) | 0.37 |

(1,528) | Mother variant | 0.4 (0.1–1.4) | 0.3 (0.1–1.2) | 0.08 | 0.4 (0.1–1.5) | 0.3 (0.1–1.1) | 0.07 |

Retrospective | Wild type | 1.3 (0.9–1.7) | 1.3 (0.9–1.8) | ||||

likelihood - UN^{i} | Newborn variant | 2.0 (0.4–10.1) | 1.6 (0.4–6.3) | 0.49 | 2.4 (0.5–12.6) | 1.9 (0.5–7.7) | 0.37 |

(1,528) | Mother variant | 0.4 (0.1–1.5) | 0.3 (0.1–1.2) | 0.08 | 0.4 (0.1–1.5) | 0.3 (0.1–1.3) | 0.10 |

Method (N)
. | rs3813867 Genotype . | Partial adjustment^{a}
. | Full adjustment^{b}
. | ||||
---|---|---|---|---|---|---|---|

Str OR^{c} (95% CI)
. | Inter OR^{d} (95% CI)
. | Inter P^{e}
. | Str OR (95% CI) . | Inter OR (95% CI) . | P Inter . | ||

Logistic regression^{f} | Wild type | 1.3 (1.0–1.8) | 1.2 (0.8–1.8) | ||||

(1,427) | Newborn variant | 1.8 (0.4–7.6) | 1.3 (0.3–5.6) | 0.66 | 3.7 (0.7–19.1) | 2.0 (0.5–8.7) | 0.36 |

Mother variant | 0.5 (0.1–2.2) | 0.4 (0.1–1.8) | 0.21 | 0.4 (0.1–7.1) | 0.4 (0.1–1.9) | 0.23 | |

Prospective | Wild type | 1.3 (0.9–1.7) | NA | ||||

likelihood | Newborn variant | 1.9 (0.4–7.7) | 1.5 (0.4–6.2) | 0.59 | NA | NA | NA |

(1,528)^{g} | Mother variant | 0.4 (0.1–1.5) | 0.3 (0.1–1.2) | 0.09 | NA | NA | NA |

Retrospective | Wild type | 1.3 (0.9–1.7) | 1.3 (0.9–1.7) | ||||

likelihood - DA^{h} | Newborn variant | 2.0 (0.4–9.8) | 1.6 (0.4–6.2) | 0.49 | 2.4 (0.5–12.3) | 1.9 (0.5–7.5) | 0.37 |

(1,528) | Mother variant | 0.4 (0.1–1.4) | 0.3 (0.1–1.2) | 0.08 | 0.4 (0.1–1.5) | 0.3 (0.1–1.1) | 0.07 |

Retrospective | Wild type | 1.3 (0.9–1.7) | 1.3 (0.9–1.8) | ||||

likelihood - UN^{i} | Newborn variant | 2.0 (0.4–10.1) | 1.6 (0.4–6.3) | 0.49 | 2.4 (0.5–12.6) | 1.9 (0.5–7.7) | 0.37 |

(1,528) | Mother variant | 0.4 (0.1–1.5) | 0.3 (0.1–1.2) | 0.08 | 0.4 (0.1–1.5) | 0.3 (0.1–1.3) | 0.10 |

^{a}Odds ratios adjusted for prepregnancy body mass index (BMI) and parity.

^{b}Odds ratios adjusted for maternal age, maternal education, annual household income, prepregnancy BMI, parity, history of chronic disease, preeclampsia, active smoking during the third trimester, passive smoking throughout pregnancy, coffee and alcohol consumption during pregnancy and month of selection.

^{c}Stratum-specific odds ratio and 95% confidence interval.

^{d}Interaction odds ratio and 95% confidence interval.

^{e}Interaction *p*-value.

^{f}From [9].

^{g}7 mother-child pairs were removed compared to [9] due to missing values of covariates in the fully adjusted model

^{h}DA: double-additive model of the maternal C allele with total THM dichotomized at the 4th quartile as predictor.

^{i}UN, unspecified distribution for *G*^{M} and *X*.

Now that the modified retrospective likelihood allowing for partially missing genotypes (5) is available, we estimated the partially adjusted and the fully adjusted model on a common dataset of 1,528 mother-child pairs with all covariates available (321 cases and 1,207 controls), comprising the 1,427 mother-child pairs with observed child genotype and the 101 with missing child genotype, using the CCMO.na and CCMO.dep functions (we did not reintroduce 7 mother-child pairs where the mother genotype was missing because such pairs could only be used under the double-additive model of the maternal genotype implemented in CCMO.dep). We prespecified total THMs as a predictor of the maternal C allele in the double-additive model, and there was indeed some evidence of association between high total THMs and the maternal C allele detectable in the control sample (OR = 1.7, exact 95% CI: [1.0, 2.8], Fisher’s exact test *p* = 0.04) as well as through a comparison of the prespecified model to the model assuming independence of the maternal C allele from all covariates (likelihood ratio ?^{2} with 1 d.f.: 5.40, *p* = 0.02). Analyses of a simulated dataset with the methods applied here under the partially adjusted model are illustrated in the online supplementary material.

For the partially adjusted model, there is essentially no difference between the results from the prospective and retrospective likelihoods (Table 4). Although the total THM-maternal genotype interaction OR estimate remained 0.3 at first decimal place across all retrospective likelihood analyses reported in Table 4, the coefficient estimate and its estimated standard error varied between analyses. Under the double-additive logistic model, the coefficient estimate was slightly lower for the fully adjusted model than for the partially adjusted model (-1.30 vs. -1.25) while the estimated standard error remained similar (0.73 vs. 0.72), resulting in a slightly smaller interaction *p* value for the fully adjusted model. Under the unspecified distribution of *G*^{M}, *X* the same fully-to-partially adjusted model comparison revealed similar coefficients estimates (-1.26 vs. -1.25) but a larger estimated standard error (0.76 vs. 0.73) resulting in a slightly larger interaction *p* value for the fully adjusted model. More importantly, the fully adjusted logistic regression model produced an estimate of this coefficient (-0.87) much closer to 0 than the retrospective likelihood, with an estimated standard error (0.76) as large as under the retrospective likelihood with unspecified distribution of *G*^{M}, *X*, so that the interaction *p* value was the largest among all reported tests of the null hypothesis that this coefficient equals 0. The similarity of coefficient estimates from the double-additive logistic model with total THMs as covariate with those obtained when the distribution of *G*^{M}, *X* is left unspecified suggests this model is adequate. The interaction *p* value at 0.07 provides some evidence that the C allele in the mother leads to a negative association between total THMs and risk of small-for-gestational-age babies, in the opposite direction from the statistically nonsignificant increase for this association previously reported by [16]. Subsequent studies of the modifying effect of *CYP2E1* polymorphisms on the association between disinfection by-products and small-for-gestational-age babies reported results that are congruent with either an increase [17] or a reduction [18] of risk by high chlorination by-product exposure in presence of the maternal C allele, but important methodological differences (SNP genotyped, exposure measured, participant characteristics, outcome definition) prevent combining estimates from all studies to reach a conclusion.

## Discussion and Conclusion

Over the last decade, semiparametric estimation methods have been developed for more efficient analysis of data from case-mother-control-mother designs. With the availability of software to fit them, they should be preferred to the inefficient standard logistic regression. We evaluated the performance of readily usable functions to fit these models in the R statistical software environment on simulated data, focusing on gene-environment interaction effects, maternal genotype-environmental exposure relationship, and missing child genotype.

The limitation of the stratified retrospective likelihood to a categorical exposure variable without adjustment for covariates leads us to discourage its application to case-mother-control-mother data. The modified retrospective likelihood implemented in the CCMO.na and CCMO.dep R functions is particularly appealing, as it can handle arbitrary covariate vectors in terms of number of variables and number of levels of the variables, as highlighted in the case study, and partially missing offspring genotypes, although large proportions of missing genotypes can induce bias, which can be substantial if child genotypes are missing not at random. The unspecified distribution of *G*^{M}, *X* implemented in CCMO.na is recommended due to its robustness. However, when the double-additive logistic model for Pr(*G*^{M} | *X*) holds (including the independence case), CCMO.dep provides a slight gain in efficiency. The prospective likelihood allowing for partially missing child genotype remains an option for low-dimensional covariate vectors and is implemented in the R package SPmlficmcm offering the familiar R model specification.

Although we used a single regression model, the effect sizes in this model were realistic, multiplicative genotype-exposure interactions were either present or absent, and the limited complexity was designed to suit all approaches reviewed. Further evaluation of the modified retrospective likelihood could examine more complex models including multiple continuous covariates and model misspecification, e.g., missing interaction terms, as the flexibility of this approach allows to entertain complex models.

## Acknowledgments

We would like to thank co-investigators, trainees, and research assistants from Centre de recherche du CHU de Québec and Université Laval involved in the small-for-gestational-age case-mother-control-mother study of drinking water disinfection by-products and genetic factors: Molière Nguile-Makao, Manuel Rodriguez, Céline Campagna, Robert Tardif, Nathalie Bernard, Sylvie Desjardins, Catherine Gonthier, Christelle Legay, and Sylvie Marcoux.

## Statement of Ethics

The access to the birth certificates for the selection of cases and controls was authorized by the Commission d’accès à l’information of Québec. In the first study stage, participants provided written informed consent for collection of exposure and sociodemographic variables and, in the second study stage, reconsented to provide their own blood or saliva sample as well as cord blood or saliva sample from their child for genetic analyses. This study was approved by the CHU de Québec Ethics Review Board (project 2012-554) and conforms to the Declaration of Helsinki.

## Conflict of Interest Statement

The authors have no conflicts of interest to declare.

## Funding Sources

This work was supported by the National Natural Science Foundation of China [Grant No. 12171451, 72091212, 11771096], the National Institutes of Health of the USA [Grant No. R01ES016626], and the Canadian Institutes for Health Research [Grant No. MOP 102708]. The funders were not involved in study design; collection, analysis, and interpretation of data; writing of the report; and imposed no restrictions regarding the submission of the report for publication.

## Author Contributions

Alexandre Bureau planned the study, conducted the simulations and the case study, and drafted the manuscript. Yuang Tian programmed the CCMO.dep and CCMO.na functions, wrote part of the simulation code, and formulated suggestions integrated in the manuscript text. Patrick Levallois and Yves Giguère conceived and oversaw the original study of small-for-gestational-age babies on which the case study is based. Jinbo Chen and Hong Zhang formulated suggestions integrated in the manuscript text. All authors revised the manuscript and approved the final version.

## Data Availability Statement

The R packages SPmlficmcm and Haplin are available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=SPmlficmcm and http://CRAN.R-project.org/package=Haplin. The functions CCMO.na and CCMO.dep are available on GitHub at https://github.com/yatian20/CCMO.na and https://github.com/yatian20/CCMO.dep, respectively. Scripts in R code to reproduce the simulation study reported in Tables 2 and 3 and the empirical distribution of a subset of variables from the case study data along with code to simulate a dataset similar to the actual one are available at https://github.com/abureau/CMCMsim. The data of the case study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.