Abstract
Binary and censored variables can lead to erroneous inferences about heritability in family studies if the dichotomous or censored nature of the dependent variable is not properly accounted for. The bivariate probit and tobit models proposed in this paper provide a unified approach to family studies with binary, ordered, and censored variables. Each model in this paper is derived from a similar latent-variable structure which can contain covariates that affect the expected value of the dependent variable, as well as genetic and shared environmental influences that lead to an association among related individuals. We apply the models to the fertility outcome and fertility motivations of Danish twins born 1953–64 and find relevant genetic influences on the number of children as well as the desired timing of the first child. Keywords: twin studies, bivariate probit, bivariate censored regression Back to [ Hans-Peter Kohler] [ Data and Programs] |
In Kohler and Rodgers (1999) we have introduced bivariate (ordered) probit methods for the estimation of heritabilities and shared environmental effects with twin data. These methods are suitable when the phenotype of interest is either a binary or ordered variable. Our initial interest was primarily the number of children, but these programs are also applicable to many other binary and ordered traits.
The programs provided below are subject to our disclaimer and are written for STATA version 6.0. In order to install the programs, you need to download the file twbiopr.zip and unzip it in your personal ado directory. This adds the files twbiopr.ado, twbiopr.hlp, twb2lf.ado, twbolf.ado to this directory.
The syntax of the command twbiopr is as follows (from help files):
------------------------------------------------------------------------------- help for twbiopr ------------------------------------------------------------------------------- Estimation of bivariate (ordered) probit models for twin data ------------------------------------------------------------- twbiopr depvar1 depvar2 [weight] [if exp] [in range] [, xbeqc(varlist) xbeqi(varlist) rhoeq(varlist) maxcat(#) truncate norhocons rhogues(numlist) trace] fweights and pweights are allowed; see help weights. twbiopr shares the features of all estimation commands; see help est (except, predict is not implemented). Description ----------- twbiopr performs the maximum likelihood estimation of bivariate (ordered) probit models for twin data (see Kohler and Rodgers, 1999). The data set needs to be in a "wide" form so that the information for a twin pair is in one line of the dataset. Moreover, variables that contain individual-specific information (e.g., education, weight, etc.) need to be indicated with a suffix 1 or 2 for twin 1 and 2 in a pair. This naming convention is critical because the program looks for variables for twin 1 and 2 according to this rule. Variables that pertain to the twin pair (e.g., age or birth year) can be named without restrictions. The variables depvar1 and depvar2 contain the phenotype of interest. These dependent variable must contain integers that range from zero to maxcat, so that the total number of categories (= unique realizations of depvar1 and depvar2) is maxcat + 1. The application of bivariate (ordered) probit models is similar to the estimation of polychoric estimations. However, the estimation allows additional for the inclusion of covariates that capture shifting mean realizations of the phenotype according to observed characteristics of the twin pair or each individual within a pair. These variables are specified using xbeqc(varlist) for variables that pertain to the pair (i.e., are equal for both twins within a pair) and xbeqi(varlist) for individual-specific covariates. See http://user.demogr.mpg.de/kohler for further examples. Options for twbiopr -------------------- xbeqc(varlist) specifies twin-pair specific variables that affect the mean realization of the phenotype. xbeqi(varlist) specifies individual specific variables that affect the mean realization of the phenotype. The varlist given in contains only the stem of the variable name. The full name of the variable must end in either 1 or 2 depending on whether the information pertains to the first or second twin in a pair (e.g., xbeqi(educ) is used if the variables educ1 and educ2 include the education for twin 1 and 2. rhoeq(varlist) specifies the equation for the correlation between twin 1 and 2 (see example below). maxcat specifies the maximum value of the dependent variable. If not specified, the option is set to the maximum value of depvar1 and depvar2. truncate specifies that all values of the dependent variable that exceed the value specified in maxcat are set equal to maxcat. This allows the combination of categories with only very few observation in the upper tail of the distribution. The data is restored to the original data after the estimation. norhocons specifies that no constant is added to the varlist given in rhoeq in the estimation. This is useful when rhoeq is specified in a way that already includes a constant. rhogues(numlist) specifies starting values for the parameters of rhoeq. The number of values in numlist must equal the number of variables given in rhoeq(varlist) plus one (if norhocons is not specified) or the number number of variables in rhoeq(varlist) (if norhocons is specified) trace allows to trace the mle estimation. Examples -------- .g c2 = 1 .g h2 = Radd .twbiopr nkids1 nkids2 if female == 1, xbeqc(birthy) xbeqi(educ) rhoeq(c2 h2) maxcat(3) truncate rhogues(0,.5) norhocons Note: the variable Radd contains the additve genetic relatedness of a twin pair, and the specification above yields direct estimates of the heritability (h2) and shared environmental influences (c2) for the phenotype nkids, accounting for the fact that the mean of nkids depnds on birth year (birthy) and education (educ1 and educ2 for twin 1 and 2). |
Following are some more detailed examples that are obtained by running the do-file twbiodem-analysis on a data set of twin fertility (described in Kohler and Rodgers 1999). The corresponding log file (with comments about the estimation) is:
> *! Demonstration of bivariate (ordered) probit models > *! for twin data; . *! load data > *! this is a wide dataset in which each twin pair is one line > *! in the dataset and the information for twin 1 and 2 in a > *! pair are indicated with the suffice 1 and 2 to each > *! individual-specific variable; . use twbiodem-wtwpairs, clear; (ss DZ + MZ: sinlge entry, wide) . format sex zygos %12.0g; . format birthy nkids1 nkids2 oyelem1 oyelem2 %6.0g; . *! LIST PRIMARY VARIABLES IN DATA > *! Variables are: > *! twpair: id of twin pair > *! sex sex (same sex twin pairs only) > *! zygos zygosity > *! birthy birth year of twin pair > *! nkids number of children at age 35 > *! oyelem years of elementary education > *! oyseco years of secondary education > *! ; . list sex zygos birthy nkids1 nkids2 oyelem1 oyelem2 in 1/10, nodisplay; sex zygos birthy nkids1 nkids2 oyelem1 oyelem2 1. Male MZ 53 0 2 7 8 2. Female DZ same sex 53 0 1 12 13 3. Male MZ 53 0 0 13 13 4. Female MZ 53 2 2 10 10 5. Male DZ same sex 53 2 0 9 9 6. Male MZ 53 0 0 12 12 7. Female DZ same sex 53 2 2 10 10 8. Female DZ same sex 53 3 2 13 13 9. Male MZ 53 2 3 9 9 10. Male DZ same sex 53 1 2 12 13 . *! SUMMARY OF THESE VARIABLES; . tab1 sex zygos; -> tabulation of sex R: sex | Freq. Percent Cum. ------------+----------------------------------- Female | 1314 50.62 50.62 Male | 1282 49.38 100.00 ------------+----------------------------------- Total | 2596 100.00 -> tabulation of zygos zygosity of twin | Freq. Percent Cum. ---------------------+----------------------------------- DZ same sex | 1577 60.75 60.75 MZ | 1019 39.25 100.00 ---------------------+----------------------------------- Total | 2596 100.00 . su birthy nkids1 nkids2 oyelem1 oyelem2; Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- birthy | 2596 58.46456 3.072015 53 63 nkids1 | 2596 1.505008 1.076807 0 6 nkids2 | 2596 1.520031 1.075199 0 6 oyelem1 | 2596 10.86171 1.904801 1 26 oyelem2 | 2596 10.81857 1.816595 2 26 . *! CROSS TAB OF NUMBER OF CHILDREN AND > *! PEARSON CORRELATIONS BY SEX AND ZYGOSITY; . tab nkids*; | 2 nk35x 1 nk35x | 0 1 2 3 4 | Total -----------+-------------------------------------------------------+---------- 0 | 242 144 189 41 13 | 630 1 | 109 110 205 55 4 | 484 2 | 193 211 503 165 29 | 1103 3 | 48 52 132 65 12 | 309 4 | 12 4 26 11 7 | 61 5 | 1 2 3 1 1 | 8 6 | 0 0 0 1 0 | 1 -----------+-------------------------------------------------------+---------- Total | 605 523 1058 339 66 | 2596 | 2 nk35x 1 nk35x | 5 6 | Total -----------+----------------------+---------- 0 | 1 0 | 630 1 | 1 0 | 484 2 | 2 0 | 1103 3 | 0 0 | 309 4 | 0 1 | 61 5 | 0 0 | 8 6 | 0 0 | 1 -----------+----------------------+---------- Total | 4 1 | 2596 . sort sex zygos; . by sex zygos: corr nkids*; -> sex= Female zygos= DZ same sex (obs=795) | nkids1 nkids2 ---------+------------------ nkids1 | 1.0000 nkids2 | 0.1523 1.0000 -> sex= Female zygos= MZ (obs=519) | nkids1 nkids2 ---------+------------------ nkids1 | 1.0000 nkids2 | 0.3874 1.0000 -> sex= Male zygos= DZ same sex (obs=782) | nkids1 nkids2 ---------+------------------ nkids1 | 1.0000 nkids2 | 0.1629 1.0000 -> sex= Male zygos= MZ (obs=500) | nkids1 nkids2 ---------+------------------ nkids1 | 1.0000 nkids2 | 0.2925 1.0000 . *! ESTIMATION OF BIVARIATE PROBIT MODEL FOR HAVING > *! AT LEAST ON CHILD > *! > *! because there may be a cohort trend in childlessness, we include > *! birth year among the covariates; . *! estimation only for females; . for num 1/2: g anykidX = (nkidsX > 0) if nkidsX != .; -> g anykid1 = (nkids1 > 0) if nkids1 != . -> g anykid2 = (nkids2 > 0) if nkids2 != . . g c2 = 1; . g h2 = Radd; . *! note: the option norhocons is included because we defined a > *! varable c2 that is a constant; . twbiopr anykid1 anykid2 if female == 1, > xbeqc(birthy) > rhoeq(c2 h2) > norhocons; initial: log likelihood = -1291.7891 rescale: log likelihood = -1291.7891 rescale eq: log likelihood = -1291.7891 Iteration 0: log likelihood = -1291.7891 Iteration 1: log likelihood = -1263.9128 Iteration 2: log likelihood = -1263.7387 Iteration 3: log likelihood = -1263.7387 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: anykid1 anykid2 method: twbiopr number of categories: 2 tw-pair specific vars: birthy twin 1 specific vars: twin 2 specific vars: options: norhocons Bivariate probit for twin analysis Number of obs = 1314 Wald chi2(1) = 0.19 Log likelihood = -1263.7387 Prob > chi2 = 0.6610 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | .0044455 .0101384 0.438 0.661 -.0154254 .0243163 _cons | .5991919 .5965704 1.004 0.315 -.5700647 1.768448 ---------+-------------------------------------------------------------------- rhoeq | c2 | -.1257573 .1501941 -0.837 0.402 -.4201323 .1686177 h2 | .6836169 .1909519 3.580 0.000 .3093581 1.057876 ------------------------------------------------------------------------------ . *! the same estimation can also be combined for males and > *! females, where include a dummy for females to account for > *! differences in the level of childlessness across sex; . g male_c2 = female == 0; . g fem_c2 = female; . g male_h2 = Radd * (female == 0); . g fem_h2 = Radd * female; . twbiopr anykid1 anykid2, > xbeqc(birthy female) > rhoeq(male_c2 male_h2 fem_c2 fem_h2) > norhocons; initial: log likelihood = -2820.0599 rescale: log likelihood = -2820.0599 rescale eq: log likelihood = -2820.0599 Iteration 0: log likelihood = -2820.0599 Iteration 1: log likelihood = -2766.421 Iteration 2: log likelihood = -2766.2015 Iteration 3: log likelihood = -2766.2015 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: anykid1 anykid2 method: twbiopr number of categories: 2 tw-pair specific vars: birthy female twin 1 specific vars: twin 2 specific vars: options: norhocons Bivariate probit for twin analysis Number of obs = 2596 Wald chi2(2) = 47.36 Log likelihood = -2766.2015 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | -.0029282 .0068627 -0.427 0.670 -.0163788 .0105225 female | .2902611 .0422176 6.875 0.000 .2075162 .373006 _cons | .7423233 .4000285 1.856 0.063 -.0417182 1.526365 ---------+-------------------------------------------------------------------- rhoeq | male_c2 | .0356384 .136298 0.261 0.794 -.2315007 .3027776 male_h2 | .4105174 .1794228 2.288 0.022 .0588552 .7621796 fem_c2 | -.1221149 .1500639 -0.814 0.416 -.4162348 .1720051 fem_h2 | .6791143 .190996 3.556 0.000 .304769 1.053459 ------------------------------------------------------------------------------ . *! since this model yields insignificant values for c2 for > *! both males and females, we may want to re-estimate as; . twbiopr anykid1 anykid2, > xbeqc(birthy female) > rhoeq(male_h2 fem_h2) > norhocons; initial: log likelihood = -2820.0599 rescale: log likelihood = -2820.0599 rescale eq: log likelihood = -2820.0599 Iteration 0: log likelihood = -2820.0599 Iteration 1: log likelihood = -2766.799 Iteration 2: log likelihood = -2766.5691 Iteration 3: log likelihood = -2766.5691 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: anykid1 anykid2 method: twbiopr number of categories: 2 tw-pair specific vars: birthy female twin 1 specific vars: twin 2 specific vars: options: norhocons Bivariate probit for twin analysis Number of obs = 2596 Wald chi2(2) = 47.41 Log likelihood = -2766.5691 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | -.0030367 .0068695 -0.442 0.658 -.0165008 .0104273 female | .290985 .0422987 6.879 0.000 .2080811 .3738889 _cons | .7485322 .4004135 1.869 0.062 -.0362638 1.533328 ---------+-------------------------------------------------------------------- rhoeq | male_h2 | .4547142 .0581831 7.815 0.000 .3406773 .568751 fem_h2 | .5302304 .0628146 8.441 0.000 .4071161 .6533446 ------------------------------------------------------------------------------ . *! and we can test of whether the male and female h2 are different > *! as; . test male_h2 = fem_h2; ( 1) [rhoeq]male_h2 - [rhoeq]fem_h2 = 0.0 chi2( 1) = 0.78 Prob > chi2 = 0.3778 . *! finally, since the years of elementary education may be a > *! determint of having any children, we add the years of > *! education as an individual-specific covariate; . twbiopr anykid1 anykid2 if female == 1, > xbeqc(birthy) > xbeqi(oyelem) > rhoeq(c2 h2) > norhocons; initial: log likelihood = -1285.2869 rescale: log likelihood = -1285.2869 rescale eq: log likelihood = -1285.2869 Iteration 0: log likelihood = -1285.2869 Iteration 1: log likelihood = -1258.731 Iteration 2: log likelihood = -1258.5725 Iteration 3: log likelihood = -1258.5725 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: anykid1 anykid2 method: twbiopr number of categories: 2 tw-pair specific vars: birthy twin 1 specific vars: oyelem1 twin 2 specific vars: oyelem2 options: norhocons Bivariate probit for twin analysis Number of obs = 1314 Wald chi2(1) = 0.76 Log likelihood = -1258.5725 Prob > chi2 = 0.3842 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | .008912 .0102419 0.870 0.384 -.0111618 .0289859 _cons | .9693345 .6081683 1.594 0.111 -.2226534 2.161322 ---------+-------------------------------------------------------------------- oyelem | _cons | -.0567758 .0176626 -3.214 0.001 -.0913939 -.0221577 ---------+-------------------------------------------------------------------- rhoeq | c2 | -.1382553 .1516297 -0.912 0.362 -.435444 .1589334 h2 | .6897588 .1930029 3.574 0.000 .31148 1.068038 ------------------------------------------------------------------------------ . *! NUMBER OF CHILDREN IN CATEGORIES 0, 1, 2, 3 and more; . *! because only relatively few indiviuals have more than > *! three children (see above), we combine all of them into a > *! single category 3+ children > *! this can be done by adding the truncate option in twbiopr > *! ; . twbiopr nkids1 nkids2 if female == 1, > xbeqc(birthy) > rhoeq(c2 h2) > maxcat(3) truncate > norhocons; Variables nkids1 and nkids2 have been trucated at 3 data will be restored after analysis initial: log likelihood = -3419.138 rescale: log likelihood = -3419.138 rescale eq: log likelihood = -3419.138 Iteration 0: log likelihood = -3419.138 Iteration 1: log likelihood = -3374.216 Iteration 2: log likelihood = -3372.1509 Iteration 3: log likelihood = -3372.1483 Iteration 4: log likelihood = -3372.1483 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: nkids1 nkids2 method: twbiopr number of categories: 4 tw-pair specific vars: birthy twin 1 specific vars: twin 2 specific vars: options: norhocons Bivariate ordered probit for twin analysis Number of obs = 1314 Wald chi2(1) = 4.84 Log likelihood = -3372.1483 Prob > chi2 = 0.0279 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | .0168926 .0076821 2.199 0.028 .0018359 .0319492 _cons | -.1302755 .4521797 -0.288 0.773 -1.016531 .7559804 ---------+-------------------------------------------------------------------- lnc1 | _cons | -.526162 .0400624 -13.134 0.000 -.6046829 -.4476412 ---------+-------------------------------------------------------------------- lnc2 | _cons | .2031461 .0252451 8.047 0.000 .1536667 .2526256 ---------+-------------------------------------------------------------------- rhoeq | c2 | -.0902567 .0887411 -1.017 0.309 -.2641861 .0836726 h2 | .51919 .1150935 4.511 0.000 .2936109 .7447691 ------------------------------------------------------------------------------ . *! again, this estimation can be combined for both sexes as > *! (given our estimates above, we also specify starting values); . twbiopr nkids1 nkids2, > xbeqc(birthy female) > rhoeq(male_c2 male_h2 fem_c2 fem_h2) > maxcat(3) truncate > rhogues(.2,.3,.15,.5) > norhocons; Variables nkids1 and nkids2 have been trucated at 3 data will be restored after analysis initial: log likelihood = -6755.2957 rescale: log likelihood = -6755.2957 rescale eq: log likelihood = -6703.3604 Iteration 0: log likelihood = -6703.3604 Iteration 1: log likelihood = -6696.4537 Iteration 2: log likelihood = -6696.2567 Iteration 3: log likelihood = -6696.2561 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: nkids1 nkids2 method: twbiopr number of categories: 4 tw-pair specific vars: birthy female twin 1 specific vars: twin 2 specific vars: options: norhocons Bivariate ordered probit for twin analysis Number of obs = 2596 Wald chi2(2) = 50.09 Log likelihood = -6696.2561 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | .0139016 .0054181 2.566 0.010 .0032823 .0245209 female | .2092427 .0333155 6.281 0.000 .1439455 .2745399 _cons | -.2033732 .3157383 -0.644 0.519 -.8222088 .4154625 ---------+-------------------------------------------------------------------- lnc1 | _cons | -.6025694 .0288364 -20.896 0.000 -.6590877 -.546051 ---------+-------------------------------------------------------------------- lnc2 | _cons | .1841888 .0185525 9.928 0.000 .1478264 .2205511 ---------+-------------------------------------------------------------------- rhoeq | male_c2 | .0741291 .0900222 0.823 0.410 -.1023112 .2505693 male_h2 | .2471484 .1182077 2.091 0.037 .0154655 .4788313 fem_c2 | -.0919832 .09152 -1.005 0.315 -.2713591 .0873927 fem_h2 | .535359 .1176755 4.549 0.000 .3047193 .7659987 ------------------------------------------------------------------------------ . *! finally, we re-estimate the first model, adding education > *! (elementary and secondary) as an individual-specific covarate; . twbiopr nkids1 nkids2 if female == 1, > xbeqc(birthy) > xbeqi(oyelem oyseco) > rhoeq(c2 h2) > maxcat(3) truncate > rhogues(0,.5) > norhocons; Variables nkids1 and nkids2 have been trucated at 3 data will be restored after analysis initial: log likelihood = -3368.665 rescale: log likelihood = -3368.665 rescale eq: log likelihood = -3368.665 Iteration 0: log likelihood = -3368.665 Iteration 1: log likelihood = -3360.7153 Iteration 2: log likelihood = -3360.6696 Iteration 3: log likelihood = -3360.6696 ******************************************************************************* BIVARIATE (ORDERED) PROBIT ESTIMATION FOR TWIN DATA: dependent variables: nkids1 nkids2 method: twbiopr number of categories: 4 tw-pair specific vars: birthy twin 1 specific vars: oyelem1 oyseco1 twin 2 specific vars: oyelem2 oyseco2 options: norhocons Bivariate ordered probit for twin analysis Number of obs = 1314 Wald chi2(1) = 6.17 Log likelihood = -3360.6696 Prob > chi2 = 0.0130 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- xbeqc | birthy | .0192335 .0077427 2.484 0.013 .0040581 .034409 _cons | .233386 .4586173 0.509 0.611 -.6654874 1.132259 ---------+-------------------------------------------------------------------- oyelem | _cons | -.0398152 .0132084 -3.014 0.003 -.0657031 -.0139273 ---------+-------------------------------------------------------------------- oyseco | _cons | -.0407193 .0155765 -2.614 0.009 -.0712486 -.01019 ---------+-------------------------------------------------------------------- lnc1 | _cons | -.5187502 .040049 -12.953 0.000 -.5972449 -.4402556 ---------+-------------------------------------------------------------------- lnc2 | _cons | .2087939 .0252234 8.278 0.000 .1593568 .2582309 ---------+-------------------------------------------------------------------- rhoeq | c2 | -.1018014 .0893015 -1.140 0.254 -.2768292 .0732264 h2 | .5208983 .1161629 4.484 0.000 .2932233 .7485733 ------------------------------------------------------------------------------ . log close; |
Back to [ Hans-Peter Kohler] [ Data and Programs]