Comparative Study
BMC Med Res Methodol. 2010 Jan 19;10:7.
doi: 10.1186/1471-2288-10-7.

Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

Andrea Marshall et al.

Abstract

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) a data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
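The three missingness mechanisms named above differ only in what drives the probability that a value is blanked out. The following is a minimal Python sketch of that idea, not the authors' simulation code (the study itself used R). The covariate names `x` and `z`, the lognormal shape, and the "1.5 times the rate above the median, 0.5 times below" rule are all illustrative assumptions.

```python
import math
import random
import statistics


def simulate_rows(n, rng):
    """Generate n rows with a skewed covariate 'x' and a fully observed 'z'.

    Hypothetical data: log(x) is standard normal and has correlation 0.5 with z.
    """
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = math.exp(0.5 * z + math.sqrt(0.75) * rng.gauss(0.0, 1.0))
        rows.append({"x": x, "z": z})
    return rows


def impose_missingness(rows, mechanism, rate, rng):
    """Return copies of rows with 'x' set to None under the given mechanism.

    MCAR: every value has the same chance of being missing.
    MAR:  the chance depends only on the observed covariate 'z'.
    MNAR: the chance depends on the value of 'x' itself.
    The overall missing fraction is roughly `rate` as long as 1.5 * rate <= 1.
    """
    med_x = statistics.median(r["x"] for r in rows)
    med_z = statistics.median(r["z"] for r in rows)
    result = []
    for r in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 1.5 * rate if r["z"] > med_z else 0.5 * rate
        elif mechanism == "MNAR":
            p = 1.5 * rate if r["x"] > med_x else 0.5 * rate
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        new_row = dict(r)  # copy so the complete data stay untouched
        if rng.random() < min(1.0, p):
            new_row["x"] = None
        result.append(new_row)
    return result


def observed_mean(rows):
    """Mean of 'x' over the rows where it is still observed."""
    return statistics.fmean(r["x"] for r in rows if r["x"] is not None)


if __name__ == "__main__":
    rng = random.Random(2010)
    complete = simulate_rows(20000, rng)
    print("full-data mean of x:", round(observed_mean(complete), 3))
    for mech in ("MCAR", "MAR", "MNAR"):
        incomplete = impose_missingness(complete, mech, 0.4, rng)
        print(mech, "observed mean of x:", round(observed_mean(incomplete), 3))
```

Under the MNAR rule, large values of `x` are removed more often, so the mean of the remaining values drifts below the full-data mean. No model that conditions on `z` alone can recover that loss, which is one way to see why MNAR is the hardest case for any imputation method.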

Results: CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, MICE-PMM produced, in general, the least biased estimates, better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
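The contrast between SI and MI in the results comes down to how the standard error is formed. The abstract does not say which pooling rule was used; the sketch below assumes Rubin's rules, the standard way to combine estimates across imputed datasets. The function name `pool_rubin` and the numbers in the usage note are made up for illustration.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset.
    variances: the squared standard error from each imputed dataset.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total ** 0.5
```

For example, with five hypothetical estimates `[0.42, 0.50, 0.46, 0.54, 0.48]`, each with variance 0.010, the pooled estimate is 0.48 and the total variance is 0.010 + 1.2 × 0.002 = 0.0124, a standard error of about 0.111. A single imputation would report only the within-imputation part, a standard error of 0.100. That omission of the between-imputation term is consistent with the underestimated variability and poor coverage reported for SI.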

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
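To make the predictive mean matching idea concrete, here is a deliberately simplified Python sketch with a single predictor. It is not the MICE-PMM implementation evaluated in the study: full implementations also draw the regression parameters at random for each imputation and cycle over several incomplete covariates, both of which this sketch skips. The function name `pmm_impute` and the donor count `k` are illustrative assumptions.

```python
import random


def pmm_impute(x, y, k, rng):
    """Fill missing entries of y (marked None) by predictive mean matching on x.

    x: fully observed predictor values (assumed to vary among the complete cases).
    y: incomplete covariate, with None where the value is missing.
    k: number of nearest donors to sample from.
    Returns a new list with every None replaced by an observed value of y.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    # Each donor pairs its predicted mean with its actual observed value.
    donors = [(predict(xi), yi) for xi, yi in observed]

    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            target = predict(xi)
            nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
            yi = rng.choice(nearest)[1]  # borrow a real observed value
        completed.append(yi)
    return completed
```

Because every imputed value is borrowed from an observed case, imputations stay within the range of the observed data and keep its skewed shape, even though the linear model is only used to decide which cases count as similar. This is a plausible reason, not one stated in the abstract, for PMM coping better with skewed covariates than methods that draw from a fitted normal model.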


Figures

Figure 1. a: Distribution of the covariates for the German breast cancer dataset; b: Distribution of the transformed continuous covariates in the German breast cancer dataset.
Figure 2. Regression coefficient estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 3. Average standard error (SE) estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 4. Coverage of the regression coefficient estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 5. Significance of the covariates in the prognostic model for different missing data methods and increasing percentage of MAR missingness.
Figure 6. Model performance measures for different missing data methods for increasing percentage of MAR missingness. a) Likelihood ratio test, b) Nagelkerke R² statistic, c) Prognostic separation D statistic and d) Predicted 2-year survival from Cox model.
Figure 7. Comparison of the regression coefficient estimates for the different MI methods after imposing MAR and MNAR mechanisms.
Figure 8. Comparison of coverage estimates for the different MI methods after imposing MAR and MNAR mechanisms.


