Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
- PMID: 20085642
- PMCID: PMC2824146
- DOI: 10.1186/1471-2288-10-7
Abstract
Background: There is no consensus on the most appropriate approach to handling missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.
Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
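The three missingness mechanisms named in the Methods can be illustrated with a minimal sketch. It is not the authors' simulation code (the study used R): the function names, the logistic form of the missingness model, the slope of 1.0 and the toy data are all assumptions made for illustration. Here `x` stands for a skewed incomplete covariate and `z` for a fully observed one.

```python
import math
import random
import statistics


def standardize(values):
    """Centre and scale values so the logistic slope is on a comparable scale."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def missing_probs(driver, rate, slope=1.0):
    """Logistic missingness probabilities driven by `driver` (hypothetical model).

    The intercept is tuned by bisection so that the mean probability
    equals the target proportion `rate`.
    """
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        probs = [1 / (1 + math.exp(-(mid + slope * d))) for d in driver]
        if statistics.fmean(probs) < rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2
    return [1 / (1 + math.exp(-(intercept + slope * d))) for d in driver]


def impose_missing(x, z, mechanism, rate, rng):
    """Return a copy of x with values set to None under the given mechanism."""
    if mechanism == "MCAR":
        # Missingness is unrelated to any data value.
        probs = [rate] * len(x)
    elif mechanism == "MAR":
        # Missingness depends only on the fully observed covariate z.
        probs = missing_probs(standardize(z), rate)
    elif mechanism == "MNAR":
        # Missingness depends on the value of x that goes missing.
        probs = missing_probs(standardize(x), rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if rng.random() < p else xi for xi, p in zip(x, probs)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [rng.gauss(0, 1) for _ in range(2000)]
    # Toy right-skewed covariate correlated with z (not the study's data).
    x = [math.exp(0.5 * zi + 0.5 * rng.gauss(0, 1)) for zi in z]
    for mechanism in ("MCAR", "MAR", "MNAR"):
        masked = impose_missing(x, z, mechanism, 0.25, rng)
        observed = [v for v in masked if v is not None]
        print(mechanism, 1 - len(observed) / len(x), statistics.fmean(observed))
```

Under this sketch's MNAR setting, larger values of `x` are more likely to be removed, so the observed values alone no longer represent the full distribution. This is the situation in which the abstract reports biased estimates for all MI approaches.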
Results: Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
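The contrast between SI and MI in the Results can be made concrete with a small sketch of how estimates from several imputed datasets are usually combined. The abstract does not state the pooling rule used; the sketch assumes Rubin's rules, the standard choice for MI. The function name `pool_rubin` and the input numbers are hypothetical and are not results from the study.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputations")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


if __name__ == "__main__":
    # Illustrative numbers only: three imputations of one log hazard ratio.
    estimate, se = pool_rubin([0.5, 0.6, 0.7], [0.04, 0.04, 0.04])
    print(estimate, se)
```

A single imputed dataset supplies only the within-imputation variance, so the between-imputation term is omitted and the standard error is too small. This is consistent with the poor coverage the abstract reports for SI.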
Similar articles
- Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study. BMC Med Res Methodol. 2010 Dec 31;10:112. doi: 10.1186/1471-2288-10-112. PMID: 21194416. Free PMC article.
- Approaches for missing covariate data in logistic regression with MNAR sensitivity analyses. Biom J. 2020 Jul;62(4):1025-1037. doi: 10.1002/bimj.201900117. Epub 2020 Jan 20. PMID: 31957905.
- Accounting for bias due to outcome data missing not at random: comparison and illustration of two approaches to probabilistic bias analysis: a simulation study. BMC Med Res Methodol. 2024 Nov 13;24(1):278. doi: 10.1186/s12874-024-02382-4. PMID: 39538117. Free PMC article.
- Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006 Oct;59(10):1087-91. doi: 10.1016/j.jclinepi.2006.01.014. Epub 2006 Jul 11. PMID: 16980149. Review.
- Predictors of clinical outcome in pediatric oligodendroglioma: meta-analysis of individual patient data and multiple imputation. J Neurosurg Pediatr. 2018 Feb;21(2):153-163. doi: 10.3171/2017.7.PEDS17133. Epub 2017 Dec 1. PMID: 29192869. Review.
Cited by
- Is using multiple imputation better than complete case analysis for estimating a prevalence (risk) difference in randomized controlled trials when binary outcome observations are missing? Trials. 2016 Jul 22;17:341. doi: 10.1186/s13063-016-1473-3. PMID: 27450066. Free PMC article.
- Multidimensional individualized nutritional therapy for individuals with severe chronic obstructive pulmonary disease: study protocol for a registry-based randomized controlled trial. Trials. 2023 Feb 6;24(1):86. doi: 10.1186/s13063-023-07099-1. PMID: 36747276. Free PMC article.
- Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019 Aug 1;48(4):1294-1304. doi: 10.1093/ije/dyz032. PMID: 30879056. Free PMC article.
- The relationship between wasting and stunting in Cambodian children: Secondary analysis of longitudinal data of children below 24 months of age followed up until the age of 59 months. PLoS One. 2021 Nov 18;16(11):e0259765. doi: 10.1371/journal.pone.0259765. eCollection 2021. PMID: 34794170. Free PMC article.
- Imputation of missing values of tumour stage in population-based cancer registration. BMC Med Res Methodol. 2011 Sep 19;11:129. doi: 10.1186/1471-2288-11-129. PMID: 21929796. Free PMC article.