
ANALYZING THE EFFECT OF DATA IMPUTATION TECHNIQUES ON CLINICAL PREDICTION MODELING

Roma Chaurasia

(05 – 2026)

DOI: 10.5281/zenodo.20233219

 

In the rapidly evolving landscape of healthcare AI, clinical prediction models hold immense promise for early disease detection, patient triaging, and personalized treatment. However, the real-world clinical datasets powering these models are notoriously imperfect, frequently plagued by missing values due to irregular patient monitoring, disjointed electronic health records (EHR), or human error. How we handle these data gaps can ultimately make or break a model’s clinical viability. This work systematically examines the effect that several data imputation procedures have on the efficiency, equity, and validity of predictive models. We examine a broad range of missing data handling techniques, ranging from simple conventional approaches (such as mean/median imputation), through established statistical procedures like multiple imputation by chained equations (MICE), to more sophisticated algorithms such as k-nearest neighbors (kNN) and deep learning-based imputation. To do this, we use a set of clinical datasets with several missing data patterns: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We then evaluate the performance of the predictive models developed on the imputed data. Our study shows that imputation is not merely a preprocessing step but an important design decision that strongly affects the model's sensitivity, the presence or absence of bias, and ultimately the fairness of the algorithm's predictions. Finally, our work provides researchers and data scientists with guidance for choosing the imputation technique most appropriate for their clinical use case.
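To make the three missingness patterns concrete, the following is a minimal sketch, not the study's actual pipeline. It assumes a purely synthetic cohort with a hypothetical `age` variable and a hypothetical `lab` value correlated with age, removes `lab` values under MCAR, MAR, and MNAR, applies mean imputation, and reports how far the imputed mean drifts from the true mean. All names, coefficients, and missingness rates are illustrative assumptions.

```python
import math
import random
import statistics


def simulate_cohort(n=5000, seed=0):
    """Synthetic (age, lab) pairs; lab is correlated with age (assumed model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(60, 10)
        lab = 1.0 + 0.05 * (age - 60) + rng.gauss(0, 0.5)
        rows.append((age, lab))
    return rows


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def apply_missingness(rows, mechanism, seed=1):
    """Replace some lab values with None according to the chosen mechanism."""
    rng = random.Random(seed)
    masked = []
    for age, lab in rows:
        if mechanism == "MCAR":
            # Missingness is unrelated to any variable.
            p_missing = 0.3
        elif mechanism == "MAR":
            # Missingness depends only on the fully observed age.
            p_missing = _sigmoid((age - 60) / 5 - 0.85)
        elif mechanism == "MNAR":
            # Missingness depends on the unobserved lab value itself.
            p_missing = _sigmoid((lab - 1.0) / 0.5 - 0.85)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append((age, None if rng.random() < p_missing else lab))
    return masked


def mean_impute(rows):
    """Fill missing lab values with the mean of the observed lab values."""
    observed = [lab for _, lab in rows if lab is not None]
    fill = statistics.fmean(observed)
    return [(age, fill if lab is None else lab) for age, lab in rows]


def mean_imputation_bias():
    """Return imputed-mean minus true-mean of lab for each mechanism."""
    rows = simulate_cohort()
    true_mean = statistics.fmean([lab for _, lab in rows])
    bias = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        imputed = mean_impute(apply_missingness(rows, mechanism))
        bias[mechanism] = statistics.fmean([lab for _, lab in imputed]) - true_mean
    return bias


if __name__ == "__main__":
    for mechanism, value in mean_imputation_bias().items():
        print(f"{mechanism}: bias of imputed mean = {value:+.3f}")
```

In this toy setup, mean imputation leaves the estimated mean roughly unbiased under MCAR but shifts it downward under MAR and MNAR, because the higher values are the ones more likely to be missing. Methods such as MICE or kNN can use observed covariates like age to reduce the MAR shift, whereas MNAR generally requires additional assumptions about the missingness process.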

 

 
