Evaluation of MICE-based machine learning models for reconstructing missing climate data in the Urmia Lake basin

Document Type: Research/Original/Regular Article

Authors

Department of Water Science and Engineering, College of Agriculture, Isfahan University of Technology, Isfahan, Iran

Abstract

Extended Abstract
Introduction
Complete and continuous climatic datasets are fundamental for reliable analyses in hydrology, climate change assessment, water resources management, and environmental modeling. However, observational climate records frequently suffer from missing values due to instrument malfunction, station relocation, data transmission errors, or long-term interruptions in measurements. If not appropriately addressed, missing data can introduce bias, reduce statistical power, and compromise the reliability of subsequent modeling and decision-making processes. This challenge is particularly critical in regions with complex climatic variability and environmental sensitivity, such as the Lake Urmia Basin in northwestern Iran. Traditional approaches for handling missing climatic data, including listwise deletion or simple statistical substitution (e.g., mean or median imputation), are computationally convenient but often distort the statistical structure of the data and fail to capture inter-variable dependencies. In response to these limitations, advanced multivariate and machine-learning-based imputation methods have gained increasing attention. Among them, Multiple Imputation by Chained Equations (MICE) has emerged as a robust framework that accounts for uncertainty and exploits relationships among multiple variables.
Recent studies suggest that integrating MICE with machine learning algorithms can further enhance imputation accuracy, particularly for non-linear and highly interdependent climatic variables. Nevertheless, comprehensive evaluations comparing different MICE-based hybrid models across multiple climatic variables and stations remain limited. Therefore, this study systematically assesses and compares the performance of standard MICE and four hybrid approaches, namely MICE-Linear Regression (MICE–LR), MICE-Decision Tree (MICE–DT), MICE-K-Nearest Neighbor (MICE–KNN), and MICE-Support Vector Machine (MICE–SVM), across a wide range of climatic variables and meteorological stations within the Lake Urmia Basin.
Materials and Methods
This study used daily climatic data from six synoptic meteorological stations located in the Lake Urmia Basin. The dataset includes a diverse set of climatic variables representing thermal conditions, atmospheric moisture, cloudiness, wind characteristics, radiation and energy balance, and sea-level pressure. To ensure consistency and robustness, all variables were preprocessed through quality control procedures, including outlier detection and temporal consistency checks. Missing data were reconstructed using five imputation approaches: standard MICE and four hybrid MICE-based models (MICE–LR, MICE–DT, MICE–KNN, and MICE–SVM). The imputation procedure was implemented iteratively within the chained equations framework to ensure convergence and stability of the reconstructed values.
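The iterative chained-equations procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`chained_imputation`, `fit_predict_linear`, `fit_predict_knn`), the synthetic four-variable "station", the 10% missing rate, and the fixed iteration count are all assumptions made for illustration. The sketch also produces a single deterministic imputation, whereas MICE proper draws several imputed datasets. The point is only to show how a different learner can be plugged into the same chained loop.

```python
import numpy as np


def fit_predict_linear(X_train, y_train, X_pred):
    """Ordinary least squares with an intercept (stand-in for the LR learner)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_pred)), X_pred]) @ coef


def fit_predict_knn(X_train, y_train, X_pred, k=5):
    """Mean target of the k nearest training rows (stand-in for the KNN learner)."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    Z_train, Z_pred = (X_train - mu) / sd, (X_pred - mu) / sd
    dist = ((Z_pred[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def chained_imputation(X, fit_predict, n_iter=10):
    """Fill NaNs column by column, regressing each variable on all the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Initial guess: column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(filled, j, axis=1)
            # Train on rows where variable j was observed, predict the gaps.
            filled[rows, j] = fit_predict(others[~rows], filled[~rows, j], others[rows])
    return filled


# Hypothetical daily series: mean, max, min temperature and relative humidity.
rng = np.random.default_rng(0)
days = np.arange(300)
t_mean = 12 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, days.size)
X_true = np.column_stack([
    t_mean,
    t_mean + 6 + rng.normal(0, 0.8, days.size),
    t_mean - 6 + rng.normal(0, 0.8, days.size),
    60 - 1.5 * (t_mean - 12) + rng.normal(0, 3, days.size),
])
mask = rng.random(X_true.shape) < 0.10  # artificially remove about 10% of cells
X_missing = np.where(mask, np.nan, X_true)


def rmse_on_gaps(X_filled):
    """Error on the artificially removed cells only."""
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))


filled_lr = chained_imputation(X_missing, fit_predict_linear)
filled_knn = chained_imputation(X_missing, fit_predict_knn)
filled_mean = np.where(mask, np.nanmean(X_missing, axis=0), X_true)  # baseline
```

Calling `rmse_on_gaps` on `filled_lr`, `filled_knn`, and `filled_mean` compares the two learners against simple mean substitution. Off-the-shelf alternatives exist, for example scikit-learn's `IterativeImputer`, which accepts an arbitrary regressor through its `estimator` argument.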
Model performance was evaluated using multiple complementary statistical metrics, including the coefficient of determination (R²), normalized root mean square error (NRMSE), Kling–Gupta Efficiency (KGE), and percent bias (PBIAS). These metrics collectively assess accuracy, variability representation, correlation structure, and systematic bias. In addition to predictive performance, computational efficiency was assessed by measuring the average execution time of each model. The evaluation framework was designed to enable comparisons from three perspectives: climate-variable-based, model-based, and station-based analyses.
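As a hedged illustration of the four metrics and the timing measurement, the sketch below uses common textbook definitions. The abstract does not state which variants were used, so several choices here are assumptions: R² is taken as the squared Pearson correlation, NRMSE is normalized by the observed range, KGE follows the 2009 formulation with correlation, variability ratio, and mean ratio, and PBIAS is positive when the reconstruction overestimates. The helper names `evaluation_metrics` and `mean_runtime` are hypothetical.

```python
import time

import numpy as np


def evaluation_metrics(obs, sim):
    """Return R2, NRMSE, KGE and PBIAS for observed vs. reconstructed values."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # mean (bias) ratio
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    return {
        "R2": r ** 2,
        # Assumption: normalization by the observed range.
        "NRMSE": rmse / (obs.max() - obs.min()),
        "KGE": 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2),
        # Assumption: positive values indicate overestimation.
        "PBIAS": 100 * np.sum(sim - obs) / np.sum(obs),
    }


def mean_runtime(func, *args, repeats=3):
    """Average wall-clock time in seconds of func(*args) over several runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / repeats


# Hypothetical withheld observations and their reconstructed counterparts.
example = evaluation_metrics([10, 12, 15, 11, 14], [11, 12, 14, 12, 15])
```

In an evaluation like the one described, `obs` would hold values that were deliberately withheld and `sim` the imputed values at the same positions. A perfect reconstruction gives R² = 1, NRMSE = 0, KGE = 1, and PBIAS = 0.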
Results and Discussion
The comparative analysis revealed substantial differences in imputation performance among the evaluated models, depending on the type of climatic variable and station characteristics. Overall, hybrid MICE-based models outperformed the standard MICE approach, particularly for temperature-related variables and atmospheric moisture parameters. Among the hybrid models, MICE–DT achieved comparatively higher KGE values for several variables, highlighting its ability to model non-linear interactions. Moreover, both MICE–DT and MICE–LR provided a more balanced trade-off between reconstruction accuracy and computational efficiency.
In contrast, MICE–KNN and MICE–SVM exhibited variable performance, with notable sensitivity to station-specific conditions and variable type. While these models performed reasonably well for certain variables, their performance deteriorated for others, especially in cases involving higher variability or weaker spatial coherence. Standard MICE and MICE–LR showed comparable results, suggesting that linear assumptions may be insufficient for fully representing the dynamics of complex climatic systems.
The station-based analysis highlighted spatial heterogeneity in model performance, emphasizing the influence of local climatic and topographic conditions. Furthermore, the computational analysis indicated that while hybrid models generally required longer execution times than standard MICE, MICE–DT provided a favorable balance between accuracy and computational efficiency. These findings underscore the importance of selecting imputation methods based on both data characteristics and practical constraints.
Conclusion
This study provides a comprehensive evaluation of standard and hybrid MICE-based imputation methods for reconstructing missing climatic data in a multi-variable and multi-station framework. The results demonstrate that incorporating machine learning algorithms within the MICE framework substantially improves reconstruction accuracy, particularly for variables characterized by non-linear behavior. Among the evaluated models, MICE–DT emerged as the most robust and efficient approach, offering consistently high performance across different climatic variables and stations. Despite these strengths, certain limitations were identified. The performance of some hybrid models, particularly MICE–KNN and MICE–SVM, showed sensitivity to station-specific conditions and increased computational demand, which may limit their applicability in large-scale studies. These findings suggest that no single imputation method is universally optimal, and model selection should be tailored to the characteristics of the dataset and research objectives. From a practical perspective, the proposed framework provides valuable guidance for researchers and practitioners seeking reliable methods for handling missing climatic data. The results have direct implications for hydrological modeling, climate trend analysis, and environmental impact assessments in data-scarce regions. Future research should explore the integration of deep learning approaches within the MICE framework and assess model performance under varying missing-data scenarios and spatial scales.

Keywords

Main Subjects



Articles in Press, Accepted Manuscript
Available Online from 10 March 2026
  • Receive Date: 26 January 2026
  • Revise Date: 10 March 2026
  • Accept Date: 10 March 2026