Earth Systems and Environment, 01/01/2025

Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions

Sirimon Pinthong, Nureehan Salaeh, Quoc Bao Pham, Warit Wipulanusat, Uruya Weesakul, Nukul Suksuwan, Van Nam Thai, Shuraik Kader, Aqil Tariq, Pakorn Ditthakit

Abstract

Incomplete rainfall records are common in many regions, which makes reliable data imputation a substantial challenge for hydrologists. This study presents a novel systematic framework for optimally imputing daily rainfall by integrating ERA5-Land data with observational data using tree-based machine learning algorithms. Applied to Thailand’s southern basin (TSB), the framework comprises five main steps: data collection, regionalization (clustering and homogeneity analysis), feature selection, model development (hyperparameter optimization, model training, and testing), and performance comparison. Regionalization, used as a preliminary feature selection step, enhanced data homogeneity and identified three clusters for the TSB dataset, as verified by the Fligner–Killeen and Brown–Forsythe tests. ERA5-Land significantly overestimates precipitation, particularly during high-rainfall periods, but quantile transformation (QT) effectively corrects these biases, aligning ERA5-Land distributions with observations and improving accuracy, especially at lower quantiles. In the feature selection comparison, the genetic algorithm (GA) retained more features, whereas BorutaShap identified the critical features, reducing redundancy and achieving slightly better performance, particularly with random forest (RF). Hyperparameter tuning showed that simpler models such as RF and extra trees (ET) performed well even with default settings, whereas extreme gradient boosting (XGBoost) required precise tuning to maximize performance. In the model performance evaluation, QT-corrected ERA5-Land data significantly improved imputation accuracy, with ET outperforming RF and XGBoost even under high levels of missing data. This study highlights the critical role of integrating bias-corrected datasets and advanced machine learning (ML) models for rainfall imputation in data-scarce regions. The proposed framework offers a scalable and reproducible methodology that can be adapted to other areas facing similar challenges, providing the global scientific community with a practical solution for enhancing hydrological data reliability and improving water resource management strategies.
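The bias-correction step mentioned in the abstract can be illustrated with a minimal sketch of empirical quantile mapping, one common way to align a reanalysis distribution with gauge observations. The abstract does not specify how QT was implemented, so this sketch is an assumption and not the authors' method; the function names `quantile_nodes` and `quantile_map`, the `n_quantiles` parameter, and the rainfall values are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def quantile_nodes(values, n_quantiles):
    """Return n_quantiles evenly spaced, linearly interpolated quantiles."""
    if n_quantiles < 2 or not values:
        raise ValueError("need at least one value and two quantile nodes")
    ordered = sorted(values)
    last = len(ordered) - 1
    nodes = []
    for k in range(n_quantiles):
        position = k * last / (n_quantiles - 1)
        lower = int(position)
        upper = min(lower + 1, last)
        weight = position - lower
        nodes.append(ordered[lower] * (1 - weight) + ordered[upper] * weight)
    return nodes


def quantile_map(reanalysis_train, observed_train, reanalysis_new, n_quantiles=100):
    """Map new reanalysis values onto the observed distribution.

    Each new value is located between two reanalysis quantile nodes and
    replaced by the value at the same relative position between the
    matching observed quantile nodes. Values outside the training range
    are clamped to the lowest or highest observed node.
    """
    src = quantile_nodes(reanalysis_train, n_quantiles)
    dst = quantile_nodes(observed_train, n_quantiles)
    corrected = []
    for x in reanalysis_new:
        if x <= src[0]:
            corrected.append(dst[0])
        elif x >= src[-1]:
            corrected.append(dst[-1])
        else:
            i = bisect_right(src, x)  # src[i - 1] <= x < src[i]
            x0, x1 = src[i - 1], src[i]
            frac = (x - x0) / (x1 - x0)
            corrected.append(dst[i - 1] + frac * (dst[i] - dst[i - 1]))
    return corrected


# Made-up daily rainfall (mm): the reanalysis series is twice the gauge series,
# mimicking an overestimating product. These are NOT data from the study.
observed = [0.0, 0.0, 1.0, 2.0, 5.0, 10.0, 20.0, 40.0]
reanalysis = [0.0, 0.0, 2.0, 4.0, 10.0, 20.0, 40.0, 80.0]
corrected = quantile_map(reanalysis, observed, [0.0, 4.0, 7.0, 80.0], n_quantiles=8)
print(corrected)
```

In this toy case the mapping halves the overestimated values, because each reanalysis quantile is exactly twice the matching observed quantile. With real data the correction would vary across the distribution, and the corrected series could then be passed as a predictor to a tree-based imputation model.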

Document Type

Article

Source Type

Journal

Keywords

BorutaShap; ERA5-Land; Genetic algorithm optimization; Quantile transformation; Rainfall imputation; Regionalization; Tree-based machine learning

ASJC Subject Area

Earth and Planetary Sciences : Geology; Earth and Planetary Sciences : Economic Geology; Earth and Planetary Sciences : Computers in Earth Sciences; Environmental Science : Global and Planetary Change; Environmental Science : Environmental Science (miscellaneous)

Funding Agency

Walailak University


Bibliography


Pinthong, S., Salaeh, N., Pham, Q., Wipulanusat, W., Weesakul, U., Suksuwan, N., Nam Thai, V., ... Ditthakit, P. (2025). Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions. Earth Systems and Environment. doi:10.1007/s41748-025-00787-9
