Comparing the performance of the multiple linear regression classic method and modern data mining methods in annual rainfall modeling (Case study: Ahvaz city)

Document Type : Case-study Article

Authors

1 M.Sc. Student/ Department of Water Engineering, Faculty of Agriculture, Tabriz University, Tabriz, Iran

2 Associate Professor/ Department of Water Engineering, Faculty of Agriculture, Tabriz University, Tabriz, Iran

Abstract

Introduction
Prediction of hydrological variables, especially precipitation, is very important in the management and planning of water resources, so accurate estimation methods have always been of interest to researchers. Furthermore, given the water crisis in different regions, different methods are needed to predict rainfall and the resulting runoff so that comprehensive and appropriate management can be applied to water distribution. Researchers have long developed and used various methods to predict hydrological variables. Classical methods such as multiple linear regression have been among the most important and widely used of these, particularly for precipitation, and have produced good results. Recently, data mining methods have been developed for this purpose. This research compared the performance of classic multiple linear regression with that of modern data mining methods in modeling the annual rainfall of Ahvaz city and identified the best-performing model.
 
Materials and Methods
In this study, the annual rainfall of Ahvaz city was investigated and modeled. Meteorological data from the Ahvaz station were collected over a period of 30 years (1992-2021). Data validation tests, including tests of homogeneity, normality, trend, and outliers, were performed. The annual rainfall of Ahvaz city was modeled with Multiple Linear Regression (MLR), Principal Component Analysis (PCA), Gene Expression Programming (GEP), and Support Vector Machine (SVM). Finally, the accuracy and performance of the models were compared using the coefficient of determination (R2), Root Mean Square Error (RMSE), Nash-Sutcliffe Efficiency (NSE), and Willmott index (WI).
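The four evaluation criteria named above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes R2 is the squared Pearson correlation between observed and predicted values and that WI is Willmott's original index of agreement. The function names and the sample values are hypothetical and are not Ahvaz data.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _sse(obs, pred):
    # Sum of squared errors between observed and predicted values.
    return sum((o - p) ** 2 for o, p in zip(obs, pred))


def rmse(obs, pred):
    # Root mean square error, in the same units as the data.
    return math.sqrt(_sse(obs, pred) / len(obs))


def r_squared(obs, pred):
    # Assumption: R2 taken as the squared Pearson correlation coefficient.
    mo, mp = _mean(obs), _mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def nse(obs, pred):
    # Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better
    # than always predicting the observed mean.
    mo = _mean(obs)
    return 1 - _sse(obs, pred) / sum((o - mo) ** 2 for o in obs)


def willmott_index(obs, pred):
    # Willmott index of agreement (assumed original form), between 0 and 1.
    mo = _mean(obs)
    potential = sum((abs(p - mo) + abs(o - mo)) ** 2 for o, p in zip(obs, pred))
    return 1 - _sse(obs, pred) / potential


if __name__ == "__main__":
    # Hypothetical annual rainfall totals, for illustration only.
    observed = [180.0, 240.0, 150.0, 310.0, 205.0]
    predicted = [170.0, 255.0, 165.0, 290.0, 210.0]
    for name, fn in [("R2", r_squared), ("RMSE", rmse),
                     ("NSE", nse), ("WI", willmott_index)]:
        print(name, round(fn(observed, predicted), 3))
```

Higher R2, NSE, and WI and lower RMSE indicate a better model, which is how the criteria are used to rank the four methods.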
 
Results and Discussion
In this study, XLSTAT software was used to model rainfall with multiple linear regression. To simulate precipitation with the SVM model, several kernel functions can be examined; the linear kernel and polynomial kernels of the second and third degree, which are commonly used in hydrology, were selected, and the optimal results for each kernel type were obtained by trial and error. According to these results, the support vector machine model with the third-degree polynomial kernel was determined to be the optimal SVM model for precipitation modeling. Because gene expression programming can select the more effective variables and eliminate those with less influence, all eight input factors were used in this study to determine the meaningful variables. For further investigation, in addition to the program's default set of mathematical operators (F1), configurations based on the four basic operators (F2) and the operator sets F3 and F4 were used.
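For readers unfamiliar with the kernel types mentioned above, the following Python sketch shows the linear kernel and the polynomial kernels of the second and third degree in their common form, K(x, z) = (gamma * <x, z> + coef0)^degree. The parameter names gamma and coef0, their default values, and the sample vectors are assumptions made for illustration; the abstract does not report the kernel parameters actually used.

```python
def dot(x, z):
    # Inner product of two equal-length feature vectors.
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree, gamma=1.0, coef0=1.0):
    # gamma and coef0 are hypothetical defaults, normally tuned by trial and error.
    return (gamma * dot(x, z) + coef0) ** degree


# The three candidate kernels compared in the text.
CANDIDATE_KERNELS = {
    "linear": linear_kernel,
    "polynomial (degree 2)": lambda x, z: polynomial_kernel(x, z, 2),
    "polynomial (degree 3)": lambda x, z: polynomial_kernel(x, z, 3),
}


def gram_matrix(kernel, samples):
    # Pairwise kernel values; an SVM is trained on this matrix.
    return [[kernel(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    # Hypothetical standardized predictor vectors, for illustration only.
    samples = [[0.2, -1.0], [1.1, 0.4], [-0.7, 0.9]]
    for name, kernel in CANDIDATE_KERNELS.items():
        print(name, gram_matrix(kernel, samples))
```

In a trial-and-error search, a model would be fitted with each candidate kernel and the one with the best evaluation scores retained.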
The validation tests for homogeneity, trend, normality, and outliers showed that the recorded data were of good quality and could be used with a high degree of confidence in the rest of the study. The comparison of the models showed that PCA and GEP, both with R2=0.85, NSE=0.85, and WI=0.96 and with very similar RMSE values of 35.49 and 35.70, respectively (a difference of about 0.6%), predicted the annual rainfall of Ahvaz with better performance and higher accuracy than the other models. Considering the water crisis in different regions of the country, especially in Ahvaz, it is suggested that the methods introduced in this research be used to predict rainfall and the resulting runoff, so that comprehensive and appropriate management can be applied to water distribution.
 
Conclusion
In this research, classical statistical methods were compared with some modern data mining methods in forecasting the annual rainfall of Ahvaz city. The hydrological data of the Ahvaz synoptic meteorological station were collected over a period of 30 years (1992-2021, i.e., 1371-1400 in the Solar Hijri calendar), and the data were first verified using homogeneity, trend, normality, and outlier tests. The results showed that the recorded data were of good quality and could be used with a high degree of confidence. Multiple linear regression (MLR), principal component analysis (PCA), gene expression programming (GEP), and support vector machine (SVM) methods were used to model precipitation. The model results were compared using the coefficient of determination (R2), root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and Willmott index (WI). The results showed that principal component analysis and gene expression programming, with R2 equal to 0.85, NSE equal to 0.85, WI equal to 0.96, and very similar RMSE values of 35.49 and 35.70, respectively, performed better and were more accurate than the other models.
Based on the results of this research, it is suggested that modern data mining methods be used alongside classical statistical methods in future research. Attention should also be paid to choosing the optimal functions and parameters of the models in order to achieve the best results. Considering the water crisis in different parts of the country, especially in Ahvaz, it is suggested that the methods introduced in this research be used to predict rainfall and the resulting runoff, so that comprehensive and appropriate management can be applied to water distribution.

Keywords

Main Subjects

