Evaluating the efficiency of dimensionality reduction methods in improving the accuracy of water quality index modeling in Qizil-Uzen River using machine learning algorithms

Document Type : Research/Original/Regular Article

Authors

1 Dr Mohammad Taghi Sattari, Associate Professor, Department of Water Engineering, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

2 PhD student, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran

3 Masters student, Department of Water Engineering, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

Abstract

Introduction
Water quality assessment is paramount for various sectors, including environmental planning, public health, and industrial operations. Knowledge of water quality is one of the most important needs in planning, developing, and protecting water resources, and it determines whether water is suitable for uses such as drinking and irrigation. With the increasing importance of ensuring safe water sources, modern data mining methods offer valuable tools for predicting and classifying water quality. In the current study, the water quality of the Qizil-Uzen River was evaluated at the Qara Gunei station. To this end, the drinking water quality index (WQI) was estimated from twelve chemical parameters, namely total hardness (TH), pH, electrical conductivity (EC), total dissolved solids (TDS), calcium, sodium, magnesium, potassium, chloride, carbonate, bicarbonate, and sulfate, over a 21-year period (2000-2020).
 
Materials and Methods
The water quality index at the Qara Gunei station on the Qizil-Uzen River was estimated from the TH, pH, EC, TDS, Ca, Na, Mg, K, Cl, CO3, HCO3, and SO4 parameters recorded over a 21-year period (1378-1398 in the Solar Hijri calendar, i.e. 2000-2020). Because of the relatively large number of variables, principal component analysis (PCA) and independent component analysis (ICA) were used to reduce the dimensionality of the input data, and the WQI was then modeled with three machine learning algorithms: decision tree, logistic regression, and a multi-layer perceptron artificial neural network. With these methods, the number of parameters needed to calculate the quality index was reduced from 12 to 2. Reducing the dimensions of the data saves time in sampling, monitoring the samples, and determining water quality, and significantly reduces the cost of modeling. Of the two dimensionality reduction methods, PCA performed better than ICA. Modeling was carried out in a Python programming environment, with 75% of the available samples used for training and 25% for testing.
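The reduction from 12 variables to 2 components and the 75/25 split described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes NumPy, uses synthetic data in place of the station records, and the helper names pca_reduce and split_75_25 are hypothetical. Only PCA is shown; an ICA variant would replace the projection step.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Standardize the columns of X and project onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

def split_75_25(X, y, seed=0):
    """Shuffle the samples and keep 75% for training, 25% for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(round(0.75 * len(X)))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

# Synthetic stand-in for the 12 measured parameters (assumption: NOT the real data).
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 12))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 12))
y = latent @ np.array([3.0, -1.5]) + 50.0  # hypothetical target standing in for the WQI

scores, explained = pca_reduce(X, n_components=2)
X_train, X_test, y_train, y_test = split_75_25(scores, y)
print(scores.shape, X_train.shape, X_test.shape)
```

For simplicity the sketch fits the projection on all samples before splitting. A stricter setup would fit it on the training portion only and apply the same transform to the test portion, so that no test information reaches the model.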
 
Results and Discussion
In the present research, the water quality index was modeled in two stages. In the first stage, the dimensionality reduction methods PCA and ICA were used to reduce the time and cost of implementation. In the second stage, the machine learning methods decision tree, logistic regression, and multilayer perceptron were applied. Tripathi and Singal (2019) used principal component analysis to reduce the number of parameters needed to calculate the water quality index from 28 to 9. In the present study, both PCA and ICA reduced the problem from 12 dimensions to 2. The results show that PCA improves performance at little cost and with high accuracy, because it helps reduce noise in the data, supports feature selection, and generates uncorrelated features. The models were compared using numerical and graphical evaluation criteria: R2, RMSE, and the modified Willmott coefficient as numerical criteria, and the Taylor diagram as a graphical criterion. The results show that the multi-layer perceptron, decision tree, and logistic regression methods model the water quality index accurately. In this research, for the first time, the ICA dimension reduction algorithm was used to predict the water quality index with an accuracy of over 90% while reducing the dimensions of the problem.
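The numerical criteria named above can be sketched in a few lines of standard-library Python. Two assumptions are made here, since the abstract does not give formulas: R2 is taken as the coefficient of determination, and the Willmott coefficient is taken as the refined index of agreement from the cited Willmott et al. (2012) with scaling constant c = 2. The sample values are made up for illustration only.

```python
import math

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition of R2)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def willmott_refined(obs, pred, c=2.0):
    """Refined index of agreement d_r (Willmott et al., 2012), bounded in [-1, 1]."""
    mean_obs = sum(obs) / len(obs)
    abs_err = sum(abs(p - o) for o, p in zip(obs, pred))
    scaled_dev = c * sum(abs(o - mean_obs) for o in obs)
    if abs_err <= scaled_dev:
        return 1.0 - abs_err / scaled_dev
    return scaled_dev / abs_err - 1.0

# Hypothetical observed and predicted WQI values (illustration only).
observed = [52.0, 61.0, 47.0, 70.0, 58.0]
predicted = [50.0, 63.0, 49.0, 68.0, 57.0]

print(rmse(observed, predicted))
print(r_squared(observed, predicted))
print(willmott_refined(observed, predicted))
```

Lower RMSE and values of R2 and d_r closer to 1 indicate better agreement between the modeled and observed index.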
 
Conclusion
Water quality index modeling holds significant relevance in agricultural practices, where access to clean water is crucial for irrigation and crop growth. Surprisingly, only a limited number of studies have explored variable reduction methods in water quality index modeling, with none incorporating the relatively novel Independent Component Analysis (ICA) method for dimensionality reduction. Thus, the current research fills this gap by employing PCA and ICA techniques to reduce the dimensionality of large datasets in water quality index modeling. By utilizing these advanced methods, the study aims to enhance efficiency and accuracy in assessing water quality, thereby offering valuable insights for agricultural water management. Following dimensionality reduction, the dataset is then subjected to modeling using various machine learning algorithms. This approach not only optimizes computational resources but also facilitates a deeper understanding of the complex interrelationships among water quality parameters. Through this pioneering research endeavor, the efficacy of ICA alongside PCA in addressing water quality index modeling challenges is evaluated. By integrating these techniques with machine learning methodologies, the study endeavors to provide actionable intelligence for agricultural stakeholders, aiding in informed decision-making and resource allocation. Moreover, by venturing into unexplored territory with the inclusion of ICA, the research contributes to expanding the methodological toolkit available for water quality assessment. As agriculture faces increasing pressure from climate change and resource scarcity, such innovative approaches hold promise in ensuring sustainable water management practices.

Keywords

Main Subjects


References
Ajayram, K.A., Jegadeeshwaran, R., Sakthivel, G., Sivakumar, R., Patange, A.D. (2021). Condition monitoring of carbide and non-carbide coated tool insert using decision tree and random tree – A statistical learning. Materials Today: Proceedings, doi:10.1016/j.matpr.2021.02.065.
Al-Mukhtar, M., & Al-Yaseen, F. (2019). Modeling water quality parameters using data-driven models, a case study Abu-Ziriq marsh in south of Iraq. Hydrology, 6(1), 24. doi:10.3390/hydrology6010024
Boyacioglu, H. (2007). Development of a water quality index based on a European classification scheme. Water SA, 33(1). doi:10.4314/wsa.v33i1.47882
Bailey, D., & Solomon, G. (2004). Pollution prevention at ports: clearing the air. Environmental Impact Assessment Review, 24(7-8), 749-774. doi:10.1016/j.eiar.2004.06.005
Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R., Liu, F., Zuo, M., Wang, J., Zhang, Y., Chen, D., Chen, X., Deng, Y., & Ren, H. (2020). Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Research, 171, 115454. doi:10.1016/j.watres.2019.115454
Coutsias, E.A., Seok, C., & Dill, K.A. (2004). Using quaternions to calculate RMSD. Journal of Computational Chemistry, 25(15), 1849-1857. doi:10.1002/jcc.20110
Daffertshofer, A., Lamoth, C.J., Meijer, O.G., & Beek, P.J. (2004). PCA in studying coordination and variability: a tutorial. Clinical Biomechanics, 19(4), 415-428. doi:10.1016/j.clinbiomech.2004.01.005
Denil, M., Matheson, D., de Freitas, N. (2014). Narrowing the gap: Random forests in theory and in practice. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 32(1), 665-673. doi:10.48550/arXiv.1310.1415
Dezfooli, D., Mooghari, S.M.H., Ebrahimi, K., & Araghinejad, S. (2017). Water quality classification based on minimum qualitative parameter (Case Study: Karun River). Journal Of Natural Environment, 70(3), 583-595. https://sid.ir/paper/195087/en. [In Persian]
Gorde, S.P., & Jadhav, M.V. (2013). Assessment of water quality parameters: a review. Journal of Engineering Research and Applications, 3(6), 2029-2035. https://www.ijera.com/papers/Vol3_issue6/LV3620292035.pdf
Hintze, J.L. & Nelson, R.D. (1998). Violin plots: A box plot-density trace synergism. The American Statistician, 52, 181-184. doi: 10.2307/268547
Islam Khan, D.S., Islam, N., Uddin, J., Islam, S., Nasir, M.K. (2021). Water quality prediction and classification based on principal component regression and gradient boosting classifier approach. Journal of King Saud University-Computer and Information Sciences, 34(8), 4773-4781. doi:10.1016/j.jksuci.2021.06.003
Icaga, Y. (2007). Fuzzy evaluation of water quality classification. Ecological Indicators, 7(3), 710-718. doi:10.1016/j.ecolind.2006.08.002
Jie, Z., Xiaoli, L., & Juntao, L. (2016). Fresh food distribution center storage allocation strategy analysis based on optimized entry-item-quantity-ABC. International Journal on Data Science Technology, 36-40. doi:10.11648/j.ijdst.20160203.11
Johnson, O., Akinola, S., Aboyeji, O., Adedeji, A. (2021). Comparison between fuzzy logic and water quality index methods: A case of water quality assessment in Ikare community, Southwestern Nigeria. Environmental Challenges, 3, 1-10. doi:10.1016/j.envc.2021.100038.
Kalmegh, S. (2015). Analysis of WEKA data mining algorithm REPTree, simple cart and randomtree for classification of Indian News. International Journal of Innovative Science, Engineering & Technology, 2, 438-446.
Khoi, D.N., Quan, N.T., Linh, D.Q., Nhi, P.T.T., Thuy, N.T.D. (2022). Using machine learning models for predicting the water quality index in the La Buong River, Vietnam. Water, 14. doi: 10.3390/w14101552.
Khalili, R., Montaseri, H., Motaghi, H., & Jalili, M. B. (2021). Water quality assessment of the Talar River in Mazandaran Province based on a combination of water quality indicators and multivariate modeling. Water and Soil Management and Modelling, 1(4), 30-47. doi: 10.22098/mmws.2021.9322.1033  [In Persian]
La Valley, M.P. (2008). Logistic regression. Circulation, 117(18), 2395-2399. doi:10.1161/CIRCULATIONAHA.106.682658
Massoud, M.A. (2012). Assessment of water quality along a recreational section of the Damour River in Lebanon using the water quality index. Environmental Monitoring and Assessment, 184, 4151-4160. doi:10.1007/s10661-011-2251-zt
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143-195. doi:10.1017/S0962492900002919
Soleimanpour, S.M., Mesbah, S.H., Hedayati, B. (2018). Application of CART decision tree data mining to determine the most effective drinking water quality factors (case study: Kazeroon plain, Fars province). Iranian Journal of Health and Environment, 11(1), 1-14. http://ijhe.tums.ac.ir/article-1-5881-en.html. [In Persian]
Sattari, M.T., Feizi, H., Colac, M., Ozturk, A., Ozturk, F., & Apaydin, H. (2021). Surface water quality classification using data mining approaches: Irrigation along the Aladag River. Irrigation and Drainage, 70(5), 1227-1246. doi:10.1002/ird.2594
Tripathi, M., Singal, S. (2019). Use of principal component analysis for parameter selection for development of a novel water quality index: A case study of river Ganga India. Ecological Indicators, 96, 430-436. doi:10.1016/j.ecolind.2018.09.025.
Taylor, K.E. (2001). Summarizing multiple aspects of model performance in a single diagram. Journal of Geophysical Research: Atmospheres, 106(7), 7183-7192. doi:10.1029/2000JD900719
Willmott, C.J., Robeson, S.M., & Matsuura, K. (2012). A refined index of model performance. International Journal of Climatology, 32(13), 2088-2094. doi:10.1002/joc.2419
World Health Organization. (2010). Hardness in drinking-water: background document for development of WHO guidelines for drinking-water quality (No. WHO/HSE/WSH/10.01/10).
Yusri, H., Ab Rahim, A., Hassan, S., Halim, I., & Abdullah, N. (2022). Water quality classification using SVM and XGBoost method, IEEE 13th Control and System Graduate Research Colloquium (ICSGRC). 231-236. doi: 10.1109/ICSGRC55096.2022.9845143.
Zeinalzadeh, K., & Rezaei, E. (2017). Determining spatial and temporal changes of surface water quality using principal component analysis. Journal of Hydrology: Regional Studies, 13, 1-10. doi:10.1016/j.ejrh.2017.07.002