Evaluating the efficiency of dimensionality reduction methods in improving the accuracy of water quality index modeling using machine learning algorithms

Document Type : Research/Original/Regular Article

Authors

1 Dr Mohammad Taghi Sattari, Associate Professor, Department of Water Engineering, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

2 PhD student, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran

3 Master's student, Department of Water Engineering, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

Abstract

Knowledge of water quality is one of the most important requirements in the planning, development and protection of water resources. Determining the quality of water for different uses, including irrigation and drinking, is important in many areas of life. Modern data mining methods can be beneficial for predicting and classifying water quality. In the current study, the water quality of the Qizil Uzen River was evaluated at the Qere Guney station. The drinking water quality index (WQI) was estimated for a statistical period of 21 years (2000-2020) using the chemical parameters total hardness, pH, electrical conductivity, total dissolved solids, calcium, sodium, magnesium, potassium, chloride, carbonate, bicarbonate and sulfate. Because the number of parameters is relatively large, principal component analysis and independent component analysis were used to reduce the dimensionality, and then different machine learning algorithms, including decision tree, logistic regression and multilayer perceptron artificial neural network, were used to model the water quality index. With these methods, the number of parameters needed to calculate the quality index was reduced from 12 to 2. Reducing the number of parameters saves time in sampling, monitoring and determining water quality, and significantly reduces the costs required for modeling. The results showed that, of the dimensionality reduction methods, principal component analysis performed better than independent component analysis. Among the modeling methods, the multilayer perceptron neural network combined with principal component analysis had the best performance, with a coefficient of determination of 0.99, a root mean square error of 44.79 and a modified Willmott index of 0.99.
Because high-dimensional data make the investigation and modeling of water quality complicated and time-consuming, it is recommended to use dimensionality reduction methods such as principal component analysis to reduce the dimensions of the data.
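
The reduction from 12 parameters to 2 components described above can be sketched as follows. This is only a minimal illustration, assuming standardized inputs and a PCA computed by singular value decomposition with NumPy. The synthetic data, the `pca_reduce` helper and all variable names are hypothetical and are not the study's data or code.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project samples onto the leading principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Standardize each parameter so that units (e.g. mg/L vs. uS/cm) do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    # Share of the total variance carried by each retained component.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Synthetic stand-in for 12 correlated water quality parameters (assumption:
# two latent factors plus a small amount of measurement noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 12))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 12))

scores, explained = pca_reduce(X, n_components=2)
print(scores.shape, explained)
```

The two columns of `scores` would then serve as inputs to a model such as a multilayer perceptron in place of the original 12 parameters. An ICA-based variant would replace the projection step with an independent component estimate.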

In general, rivers are among the most important natural resources. The development of urbanization and the increase in pollution caused by the discharge of urban, industrial and agricultural wastewater and by leachate from landfills have changed and degraded the water quality of rivers. The purpose of this article is to investigate the water quality of the Qizil Uzen River at the Qere Guney station over a statistical period of 21 years (1378-1398 in the Solar Hijri calendar) based on the water quality index, using data on water quality parameters. To reduce the input dimensions of the machine learning models, and with them the time and cost of implementation, two advanced statistical techniques, principal component analysis (PCA) and independent component analysis (ICA), were applied in the first stage and their performance was evaluated. In the second stage, machine learning algorithms including decision tree, logistic regression and multilayer perceptron (MLP) neural network were used to model the water quality index. Tripathi and colleagues used principal component analysis to reduce the number of parameters needed to calculate the quality index from 28 to 9 and calculated the water quality index with these 9 parameters. In the present study, both PCA and ICA reduced the problem from 12 dimensions to 2. The results of the models were compared using numerical and graphical evaluation criteria: R^2, RMSE and the modified Willmott index as numerical criteria and the Taylor diagram as the graphical criterion.
The results show that the PCA method can improve performance at little cost and with high accuracy. This is because PCA can help reduce noise in the data, support feature selection, and generate uncorrelated features from the data, which made it perform better than the ICA method. The results also show that the multilayer perceptron, decision tree and logistic regression methods estimate the water quality index with high accuracy. In this research, for the first time, the water quality index was predicted with an accuracy of over 90% using the ICA dimensionality reduction algorithm while reducing the dimensions of the problem. The main drawback of the methods used in this research is that they cannot be generalized to all regions with different climates. It is therefore suggested that the methods used in the present study be investigated in watersheds with different climates and that the best method be determined for each climate. The best performance was obtained with the PCA algorithm for dimensionality reduction combined with the MLP algorithm for modeling the water quality index, because the standard deviation of the observed data and the standard deviation of the model estimates are close to each other and the results are acceptable. The violin plot also shows that the results of the models are close and acceptable.
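
The three numerical criteria named above can be computed as in the sketch below. It rests on assumptions the text does not confirm: R^2 is taken as 1 minus the ratio of residual to total sum of squares, and the modified Willmott index is taken as the absolute-difference form of the index of agreement. The function names and sample values are hypothetical.

```python
import math

def r_squared(obs, pred):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def modified_willmott(obs, pred):
    """Modified index of agreement (assumed absolute-difference form)."""
    mean_obs = sum(obs) / len(obs)
    num = sum(abs(o - p) for o, p in zip(obs, pred))
    den = sum(abs(p - mean_obs) + abs(o - mean_obs) for o, p in zip(obs, pred))
    return 1.0 - num / den

# Illustrative values only, not results from the study.
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 71.0, 79.0]

print(r_squared(observed, predicted))
print(rmse(observed, predicted))
print(modified_willmott(observed, predicted))
```

Under these definitions a perfect model gives R^2 = 1, RMSE = 0 and a modified Willmott index of 1, so values near 1 for the first and third criteria and a small RMSE indicate close agreement with the observations.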

Articles in Press, Accepted Manuscript
Available Online from 02 April 2023
  • Receive Date: 27 February 2023
  • Revise Date: 02 April 2023
  • Accept Date: 02 April 2023