Introduction
Water is necessary for the Earth to remain green and habitable, and effective management of water resources is essential to every country’s progress. The UN General Assembly recognized the importance of access to clean drinking and recreational water in 2008.1 Consuming water contaminated with bacteria, viruses, and heavy metals may spread a number of illnesses with high mortality and morbidity, including cholera, dysentery, diarrhea, skin cancer, and typhoid. Since these illnesses are reported to make up 80% of all diseases discovered so far, they are an important public health concern.2 In Bihar’s cities and rural areas, groundwater is a vital resource for residential, commercial, and agricultural uses.3 Due to considerable industrial expansion and population growth, the demand for water is increasing daily.4, 5 However, the supply of fresh water, which is under pressure from several human activities, is one of the biggest environmental problems facing any country today. One of the main concerns is groundwater contamination.6 Groundwater quality is significantly affected, particularly in India, by the discharge of untreated sewage, poor solid waste management, widespread use of pesticides and fertilizers in agriculture, overexploitation of groundwater, and changes in land use and land cover.7, 8 In addition to pollution, the geography of the area, soil characteristics, groundwater and rock interactions, and climate may all influence groundwater quality.9 In light of these factors, it is now essential to assess the quality of groundwater and carry out corrective measures. Contamination of drinking water may occur at the source, as a result of geogenic heavy metals and minerals, or during treatment and transportation. Contamination introduced during treatment and transportation may be controlled with careful monitoring and targeted water quality management techniques. However, when naturally occurring heavy metal concentrations exceed allowable limits, pollution becomes a significant problem.
About seventeen of the fifty heavy metals are known to cause cancer, with lead, arsenic, mercury, cadmium, and vanadium being the most hazardous.10 Heavy metals from anthropogenic and natural sources seep into groundwater, causing environmental problems around the globe. One dangerous water contaminant is arsenic (As).11 Because arsenic is a trace element that may be released by geogenic or human activity, it is important to identify and quantify the hazards of exposure through skin contact, inhalation, and ingestion. Arsenic occurs in four oxidation states: arsine (-3), elemental arsenic (0), arsenite (+3), and arsenate (+5). As (III) and As (V) are more commonly found in groundwater and surface water, respectively. The solubility of arsenic depends on the pH and ionic environment.12 In general, As (V) is less hazardous than As (III), and organic forms of arsenic are less dangerous than inorganic ones.13 Arsenic oxides are recognized as potentially more hazardous than other arsenic compounds. Even when ingested in small amounts over long periods of time, many such compounds have a high toxicity index.14 Long-term chronic exposure to inorganic arsenic has been linked to significant non-carcinogenic hazards, including gangrene, keratosis, hyperpigmentation, black foot disease, and vascular disorders, as well as carcinogenic risks such as cancer.15 The International Agency for Research on Cancer (IARC) has designated arsenic as a Group 1 (class I) human carcinogen due to the potentially fatal effects of exposure.16 As per IS 10500:2012, the permissible limit of arsenic in drinking water is 0.01 mg/L.17 Arsenic is therefore regarded worldwide as a dangerous material that poses a number of immediate and long-term health risks. The incidence of diseases like cancer has increased significantly.
Humans may be exposed to arsenic through a number of routes, including food, beverages, and inadvertent ingestion.18 Compared with other exposure routes, such as dermal contact and inhalation, drinking water is considered a significant exposure route for arsenic. Since India’s arsenic-rich groundwater is causing increasing concern for both human health and the environment, it is imperative to evaluate the associated health risks.19
Rapid urbanization and excessive water usage are now occurring in the selected study locations. Understanding the relationships between arsenic levels and other water quality characteristics is thus crucial. Because collecting and quantifying arsenic samples is difficult, several machine learning models were developed, and their effectiveness in predicting arsenic concentrations from the other input characteristics was evaluated.
Material and Methods
Study Area
The Gangetic Plains are among India’s most fertile and densely populated areas. Over 83 million people reside in the Middle and, in part, Upper Ganga Plains of Bihar, which has an area of about 94,163 km². The study was carried out in the Mid-Gangetic Plains and the floodplains of the Sone and Gandak rivers, two tributaries of the Ganga. Samples were drawn at random from tube wells both close to and distant from the Ganga River, at locations on both banks. Samples were taken specifically from Begusarai district, with an emphasis on the Matihani and Cheria Bariarpur blocks. Fig. 1 shows a map of the study area.
Geology and Hydrogeology
The district of Begusarai is located on the northern bank of the Ganga River and is traversed by the Burhi Gandak, Balan, Bainty, Baya, and Chandrabhaga rivers. The district covers 1,918 square kilometers and is home to 2,954,367 people. Rainfall in the area averages 138.4 cm per year. Temperatures may dip as low as 8 °C in winter and rise as high as 40 °C in summer. Groundwater is the primary source of water for agricultural and domestic purposes in the Begusarai area. For this research, the Cheria Bariarpur and Matihani blocks were chosen in order to measure arsenic levels in regions both close to and far from the banks of the Ganga River. Matihani block is situated on the bank of the Ganga River, while Cheria Bariarpur block is close to the Burhi Gandak, Balan, Bainty, Baya, and Chandrabhaga rivers.
Sample collection and analysis
A total of 100 samples were taken from tube wells (depth 60-300 feet) in the Matihani and Cheria Bariarpur blocks of Begusarai district (Figure 1). The sampling protocol outlined in [20] was followed. Prior to sampling, stagnant water was purged from each tube well. Groundwater samples were collected in pre-cleaned polyethylene bottles that had been cleaned with a 10% nitric acid solution and then rinsed with fresh water and demineralized water. Immediately before collection, the sampling bottles were rinsed with the sample water. The pH of each sample was measured on-site with a EUTECH PCSTestr 35 multi-parameter tester. For heavy metal determination, samples from each tube well were acidified with concentrated HNO3 to bring the pH down to 2. After collection, the samples were placed in an ice box and brought to the laboratory, where they were stored at 4 °C. Prior to laboratory analysis, samples were filtered through a 0.45-µm pore-size membrane. An Agilent 5110 inductively coupled plasma optical emission spectrometer (ICP-OES) was used to measure the concentrations of heavy metals (iron and arsenic).
Model Development
Scaling or Normalization
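Machine learning algorithms often perform poorly when the input variables are on different scales.21 Consequently, the input variables were scaled to the range 0 to 1 using Eqn. 1, written here in the standard min-max form:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x$ is the measured value of an input variable and $x_{\min}$ and $x_{\max}$ are its minimum and maximum values.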
Seventy percent of the dataset was used to train the machine learning models, and the remaining thirty percent was set aside for testing so that the models could be assessed adequately. The measured water quality parameters other than arsenic concentration were specified as input variables, and the predicted arsenic values were tested against the observed values. Python was used to implement the models. The theoretical background and results of the various machine learning models are presented and discussed below.
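The scaling and data split described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that the scaling is the usual min-max form, that the scaling constants are taken from the training portion only, and that the two input columns (standing in for pH and Fe) and the arsenic target are synthetic placeholder values. The function names are hypothetical.

```python
import numpy as np

def split_70_30(X, y, test_fraction=0.3, seed=0):
    """Shuffle the row indices and hold out `test_fraction` of the rows for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def min_max_scale(X_train, X_other):
    """Scale each column to [0, 1] using the training minimum and maximum only."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X_train - x_min) / span, (X_other - x_min) / span

# Synthetic placeholder data (NOT the study's measurements).
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(6.5, 7.9, 100),    # stand-in for pH
                     rng.uniform(0.002, 0.4, 100)])  # stand-in for Fe
y = rng.uniform(0.0, 0.6, 100)                       # stand-in for As

X_train, X_test, y_train, y_test = split_70_30(X, y)
X_train_s, X_test_s = min_max_scale(X_train, X_test)
print(X_train_s.shape, X_test_s.shape)
```

Taking the minimum and maximum from the training rows alone keeps information from the test set out of the model, at the cost that scaled test values may fall slightly outside the 0 to 1 range. In scikit-learn, `MinMaxScaler` and `train_test_split` provide the same operations.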
Random forest regressor model (RFR)
Breiman (2001) created the Random Forest Regressor Model (RFR), an ensemble technique that builds many decision trees using random sampling with replacement and combines their predictions of the target variable. The method, presented in Fig. 2 and explained in [21], uses the combined output of many decision trees to predict the arsenic level. Each decision tree is trained on a subset drawn at random with replacement from the training data, and this sampling is repeated once for each tree in the ensemble. A final prediction is then generated by combining the results of the individual decision trees.22 Because each tree is built from a bootstrap subsample of the training dataset, some samples are excluded from that subsample; these are known as “Out of Bag” (OOB) samples.23 The RFR model is internally cross-validated using these OOB samples.24 After the first RFR model was developed, hyperparameter tuning was used to improve the prediction of arsenic concentrations.
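The bootstrap, OOB validation, and hyperparameter tuning steps can be illustrated with scikit-learn. The sketch below is hedged on several assumptions: the source says only that Python was used, so the choice of scikit-learn, the synthetic data, and the small hyperparameter grid are illustrative and are not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder training data (NOT the study's measurements):
# two scaled inputs and a smooth nonlinear target with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(70, 2))
y = 0.3 * X[:, 0] + 0.2 * X[:, 1] ** 2 + rng.normal(0.0, 0.01, 70)

# bootstrap=True draws each tree's training subset with replacement;
# oob_score=True scores every sample using only trees that did not see it.
rf = RandomForestRegressor(n_estimators=200, bootstrap=True,
                           oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB R^2:", round(rf.oob_score_, 3))

# Hypothetical tuning grid, searched with 3-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best parameters:", search.best_params_)
```

The OOB score gives a validation estimate without setting aside extra data, which is useful when the dataset is as small as the one described here.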
Decision Tree regressor model (DTR)
Decision trees (DTs) are non-parametric machine learning methods used for regression and classification problems. Unlike black-box algorithms, DTs have an easy-to-understand decision-making process and are very intuitive [21]. The algorithm begins at a root node and proceeds through internal nodes, each of which tests an input variable, until it reaches a terminal node that gives the prediction. After the first split of the data into two subsets, the decision tree applies the same reasoning to divide each subsequent subset recursively. This procedure is repeated until the maximum designated depth is reached or no further split that reduces the loss function can be found [25]. Because of their simplicity and interpretability, DTs are widely used for both classification and regression. Compared with other algorithms, DTs handle numerical and categorical data efficiently and with fewer assumptions.
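The recursive splitting and the depth-based stopping rule can be seen in a small scikit-learn example. This is a sketch under stated assumptions: the data are synthetic, the feature names `pH_scaled` and `Fe_scaled` are hypothetical labels, and `max_depth=3` is chosen only to keep the printed tree short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic placeholder data (NOT the study's measurements): a step in the
# first input plus a linear trend in the second.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(70, 2))
y = np.where(X[:, 0] > 0.5, 0.2, 0.05) + 0.1 * X[:, 1]

# Splitting stops when max_depth is reached or a leaf would become too small.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, random_state=0)
tree.fit(X, y)

# The fitted rules can be printed and read directly, unlike a black-box model.
print(export_text(tree, feature_names=["pH_scaled", "Fe_scaled"]))
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Each printed branch is a threshold test on one input, and each leaf holds the mean target value of the training samples that reach it.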
Results and discussion
Statistical Summary of Water Quality Parameters
A statistical analysis of groundwater quality in two blocks of Begusarai District, Matihani and Cheria Bariarpur, was performed to understand the distribution of key physicochemical parameters and heavy metals. Table 1 presents the water quality statistics for Matihani Block (N = 50). The pH of the samples ranged from 6.7 to 7.9, with a mean of 7.3 and a standard deviation (SD) of 0.3142, which lies well within the acceptable range (6.5–8.5) specified by the Indian Standard (IS:10500, 2012). However, elevated concentrations of arsenic (As) were observed, ranging from 0.0163 mg/L to 0.203 mg/L, with a mean of 0.129 mg/L. This mean is about 13 times the permissible limit of 0.01 mg/L, suggesting a potential public health risk due to long-term exposure. Iron (Fe) levels ranged from 0.0031 mg/L to 0.398 mg/L, with a mean of 0.0412 mg/L; the mean remains within the acceptable limit of 0.3 mg/L, although the maximum exceeds it.
Table 1: Statistical description of Matihani Block, Begusarai District, N=50.

| Group | Parameter | Min. | Max. | Mean | S.D. | Acceptable limit (IS:10500, 2012) |
|---|---|---|---|---|---|---|
| Physicochemical parameter | pH | 6.7 | 7.9 | 7.3 | 0.3142 | 6.5-8.5 |
| Heavy metals (mg/L) | As | 0.0163 | 0.203 | 0.1290 | 0.0433 | 0.01 |
| Heavy metals (mg/L) | Fe | 0.0031 | 0.398 | 0.0412 | 0.0632 | 0.3 |

Table 2: Statistical description of Cheria Bariarpur block, Begusarai District, N=36.

| Group | Parameter | Min. | Max. | Mean | S.D. | Acceptable limit (IS:10500, 2012) |
|---|---|---|---|---|---|---|
| Physicochemical parameter | pH | 6.5 | 7.9 | 7.275 | 0.297 | 6.5-8.5 |
| Heavy metals (mg/L) | As | 0.0007 | 0.5900 | 0.1656 | 0.1299 | 0.01 |
| Heavy metals (mg/L) | Fe | 0.0020 | 0.2080 | 0.0600 | 0.0550 | 0.30 |

Similarly, the groundwater quality data for Cheria Bariarpur Block (N = 36) are summarized in Table 2. The pH values ranged from 6.5 to 7.9, with a mean of 7.275 and an SD of 0.297, again indicating neutral to slightly basic water that complies with the IS standard. The mean arsenic concentration was 0.1656 mg/L, with values as high as 0.5900 mg/L, far exceeding the acceptable limit of 0.01 mg/L. This block has an even higher mean arsenic level than Matihani, indicating more severe contamination. The mean concentration of iron was 0.0600 mg/L, varying between 0.0020 mg/L and 0.2080 mg/L, with all values within the permissible limit.
These results indicate that while the general physicochemical characteristics of groundwater (such as pH) are within acceptable bounds, the widespread occurrence of arsenic contamination in both blocks poses a significant environmental and health concern. The spatial variability in arsenic concentrations suggests possible geogenic sources, such as the dissolution of arsenic-bearing minerals, although anthropogenic contributions cannot be ruled out. Such findings underscore the necessity for regular monitoring and the implementation of suitable mitigation strategies to ensure safe drinking water for the affected population.
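To make the scale of the exceedance concrete, the tabulated arsenic values can be divided by the permissible limit. The short sketch below uses only the mean and maximum values from Tables 1 and 2 and the limit of 0.01 quoted in the same units; the function and variable names are illustrative.

```python
# Permissible arsenic limit from IS 10500:2012, in the same units as the tables.
AS_LIMIT = 0.01

# Mean and maximum arsenic concentrations as reported in Tables 1 and 2.
mean_as = {"Matihani": 0.1290, "Cheria Bariarpur": 0.1656}
max_as = {"Matihani": 0.203, "Cheria Bariarpur": 0.5900}

def exceedance_factor(concentration, limit=AS_LIMIT):
    """Return how many times the concentration exceeds the limit."""
    return concentration / limit

factors = {
    block: (exceedance_factor(mean_as[block]), exceedance_factor(max_as[block]))
    for block in mean_as
}
for block, (mean_f, max_f) in factors.items():
    print(f"{block}: mean is {mean_f:.1f}x the limit, maximum is {max_f:.1f}x")
```

On these figures the mean arsenic concentration is roughly 13 times the limit in Matihani and roughly 17 times the limit in Cheria Bariarpur.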
Evaluation of Machine Learning Model Performance
To predict groundwater quality and associated risk parameters with improved accuracy, multiple machine learning (ML) models were employed, and their performance was evaluated based on standard statistical indices, namely Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Nash–Sutcliffe Efficiency (NSE), and the coefficient of determination (R²). Table 3 summarizes the comparative performance of the Decision Tree Regressor (DTR) and the Random Forest Regressor (RFR).
Table 3: Performance of various ML models (performance criteria for the testing set)

| Model | MSE | RMSE | NSE | R² |
|---|---|---|---|---|
| DTRM | 0.002356 | 0.048538 | 0.874 | 0.874 |
| RFRM | 0.001867 | 0.043208 | 0.895 | 0.895 |
The DTRM exhibited an MSE of 0.002356 and an RMSE of 0.048538, with both NSE and R² at 0.874. These metrics indicate that the DTRM captures nonlinear relationships in the dataset reasonably well. However, the RFRM outperformed the DTRM, achieving a lower MSE (0.001867) and RMSE (0.043208), along with higher NSE and R² values of 0.895. The superior performance of the RFRM can be attributed to its ensemble-based nature, which reduces overfitting and improves generalization by aggregating predictions from multiple decision trees.
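The four indices can be computed from observed and predicted values as sketched below. This is an illustration rather than the study's code: the example numbers are made up, and R² is assumed to be defined as 1 − SS_res/SS_tot, which is algebraically the same expression as NSE. Under that assumption, identical NSE and R² columns are what one would expect.

```python
import math

def regression_metrics(observed, predicted):
    """Compute MSE, RMSE, NSE and R^2 (taken here as 1 - SS_res / SS_tot)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mse = ss_res / n
    nse = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "NSE": nse, "R2": nse}

# Made-up example values (NOT the study's data).
observed = [0.10, 0.20, 0.15, 0.30]
predicted = [0.12, 0.18, 0.15, 0.27]
metrics = regression_metrics(observed, predicted)
print(metrics)

# Consistency check on Table 3: RMSE should be the square root of MSE.
for name, mse, rmse in [("DTRM", 0.002356, 0.048538), ("RFRM", 0.001867, 0.043208)]:
    print(name, round(math.sqrt(mse), 6), rmse)
```

The reported RMSE values agree with the square roots of the reported MSE values to within rounding. If R² were instead computed as the squared Pearson correlation, it could differ from NSE.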
Overall, the results suggest that Random Forest is a more reliable and robust model for predicting groundwater quality parameters in arsenic-affected regions. Its improved accuracy and predictive efficiency make it a valuable tool for water quality assessment and management. Integrating such data-driven approaches with field-level water testing can significantly improve decision-making processes for public health and resource planning.
Conclusions
This study presents a comprehensive assessment of groundwater quality in two blocks of Begusarai District, Bihar—Matihani and Cheria Bariarpur—focusing on key physicochemical parameters and toxic heavy metals. While pH levels in both regions remain within the acceptable range defined by IS: 10500 (2012), arsenic concentrations significantly exceed permissible limits, posing serious health risks. Iron levels were found to be within safe boundaries, although variations were observed across locations. The elevated arsenic contamination suggests a geogenic origin, but the possibility of anthropogenic contributions cannot be dismissed. To enhance groundwater quality prediction, machine learning models were employed and evaluated. Among the tested models, the Random Forest Regression Model (RFRM) demonstrated superior performance with lower error values and higher predictive efficiency compared to the Decision Tree Regression Model (DTRM). These findings highlight the potential of ensemble learning techniques in supporting water quality monitoring and risk assessment frameworks. In conclusion, the combination of statistical analysis and machine learning provides a reliable methodology for identifying contamination hotspots and forecasting groundwater quality. The study emphasizes the urgent need for targeted mitigation strategies and continuous monitoring, especially in arsenic-prone areas, to ensure the provision of safe drinking water and protect public health.
Acknowledgement
We thank NIT Patna for providing laboratory facilities. No funding was obtained for this study.
Funding Sources
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Conflict of Interest
The author(s) do not have any conflict of interest.
Data Availability Statement
This statement does not apply to this article.
Ethics Statement
This research did not involve human participants, animal subjects, or any material that requires ethical approval.
References
1. WHO. 2008. Guidelines for Drinking-Water Quality: Second Addendum to Third Edition. World Health Organization 1: 1–103. http://www.who.int/water_sanitation_health/dwq/secondaddendum20081119.pdf.
2. WHO. Safe Water Technology for Arsenic Removal. UNICEF: 2001, 1–22.
3. Kanth, K.M., Singh, S.K., Kashyap, A., Gupta, V.K., Shalini, S., Kumari, S., Kumari, R., and Puja, K. Bacteriological assessment of drinking water supplied inside the Government schools of Patna District, Bihar, India. Am. J. Environ. Prot. 2018, 6(1), 10-13. http://pubs.sciepub.com/env/6/1/2/index.html.
4. Sarath Prasanth, S.V., Magesh, N.S., Jitheshlal, K.V., Chandrasekar, N., and Gangadhar, K. Evaluation of groundwater quality and its suitability for drinking and agricultural use in the coastal stretch of Alappuzha District, Kerala, India. Appl. Water Sci. 2012, 2, 165-175.