Development and validation of an explainable prediction model for schistosomiasis seropositivity: a population-based screening study in Hunan Province, China

Human schistosomiasis is a serious parasitic infection that affects hundreds of millions of people worldwide and has a significant impact on public health and socio-economic development (Buonfrate et al., 2025). The disease can lead to serious complications such as liver damage, renal dysfunction and gastrointestinal problems, causing a heavy burden on families and society (Jiang et al., 2023, LoVerde, 2019).

Reliable diagnostic methods are critical for the effective prevention and control of schistosomiasis, as well as for accurately identifying the target populations for treatment (Ally et al., 2024, Chen et al., 2021, Menezes et al., 2023). The World Health Organization (WHO) currently recommends the Kato-Katz (KK) technique for confirming intestinal schistosomiasis (Magalhaes et al., 2020). However, as the prevalence of the disease has declined significantly, this stool examination has become less suitable because of its low sensitivity, high false-negative rate, and poor compliance among the populations screened (Silva-Moraes et al., 2019, Weerakoon et al., 2015). The Indirect Hemagglutination Assay (IHA) has become the core screening technique for schistosomiasis owing to its high sensitivity, simplicity, and time efficiency (Wu, 2002, Zhu, 2005). Studies have indicated that the infection rate detected by IHA is approximately 4 to 5 times that identified by the KK method (IHA vs. KK prevalence ratio: 4.72, 95 % CI: 3.87–5.76) (Deng et al., 2018). Individuals who test seropositive may be infected with schistosomes and have the potential to become sources of transmission (LoVerde, 2019).

Many machine learning (ML) techniques have been applied to schistosomiasis prediction, with targets including advanced schistosomiasis, seropositivity rates, and intermediate host snails (Jiang et al., 2021, Tabo et al., 2024, Xu et al., 2024, Zhou et al., 2024). Although these ML models demonstrate good predictive performance, they are often limited by a lack of transparent interpretability, the so-called “black-box” problem. For example, they report feature importance but do not specify how these features drive schistosomiasis occurrence. Moreover, individual-level predictions are lacking, as most studies use population prevalence as the predicted outcome. We also note that very few ML models for predicting schistosomiasis have been externally validated or built for practical application. The SHapley Additive exPlanations (SHAP) technique is an effective method for explaining the outputs of ML models, as demonstrated in earlier studies (Hu et al., 2024, Lundberg and Lee, 2017, Tang et al., 2024). However, there remains a notable lack of research applying explainable machine learning to predict schistosomiasis. To address this research gap, we applied SHAP to predict schistosomiasis seropositivity from large-scale individual serological data, interpreting the roles of predictive variables at both the global and individual levels. We further performed external validation and built a predictive application. To our knowledge, the combination of explainable modeling and real-time deployment remains rare in this field.
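To make the idea of an individual-level explanation concrete, the following is a minimal sketch of the Shapley-value attribution on which SHAP is built (Lundberg and Lee, 2017). It is not the model or the code used in this study: the risk function `toy_risk`, its three binary predictors, and all coefficients are hypothetical, chosen only for illustration. The sketch enumerates every feature subset exactly, which is feasible only for a handful of features; practical SHAP implementations rely on faster model-specific or approximate algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one individual `x` relative to `baseline`.

    Features outside a coalition are held at their baseline value.
    """
    n = len(x)
    features = list(range(n))

    def value(subset):
        # Coalition members take the individual's values; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical seropositivity risk score (illustrative coefficients only)."""
    age_over_40, water_contact, high_risk_village = z
    return (0.02
            + 0.05 * age_over_40
            + 0.10 * water_contact
            + 0.08 * high_risk_village
            + 0.15 * water_contact * high_risk_village)  # interaction term


individual = [1, 1, 1]  # hypothetical person with all three risk factors
baseline = [0, 0, 0]    # hypothetical reference person with none
phi = shapley_values(toy_risk, individual, baseline)

for name, contribution in zip(
        ["age_over_40", "water_contact", "high_risk_village"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the individual's predicted risk and the baseline prediction, which is the additivity property that lets each prediction be decomposed feature by feature. In this toy example the interaction term is split equally between water contact and village risk level. Averaging the absolute contributions over many individuals gives the corresponding global view of feature importance.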

Previous studies on schistosomiasis primarily used environmental and socioeconomic variables, such as temperature, precipitation, and gross domestic product (GDP), for prediction at the regional level, and did not include more comprehensive, finer-scale factors (Chen et al., 2024, Xu et al., 2024). In this study, we systematically examined multiple factors influencing schistosomiasis, including demographic characteristics, behavioral factors, and the endemicity types and risk levels of villages. We selected five tree-based machine learning algorithms for prediction: Random Forest (RF), Decision Tree (DT), LightGBM, CatBoost, and XGBoost. Other studies have shown that these models perform well in terms of accuracy, computational efficiency, and handling of large-scale datasets, which makes them suitable for the multivariate, large-scale data used in our study (Hu and Li, 2022, Sadig et al., 2025, Wang et al., 2025).

Hunan Province is one of the areas of China most seriously affected by schistosomiasis (McManus et al., 2011). The prevalence in some areas of the province had reached over 50 % by the early 1950s. Since 2004, China has implemented an integrated control strategy focusing on infection source control. This strategy has led to significant achievements in controlling schistosomiasis transmission (Zhou et al., 2021). In 2015, Hunan Province achieved the transmission control standard, with the infection rate among residents falling below 1 % (Li et al., 2020, Xu et al., 2025). However, eliminating schistosomiasis remains challenging because of the limitations of pathogen detection, the presence of animal hosts, and the large number of lakes in the province (Lo et al., 2022, Wang et al., 2020). Under current low-prevalence conditions, our model aims to provide a simple, accurate, and highly acceptable method for identifying individuals at high risk of schistosomiasis, thereby supporting the goals of early prevention and disease elimination.
