As shown in Fig. 1, a total of 2882 IHCs treated with Peg-IFN, either as monotherapy or in combination with NAs, were included in this study. Patients were divided into an HBsAg seroclearance group (n = 541, 18.8%) and a non-clearance group (n = 2341, 81.2%) according to whether HBsAg seroclearance was achieved at week 24 of treatment.
Fig. 1
Flowchart of patient selection and model development
Among the study population, 67.0% were male (1919/2882), the median age was 42 years (interquartile range [IQR]: 36–49 years), and 97.0% were of Han ethnicity (2783/2882). Non-maternal–infant transmission was predominant (91%). The median baseline hepatitis B virus (HBV) DNA level was 0 IU/mL (IQR: 0–136.75), indicating low viral load typical of IHC patients.
As shown in Table 1, patients in the HBsAg seroclearance group had significantly lower baseline HBsAg levels than the non-clearance group [median (IQR): 8.46 IU/mL (1.00–59.52) vs. 136 IU/mL (25.87–447.40), p < 0.001]. Additionally, the clearance group had a slightly lower body mass index (BMI) [22.86 kg/m² (21.82–24.22) vs. 23.18 kg/m² (21.97–24.34), p = 0.02], and slightly higher white blood cell (WBC) counts [5.70 × 10⁹/L (4.73–6.72) vs. 5.49 × 10⁹/L (4.51–6.55), p = 0.01] and absolute neutrophil counts (ANCs) [3.21 × 10⁹/L (2.56–4.01) vs. 3.10 × 10⁹/L (2.42–3.91), p = 0.03]. Alpha-fetoprotein (AFP) levels were lower in the clearance group [2.66 ng/mL (2.03–3.24) vs. 2.75 ng/mL (2.04–3.56), p = 0.01], and the fibrosis-4 index (FIB-4) was also slightly lower [0.98 (0.72–1.32) vs. 0.99 (0.72–1.33), p = 0.02]; all of these differences were statistically significant.
Table 1 Baseline characteristics comparison between HBsAg loss and non-HBsAg loss groups at 24 weeks
Other baseline characteristics, including sex, age, ethnicity, transmission route, treatment regimen (Peg-IFN ± NAs), HBV DNA, HBsAb, HBeAg, ALT, aspartate aminotransferase (AST), total bilirubin (TBIL), total protein (TP), albumin (ALB), blood urea nitrogen (BUN), creatinine (Cr), direct bilirubin (DBIL), hemoglobin (HB), platelet count (PLT), LSM, and AST to platelet ratio index (APRI), showed no statistically significant differences between groups (all p > 0.05).
Characteristics of week 12 response
At week 12 of treatment, patients in the HBsAg seroclearance group demonstrated significantly stronger early virological and immunological responses. As shown in Table 1, the HBsAg level was significantly lower than that in the non-clearance group [0.22 IU/mL (0.01–6.47) vs. 57.02 IU/mL (7.27–280.54), p < 0.001], and the decline in HBsAg was more pronounced [1.28 log₁₀ IU/mL (0.39–2.38) vs. 0.18 log₁₀ IU/mL (−0.05–0.70), p < 0.001]. HBV DNA levels were also lower [0 IU/mL (0–11.08) vs. 0 IU/mL (0–24.50), p < 0.001]. Furthermore, HBsAb levels at week 12 were significantly higher in the clearance group [2.13 IU/L (0.03–6.75) vs. 0.54 IU/L (0–1.95), p < 0.001], suggesting a more robust immune response. ALT levels [59 U/L (41–77.68) vs. 57.77 U/L (40–85.9), p = 0.82] and fold changes in ALT [2.73 (1.88–4.31) vs. 2.72 (1.78–4.38), p = 0.39] showed no significant differences between the groups.
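The week-12 response measures reported here are simple transformations of paired measurements. The following is a minimal Python sketch, assuming that the HBsAg decline is the difference of log₁₀ levels (baseline minus week 12) and that the ALT fold change is the week-12 value divided by the baseline value. The function names and the floor used to avoid taking the logarithm of zero are illustrative assumptions and are not taken from the study.

```python
import math

# Hypothetical floor (IU/mL) for undetectable HBsAg; the study does not state one.
HBSAG_FLOOR = 0.01


def log10_decline(baseline, week12, floor=HBSAG_FLOOR):
    """Return log10(baseline) - log10(week12); positive values mean HBsAg fell."""
    b = max(baseline, floor)
    w = max(week12, floor)
    return math.log10(b) - math.log10(w)


def fold_change(baseline, week12):
    """Return week12 / baseline (for example, 2.0 means the value doubled)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return week12 / baseline


# Illustrative values only (not patient data from the study).
decline = log10_decline(100.0, 5.0)   # about 1.30 log10 IU/mL
alt_fold = fold_change(20.0, 55.0)    # 2.75-fold rise
```

A negative decline, as in the lower quartile of the non-clearance group, simply means that HBsAg rose between baseline and week 12.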
External validation cohort characteristics
The external validation cohort included 167 patients, of whom 33 (19.8%) achieved HBsAg seroclearance at week 24. According to Supplementary Table 2, there were no significant differences in key baseline variables, such as sex (male: 58.7%), median age (42 years), HBV DNA [0 IU/mL (0–152)], HBsAg [110 IU/mL (15.2–350)], and ALT [21 U/L (16–27)], when compared with the training set (n = 2017) and test set (n = 865) (all p > 0.05), indicating overall comparability across cohorts.
At week 12, the median HBsAg level in the external validation cohort was 26.9 IU/mL (IQR: 1.46–144), HBV DNA was 0 IU/mL (0–21), and ALT was 55 U/L (39–82); none of these showed significant differences compared with the training and test sets (all p > 0.05). The 24-week HBsAg seroclearance rates in the training, test, and external cohorts were 18.8%, 18.7%, and 19.8%, respectively (p = 0.95), supporting good external validity and generalizability of the model.
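The comparison of seroclearance rates across the three cohorts can be illustrated with a Pearson chi-square test on a 2 × 3 table. This is a sketch under stated assumptions: the source does not name the test used, and the event counts for the training and test sets (379 and 162) are reconstructed here from the reported rates and cohort sizes rather than quoted from the paper. Only the external count (33 of 167) is given directly.

```python
import math


def chi_square_2xk(events, totals):
    """Pearson chi-square statistic for a 2 x k table of events vs. non-events."""
    rate = sum(events) / sum(totals)  # pooled event rate under the null
    stat = 0.0
    for e, n in zip(events, totals):
        expected_events = n * rate
        expected_non_events = n * (1 - rate)
        stat += (e - expected_events) ** 2 / expected_events
        stat += ((n - e) - expected_non_events) ** 2 / expected_non_events
    return stat


def p_value_df2(stat):
    """Upper-tail p-value for a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Training, test, external. The first two counts are reconstructed (assumption).
events = [379, 162, 33]
totals = [2017, 865, 167]

stat = chi_square_2xk(events, totals)
p = p_value_df2(stat)
```

With these reconstructed counts the statistic is roughly 0.10 and the p-value roughly 0.95, in line with the reported value.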
Feature selection
To identify potential predictors of rapid hepatitis B surface antigen (HBsAg) seroclearance at 24 weeks, we conducted univariate logistic regression analysis in the training set to preliminarily screen candidate variables (p < 0.05). Subsequently, we applied two complementary feature selection methods, least absolute shrinkage and selection operator (LASSO) regression and the Boruta algorithm, to reduce dimensionality and eliminate collinearity among variables. LASSO regression was performed using the “glmnet” package in R (version 4.2.2), with the optimal regularization parameter determined by tenfold cross-validation. The Boruta algorithm, a wrapper built around a random forest (RF) classifier, was implemented using the “Boruta” package to iteratively assess variable importance and retain relevant features.
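The shrinkage step at the heart of LASSO can be illustrated with a short coordinate-descent sketch. The code below is a simplified linear-regression LASSO written in plain Python. It is not the penalized logistic model fitted with glmnet in the study, and the function names, toy data, and penalty value are illustrative assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (zj / n)
    return beta


# Toy data: y depends strongly on the first column and only weakly on the second.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]

beta = lasso_cd(X, y, lam=0.5)
```

In this toy example the weak predictor's coefficient is driven exactly to zero while the strong predictor's coefficient is shrunk, which is the behavior that makes LASSO usable for variable selection.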
Variables identified by either LASSO or Boruta were further reviewed for clinical interpretability and redundancy. Highly correlated variables (Pearson’s r > 0.85) were considered for exclusion to avoid model overfitting. The final set of features selected by this two-step method was used to construct all subsequent machine learning (ML) models (see Fig. 2).
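The redundancy check described here can be sketched as a pairwise Pearson correlation screen. This is a minimal illustration: the column names and values are hypothetical, and the use of the absolute value of r is an assumption, since the text states only r > 0.85.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated_pairs(columns, threshold=0.85):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(columns[names[i]], columns[names[j]])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], r))
    return flagged


# Hypothetical values for three candidate variables (not study data).
columns = {
    "wbc": [4.5, 5.0, 5.5, 6.0, 6.5],
    "anc": [2.4, 2.9, 3.1, 3.6, 4.0],
    "afp": [2.7, 2.0, 3.5, 2.2, 3.0],
}

flagged = flag_correlated_pairs(columns)
```

Flagged pairs are candidates for review rather than automatic removal, which matches the description of variables being "considered for exclusion".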
Fig. 2
Feature selection by two algorithms. a Boruta algorithm feature selection. Confirmed (green): key predictors; Rejected (red): non-significant; Shadow (gray): synthetic references; Tentative (orange): uncertain features. b LASSO regression coefficient profiles of variables in the training dataset. c Selection of the optimal parameter (lambda) in the LASSO regression. HBsAg 12w: HBsAg level at week 12; ALT12w/HBsAg 12w: ratio of ALT to HBsAg at week 12; ALT12w/HBsAg baseline: ratio of ALT at week 12 to baseline HBsAg; HBsAg baseline: baseline HBsAg level; (Δlog HBsAg × ΔALT)12w: product of log decline in HBsAg and increase in ALT at week 12; HBsAg 12w decline > 1log: whether HBsAg decline exceeds 1 log IU/mL at week 12; DNA 12w: HBV DNA level at week 12; HBsAb 12w: HBsAb level at week 12; nation: nationality of the patient; ALT 12w up: increase in ALT at week 12; ALT baseline: baseline ALT level; ANC baseline: baseline ANC; TBIL baseline: baseline TBIL level; DNA baseline: baseline HBV DNA level; HB 12w: HB level at week 12; ANC % baseline: baseline ANC percentage; HB baseline: baseline HB level; TP baseline: baseline TP level; ANC % 12w: ANC percentage at week 12; WBC baseline: baseline WBC; AFP baseline: baseline AFP level; HBeAg baseline: baseline HBeAg level; HBsAb baseline: baseline HBsAb level; HBcAb baseline: HBcAb level at baseline; HBeAb baseline: baseline HBeAb level; DBIL baseline: baseline DBIL level; TP 12w: TP level at week 12; Cr baseline: baseline Cr level; PLT 12w: PLT at week 12; AST baseline: baseline AST level; transmission: transmission mode; TBIL 12w: TBIL level at week 12; sex: sex of the patient; WBC 12w: WBC at week 12; DBIL 12w: DBIL level at week 12; ANC 12w: ANC at week 12; ALB baseline: baseline ALB level; ALB 12w: ALB level at week 12; PLT baseline: baseline PLT; medical history: the time of HBV infection; familyhistory: family history of hepatitis B infection; shadowMin: minimum shadow feature value (used for algorithm comparison); shadowMax: maximum shadow feature value (used for algorithm comparison)
Construction and evaluation of machine learning models
Model performance comparison
Based on the training set (n = 2017), internal testing set (n = 865), and external validation set (n = 167), we evaluated the performance of nine commonly used machine learning (ML) algorithms: logistic regression (LR), decision tree (DT), random forest (RF), gradient boosting (GB), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM, LGB), support vector machine (SVM), multilayer perceptron (MLP), and naïve Bayes (NB). Model performance metrics are summarized in Tables 2, 3, and 4 and Fig. 3.
Table 2 Performance comparison of different machine learning models in the training set
Table 3 Performance comparison of different machine learning models in the testing set
Table 4 Performance comparison of different machine learning models in the external validation set
Fig. 3
Model performance evaluation: a–b. Training Set: ROC Curves (Left) and Precision–Recall Curves (Right). c–d. Internal Testing Set: ROC Curves (Left) and Precision–Recall Curves (Right). e–f. External Validation Set: ROC Curves (Left) and Precision–Recall Curves (Right)
In the training set (Table 2), most models achieved AUC values above 0.82. The RF model had the highest AUC (0.926) and sensitivity (0.905), but its precision was relatively low (0.524). The LGB model demonstrated a balanced performance with AUC of 0.902, sensitivity of 0.889, accuracy of 0.770, and F1 score of 0.617. Compared with GB (AUC = 0.897) and XGBoost (AUC = 0.901), the LGB model showed superior specificity (0.770) and precision (0.472), and it did not exhibit the low precision (0.516) observed in the DT model, indicating greater overall discriminative ability and stability.
In the testing set (Table 3), the LGB model maintained stable performance with an AUC of 0.891, accuracy of 0.806, sensitivity of 0.827, and F1 score of 0.615, outperforming other gradient boosting models. Although XGBoost achieved a slightly higher AUC (0.901), its precision (0.545) and F1 score (0.645) were comparable to those of LGB. The SVM model had high specificity (0.865) but lower sensitivity (0.747) and less consistent performance across datasets. The MLP model showed excellent sensitivity (0.901) but underperformed in precision and accuracy, resulting in a lower F1 score (0.581) than that of LGB.
In the external validation set (Table 4), the LGB model achieved the highest AUC (0.917), with accuracy of 0.832, F1 score of 0.674, sensitivity of 0.879, and specificity of 0.821—outperforming GB (AUC = 0.909) and XGBoost (AUC = 0.905). Although SVM had the highest accuracy (0.886) and F1 score (0.753) in the external set, its relatively low AUC in the training set (0.808) suggests potential overfitting to the validation cohort’s feature distribution. The RF model also showed high sensitivity (0.939) but low precision (0.477), with an F1 score of 0.633, falling short of the LGB model. ROC and precision–recall curves (Fig. 3) further confirm the consistent discriminative ability of the LGB model across datasets.
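The threshold-dependent metrics quoted in this comparison all derive from a confusion matrix. The sketch below shows the arithmetic. The counts are an assumption: they were back-calculated here as one confusion matrix consistent with the LGB external-validation figures (33 seroclearance cases among 167 patients) and are not reported in the source.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # recall for the positive class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of precision and recall
    }


# Back-calculated, hypothetical counts: 29 + 4 = 33 positives, 110 + 24 = 134 negatives.
metrics = classification_metrics(tp=29, fp=24, tn=110, fn=4)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

These counts give a sensitivity of 0.879, specificity of 0.821, accuracy of 0.832, and F1 score of 0.674, matching the values in the text; the implied precision is about 0.547. Because the F1 score depends on precision, it is sensitive to the roughly 1:4 class imbalance in this cohort, whereas AUC is not tied to a single threshold.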
Final model selection
After comprehensive evaluation across all datasets, the LightGBM model was selected as the optimal predictive model for the following reasons:
Consistent performance: LGB maintained AUCs above 0.89 across the training, testing, and validation sets, with stable F1 scores ranging from 0.615 to 0.674 and no evidence of overfitting.
Strong external validation: LGB achieved the highest AUC (0.917) and maintained balanced sensitivity (0.879) and specificity (0.821).
Robust discriminative power: It offered balanced accuracy and F1 scores, performing well even in imbalanced data scenarios.
Interpretability and clinical utility: LGB allows feature attribution analysis using SHAP, aiding clinical interpretation.
Clinical benefit: Decision curve analysis (DCA) showed maximal net benefit within the threshold range of 0.2–0.8, with a low Brier score (0.091), indicating excellent calibration.
Hence, LGB was ultimately adopted as the primary model in this study. Furthermore, its predictive efficacy was maintained in subgroup analyses stratified by treatment regimen (Supplementary Tables 4 and 5), demonstrating generalizability independent of concomitant NA therapy.
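The two quantities cited under clinical benefit have simple definitions. The sketch below uses the standard formulas for the Brier score (the mean squared difference between predicted probability and outcome) and for the net benefit plotted in decision curve analysis. The toy outcomes and probabilities are illustrative assumptions, not study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy example (hypothetical outcomes and predictions).
y_true = [1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1]

bs = brier_score(y_true, y_prob)
nb = net_benefit(y_true, y_prob, threshold=0.3)
```

A decision curve is obtained by evaluating `net_benefit` over a grid of thresholds and comparing the result with the treat-all and treat-none strategies.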
Model interpretation
To further elucidate the prediction logic of the LGB model, we applied SHAP to perform both global and individual-level interpretation (Fig. 4).
Fig. 4
SHAP interpretation of the LightGBM model. a SHAP summary plot. Positive SHAP values (right) increase the predicted probability of HBsAg clearance, while negative values (left) lower it. Features are colored from low (blue) to high (red), illustrating how different value ranges affect predictions. b Feature importance ranking, reflecting each feature's average influence on the prediction of HBsAg clearance; higher values indicate greater importance. c, d SHAP force plots for individual predictions: contributions for a patient with HBsAg loss (c) and a patient without HBsAg loss (d). The baseline (the model's average output) serves as a reference. Positive contributions (red) raise the predicted HBsAg clearance probability, while negative ones (blue) reduce it
Global feature importance
As shown in Fig. 4a, b, the most predictive feature in the LGB model was the ratio of alanine aminotransferase to hepatitis B surface antigen at week 12 (ALT12w/HBsAg12w), with a mean absolute SHAP value of 0.970, substantially higher than that of any other variable. This was followed by hepatitis B surface antibody at week 12 (HBsAb12w, SHAP = 0.276), HBV DNA at week 12 (DNA12w, SHAP = 0.223), and baseline HBsAg level (HBsAg_baseline, SHAP = 0.047). The ratio of ALT12w to baseline HBsAg (ALT12w/HBsAg_baseline, SHAP = 0.034) also contributed modestly.
Interestingly, the variable “HBsAg decrease > 1 log IU/mL at week 12” (HBsAg12w_down = 1) had a SHAP value of 0, indicating limited independent predictive contribution, possibly due to multicollinearity with other dynamic markers.
SHAP directional analysis (Fig. 4a) revealed that a higher ALT12w/HBsAg12w ratio positively influenced the probability of HBsAg seroclearance. Similarly, elevated HBsAb12w and suppressed DNA12w levels contributed positively, reflecting enhanced immune response and viral suppression. In contrast, high baseline HBsAg levels had a negative impact on prediction outcomes (blue region shifted left). These trends align well with clinical observations in interferon-treated IHCs achieving rapid seroclearance.
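The global ranking discussed in this subsection is based on the mean absolute SHAP value per feature. The sketch below shows that aggregation, assuming a matrix of per-patient SHAP values is already available from an explainer. The three feature names come from the text, but the numbers are invented for illustration.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across patients."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    # Largest average contribution first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


feature_names = ["ALT12w/HBsAg12w", "HBsAb12w", "DNA12w"]

# Hypothetical SHAP values: one row per patient, one column per feature.
shap_rows = [
    [1.2, 0.3, -0.2],
    [-0.8, -0.2, 0.1],
    [0.9, 0.4, -0.3],
]

ranking = mean_abs_shap(shap_rows, feature_names)
```

Taking absolute values before averaging is what allows a feature that pushes predictions strongly in both directions to rank highly; the direction of effect is read from the summary plot instead.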
Individualized explanation
To illustrate the model’s individualized decision-making process, Fig. 4c, d present SHAP local interpretations for two representative cases.
Patient 864 (Fig. 4c) achieved HBsAg seroclearance, with a predicted probability of 0.72. This prediction was driven by a high ALT12w/HBsAg12w ratio (1350), undetectable DNA12w (0 IU/mL), moderate ALT12w/HBsAg_baseline ratio (0.72), and > 1 log IU/mL decline in HBsAg at week 12. Despite negative HBsAb, the combined effect of these features significantly shifted the SHAP value to the right, illustrating the model’s integrated logic.
In contrast, Patient 11 (Fig. 4d) did not achieve seroclearance, with a predicted probability of only 0.01. This patient had a high baseline HBsAg (664.39 IU/mL), low ALT12w/HBsAg12w ratio (0.15), no significant HBsAg decline at week 12, negative HBsAb, and a low ALT12w/HBsAg_baseline ratio (0.12), all of which suppressed the prediction probability.
These findings highlight the LGB model’s ability to not only identify key predictors at the population level but also provide logical and interpretable insights at the individual level, especially emphasizing the predictive value of dynamic ALT–HBsAg ratios.
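The individual explanations rely on the additive property of SHAP values: the model output for one patient equals a base value plus the sum of that patient's feature contributions. The sketch below assumes the contributions are expressed in log-odds, which is common for tree-based binary classifiers but is not stated in the source, and converts the total to a probability. The base value and contributions are hypothetical.

```python
import math


def probability_from_shap(base_value, contributions):
    """Sum the base value and per-feature SHAP values (log-odds), then apply the sigmoid."""
    log_odds = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical explanation for one patient (not taken from Fig. 4).
base_value = -1.5
contributions = {
    "ALT12w/HBsAg12w": 2.0,   # pushes toward seroclearance
    "DNA12w": 0.4,
    "HBsAb12w": -0.3,         # pushes away from seroclearance
}

prob = probability_from_shap(base_value, contributions)
```

In a force plot, each contribution appears as an arrow that moves the output away from the base value, and the final position corresponds to the predicted probability.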
Development of clinical decision tool
Based on the LGB model’s prediction probabilities, patients were stratified into low-, intermediate-, and high-risk groups (Supplementary Fig. 4). The 24-week HBsAg seroclearance rates differed significantly among risk strata (p < 0.001), except for the comparison between the low- and intermediate-risk groups in the external validation cohort (p < 0.05). For clinical application, we recommend considering discontinuation of Peg-IFN treatment in low-risk patients (probability < 0.3) to reduce adverse effects and financial burden, while continuing therapy with dynamic evaluation in intermediate- and high-risk groups.
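The stratification rule can be sketched as a simple threshold function. Only the low-risk cutoff (probability < 0.3) is given in the text. The boundary between the intermediate- and high-risk groups is not stated, so the value of 0.7 used below is a hypothetical placeholder, as are the function and constant names.

```python
LOW_CUTOFF = 0.3    # stated in the text
HIGH_CUTOFF = 0.7   # hypothetical; the source does not report this boundary


def risk_group(probability, low=LOW_CUTOFF, high=HIGH_CUTOFF):
    """Map a predicted seroclearance probability to a risk stratum."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < low:
        return "low"
    if probability < high:
        return "intermediate"
    return "high"


# Predicted probabilities quoted for the two representative patients.
group_patient_11 = risk_group(0.01)
group_patient_864 = risk_group(0.72)
```

Under these cutoffs the patient with a predicted probability of 0.01 falls in the low group, for which stopping Peg-IFN is to be considered; the assignment of the 0.72 case depends on the assumed upper cutoff.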
Furthermore, we developed a web-based clinical prediction tool based on the LGB model (https://starplan-mkwvjd9gnio3zeon5qqrsn.streamlit.app/; interface shown in Fig. 5). By inputting individual values for baseline HBsAg, HBsAg12w, ALT12w, DNA12w, and HBsAb12w, the tool provides a real-time probability (0–100%) of achieving HBsAg seroclearance, offering practical support for personalized treatment strategies in IHC patients.
Fig. 5
Web-based calculator interface for clinical application
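The model features named in the SHAP analysis can be derived from the five inputs the tool requests. The sketch below shows one way this mapping might look. It is not the tool's actual code: the function name, the floor used to avoid division by zero when HBsAg is undetectable, and the exact feature set passed to the model are all assumptions.

```python
import math

# Hypothetical floor (IU/mL) for undetectable HBsAg; not stated in the source.
EPS = 0.01


def build_features(hbsag_baseline, hbsag_12w, alt_12w, dna_12w, hbsab_12w, eps=EPS):
    """Derive ratio and decline features from the five raw inputs."""
    base = max(hbsag_baseline, eps)
    wk12 = max(hbsag_12w, eps)
    log_decline = math.log10(base) - math.log10(wk12)
    return {
        "HBsAg_baseline": hbsag_baseline,
        "ALT12w/HBsAg12w": alt_12w / wk12,
        "ALT12w/HBsAg_baseline": alt_12w / base,
        "HBsAg12w_down": 1 if log_decline > 1 else 0,  # decline of more than 1 log
        "DNA12w": dna_12w,
        "HBsAb12w": hbsab_12w,
    }


# Illustrative inputs only (not patient data from the study).
features = build_features(
    hbsag_baseline=100.0, hbsag_12w=0.05, alt_12w=60.0, dna_12w=0.0, hbsab_12w=2.0
)
```

For these inputs the week-12 ratio is about 1200 and the decline flag is set, a profile similar in direction to the seroclearance case described in the individualized explanation.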