Aims:
This study aims to develop an interpretable machine learning (ML) model for predicting the occurrence of advanced diabetic kidney disease (DKD), with the objective of identifying patients at an early stage of the disease, thereby facilitating timely and appropriate clinical intervention.
Methods:
Variable selection was performed using a combination of the least absolute shrinkage and selection operator (LASSO) and recursive feature elimination (RFE) techniques. A prediction model was constructed and validated using eight ML algorithms, and the model’s performance was evaluated using area under curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, Brier score, calibration curve, and decision curve analysis (DCA). The SHapley Additive exPlanation (SHAP) and partial dependence plot (PDP) methods were employed to interpret the model both locally and globally. Finally, the prediction model was integrated into a network platform based on the Shiny application for direct use by clinicians and patients.
Results:
Serum creatinine, age, hemoglobin, serum urea, serum ALP, serum UA, platelet count, serum osmolality, serum bicarbonate, and monocyte count were identified as the most important variables in the advanced DKD model. Eight ML models were developed using these 10 variables. Among them, the logistic regression (LR) model demonstrated accurate predictive ability in both internal and external validation, with AUCs of 0.948 (95% CI: 0.920-0.975) and 0.898 (95% CI: 0.883-0.913), respectively. Furthermore, the LR model exhibited excellent performance in terms of accuracy, sensitivity, PPV, NPV, F1 score, and Brier score. The calibration curve and DCA also indicated a high degree of consistency between the predicted and observed risks of the LR model, with a net benefit across nearly the full range of threshold probabilities. The model is available free of charge to clinicians as an LR-based online calculator: https://dev2333.shinyapps.io/logistics1/.
Conclusion:
This study developed and validated an interpretable LR model for predicting the occurrence of advanced DKD. The LR model can assist clinical practice by effectively identifying individuals at higher risk of advanced DKD at an early stage, allowing patients to receive timely and personalized treatment, and thereby providing a reliable foundation for improving patient prognosis and optimizing medical resource utilization.
Introduction
Diabetic kidney disease (DKD), as one of the most severe microvascular complications of diabetes, has emerged as the leading cause of end-stage renal disease (ESRD) worldwide (1). According to the most recent report from the International Diabetes Federation, over 10.5% (536.6 million individuals) of adults globally are affected by diabetes, a figure projected to rise to 12.2% (783.2 million individuals) by 2045 (2). As the prevalence of diabetes continues to increase, the number of DKD cases worldwide has seen an explosive surge. Approximately 30-40% of diabetic patients will develop DKD, with about 50% progressing to ESRD, ultimately facing renal failure and requiring renal replacement therapy (RRT) (3). It is projected that by 2030, global RRT usage will more than double, increasing from 2.618 million individuals in 2010 to 5.439 million (4). Although RRT saves lives, its expanded use undeniably imposes a substantial economic burden on countries globally, particularly in many low- and middle-income nations. Consequently, early identification of high-risk patients in the middle and late stages of DKD, coupled with targeted interventions, has become essential for improving DKD prognosis and enhancing cost-effectiveness across nations.
Although several commonly used indicators, such as estimated glomerular filtration rate (eGFR) and urine albumin/creatinine ratio (UACR), are employed to assess the risk of DKD, they exhibit limitations in the early detection of advanced DKD, including low sensitivity and specificity, as well as a narrow detection window (5–7). For example, eGFR may underestimate the extent of kidney damage during early screening and fail to promptly detect minor changes in kidney function (8). Some patients with DKD may not exhibit proteinuria during the early stages of the disease or even as the disease progresses, resulting in false-negative results and, consequently, a missed diagnosis (9). Therefore, there is a pressing need to develop more sensitive and precise predictive tools to enable timely intervention while diabetic nephropathy remains in the subclinical stage.
With the rapid advancement of machine learning (ML) and artificial intelligence, particularly their widespread application in the medical field, numerous studies have begun to investigate the prediction of the onset and progression of DKD using ML. Chan et al. developed the KidneyIntelX model to predict advanced DKD using the random forest (RF) algorithm (10). Although it achieved a relatively high negative predictive value (NPV) of 0.9, its area under curve (AUC) was only 0.77, which remains unsatisfactory. Another approach involves constructing a recurrent neural network model by utilizing electronic medical records (EMRs) (11). This model demonstrates stable and high prediction accuracy for advanced DKD, but its interpretability is limited. The ML model developed by Zou et al. successfully predicted the risk of ESRD in patients with DKD (12). However, this study did not account for the validation of multi-center data. This suggests that while the development of ML models for predicting the risk of advanced DKD has yielded preliminary results, challenges remain, particularly regarding the interpretability of the models and the reliability of multi-center validation.
The present study aims to develop and validate an interpretable ML model for forecasting the occurrence of advanced DKD, utilizing a multi-center dataset, ML algorithms, SHapley Additive exPlanation (SHAP) and partial dependence plot (PDP) methods, to facilitate the early identification of patients with advanced DKD, thereby enabling timely and appropriate clinical intervention.
Methods
Study design and population
Details of the study design are shown in Figure 1. The present study collected data from 2359 patients diagnosed with DKD who were admitted to Fuzhou University Affiliated Provincial Hospital between January 2013 and December 2024; these patients formed the internal dataset. The inclusion criteria were as follows: (1) fasting plasma glucose (FPG) ≥ 126 mg/dL (7.0 mmol/L), or 2-h plasma glucose (2-h PG) ≥ 200 mg/dL (11.1 mmol/L) during the oral glucose tolerance test (OGTT), or hemoglobin A1c (HbA1c) ≥ 6.5% (48 mmol/mol), or random plasma glucose ≥ 200 mg/dL (11.1 mmol/L) (13); (2) for a minimum of 3 months, the presence of any of the following: albumin-to-creatinine ratio (ACR) ≥ 30 mg/g (3 mg/mmol), urine sediment abnormalities, persistent hematuria, electrolyte and other abnormalities caused by tubular disorders, histological abnormalities, structural abnormalities identified through imaging, or a history of kidney transplantation (14); (3) patients aged ≥ 18 years. The definitions of the various stages of chronic kidney disease (CKD) were based on the eGFR categories (G1–G5) (14). Based on the CKD stage, all patients were categorized into early DKD (G1-G2, n = 223) and advanced DKD (G3-G5, n = 2136). The study protocol adhered to the guidelines of the Declaration of Helsinki and was approved by the Ethics Committee of Fuzhou University Affiliated Provincial Hospital (K2025-02-116). This study was retrospective, and all data were anonymized, thus waiving the requirement for informed consent from patients.
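The grouping rule described above can be illustrated with a short Python sketch. It assumes the standard KDIGO eGFR cut-offs (G1 ≥ 90, G2 60-89, G3a 45-59, G3b 30-44, G4 15-29, G5 < 15 mL/min/1.73 m²). The text cites the G1–G5 categories but does not list the thresholds, so these values and the function names `ckd_g_category` and `dkd_group` are illustrative assumptions, not the study's code.

```python
def ckd_g_category(egfr: float) -> str:
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO G category.

    The cut-offs are the standard KDIGO ones and are assumed here;
    the paper only refers to the categories G1-G5.
    """
    if egfr < 0:
        raise ValueError("eGFR must be non-negative")
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


def dkd_group(egfr: float) -> str:
    """Collapse the G category into the two outcome groups used in the study."""
    return "early" if ckd_g_category(egfr) in ("G1", "G2") else "advanced"
```

Under these assumed thresholds, an eGFR of 60 falls in the early group, while any value below 60 is labeled advanced.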

Flow chart of the study design. NHANES, national health and nutrition examination survey; LASSO, least absolute shrinkage and selection operator; XGB, eXtreme gradient boosting; RF, random forest; RFE, recursive feature elimination; LR, logistic regression; SVM, support vector machine; NB, Naive Bayes; LGBM, light gradient boosting machine; GBM, gradient boosting machine; CV, cross-validation; AUC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value; DCA, decision curve analysis; SHAP, SHapley Additive exPlanation.
Data collection
In this study, data from 43 variables were extracted from the EMR of inpatients, which included the following: (1) demographic information: age, gender, alcohol consumption history, smoking history, and BMI; (2) comorbidities: hypertension, anemia, heart failure, malignancy; (3) laboratory indicators: serum urea, serum uric acid (UA), serum inorganic phosphate (IP), serum creatinine, serum calcium, serum albumin, serum bicarbonate, serum glucose, neutrophil count, white blood cell (WBC) count, lymphocyte count, monocyte count, hemoglobin, hematocrit, platelet count, serum high-density lipoprotein cholesterol (HDLC), serum low-density lipoprotein cholesterol (LDLC), serum alkaline phosphatase (ALP), serum total bilirubin (TBIL), serum gamma-glutamyl transferase (GGT), serum alanine transaminase (ALT), serum aspartate transaminase (AST), serum total protein (TP), serum osmolality, serum lactate dehydrogenase (LDH), serum globulin, serum apolipoprotein B (apoB), serum apolipoprotein AI (apoAI), serum total cholesterol (TC), serum triglycerides (TG), plasma fibrinogen, urinary albumin, ferritin, and serum C-reactive protein (CRP).
External validation
This study utilized National Health and Nutrition Examination Survey (NHANES) data spanning from January 1988 to December 2018 for external validation. The NHANES protocol, which was approved by the National Center for Health Statistics Research Ethics Review Board, adhered to rigorous ethical standards, and all participants provided written informed consent. A total of 1559 patients with DKD were included in the external validation dataset. Patients were categorized into 979 cases in the early group (G1–G2) and 580 cases in the advanced group (G3–G5). The inclusion criteria and the collected variables for all patients were consistent with those of the internal dataset.
The NHANES was chosen as the external validation set for four key methodological and clinical reasons: (1) Heterogeneous population verification: The internal multicenter dataset originated from tertiary hospitals in Eastern China (predominantly Han Chinese), whereas NHANES is a nationally representative cross-sectional survey in the United States, including diverse ethnic groups (Caucasian, African American, Hispanic, etc.), different healthcare systems, and varying lifestyle characteristics. This cross-ethnic, cross-regional, cross-healthcare system validation is the gold standard for evaluating the generalizability of prediction models and effectively assesses the applicability of the model to diverse DKD patient populations globally. (2) Standardized and high-quality data: NHANES uses standardized laboratory testing methods, strict quality control, and detailed clinical variable collection, thereby ensuring the reliability of the external validation data. (3) Complementary clinical scenario: The internal dataset included patients with confirmed DKD who were referred to tertiary hospitals (severe disease bias), whereas NHANES includes community-dwelling individuals with DKD (mild to moderate disease). This complementary scenario allows us to verify the model’s performance in both clinical and community settings, which is essential for its utility in early screening. (4) Public availability and reproducibility: NHANES data are publicly available, allowing other researchers to reproduce our model and verify the results independently, thereby enhancing the study’s transparency and scientific rigor.
Data preprocessing
Missing data in the internal dataset and the NHANES external dataset were imputed independently with the missForest algorithm; each dataset was imputed on its own, and the two were not combined before imputation. missForest has been reported to outperform established imputation methods, such as k-nearest neighbor imputation and multivariate imputation using chained equations. It can simultaneously handle multivariate data consisting of both continuous and categorical variables, without requiring parameter tuning (15). Imputing the missing values in this way preserves the full sample for subsequent analyses.
Selection of variables
A combination of LASSO and RFE was employed to select the quantitative variables: (1) LASSO used 10-fold cross-validation (CV) to select the optimal λ value, i.e., the value that minimized the cross-validation error. This process shrank the coefficients of non-essential variables to zero, yielding an initial variable set with significant predictive value. (2) RFE was then applied to this initial variable set, reducing the number of variables from the maximum down to one based on the LR model. The optimal variable set was determined from the RFE accuracy curve by selecting the number of variables that gave the maximum model accuracy.
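For reference, the L1-penalized objective behind the LASSO step can be written as follows for a binary outcome. The notation is assumed rather than taken from the study: $y_i \in \{0, 1\}$ indicates advanced DKD, $x_i$ is the vector of the $p$ candidate variables for patient $i$, and $n$ is the training sample size.

```latex
\hat{\beta}(\lambda) =
  \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}
  \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
        - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Larger values of $\lambda$ shrink more coefficients exactly to zero. The value retained is the one with the smallest 10-fold CV error, and the variables with non-zero coefficients at that value form the initial set passed to RFE.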
Model development and validation
The internal dataset was randomly divided into a training set and a test set in a 7:3 ratio. Eight machine learning models, including LASSO, RF, eXtreme gradient boosting (XGB), logistic regression (LR), light gradient boosting machine (LGBM), gradient boosting machine (GBM), support vector machine (SVM), and naive Bayes (NB), were used to predict the occurrence of advanced DKD. Hyperparameter optimization was performed on all eight ML models to ensure optimal model performance. The grid search method was used for hyperparameter tuning, and the performance of different hyperparameter combinations was evaluated by 10-fold CV on the internal training set. For each model, the optimal hyperparameter set was selected based on the highest AUC value (the primary evaluation measure).
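The split and tuning procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's R code: the split is a simple seeded shuffle (the text does not say whether it was stratified), and `cv_score` is a hypothetical callback that would return the mean 10-fold CV AUC for one hyperparameter combination.

```python
import random
from itertools import product


def train_test_split_indices(n, train_frac=0.7, seed=0):
    """Randomly split the row indices 0..n-1 into training and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, cv_score):
    """Return the hyperparameter combination with the highest CV score.

    `cv_score(params)` is a hypothetical user-supplied function, e.g. the
    mean AUC over the folds produced by `k_fold_indices`.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because the folds are built only from the training indices, the held-out 30% never influences hyperparameter selection.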
The performance of the prediction model was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), NPV, F1 score, and Brier score. Additionally, calibration curves were plotted to assess the accuracy of the model’s predicted probabilities, and decision curve analysis (DCA) was performed to evaluate the net benefit of the model across different decision thresholds. The model was developed based on the training set, after which the prediction model with superior performance—according to the aforementioned evaluation metrics—was selected and subsequently validated on the test set and external datasets.
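The threshold-based metrics, the Brier score, and the net benefit used in DCA all follow from the predicted probabilities and the observed labels. The sketch below shows the standard definitions in plain Python. The function names and the default threshold of 0.5 are illustrative assumptions, not the study's implementation.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics and Brier score for binary labels (0/1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        # Brier score: mean squared difference between probability and outcome.
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(pairs),
    }


def net_benefit(y_true, y_prob, pt):
    """Net benefit at threshold probability pt (0 < pt < 1), as used in DCA."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= pt)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= pt)
    return tp / n - fp / n * pt / (1 - pt)
```

Evaluating `net_benefit` over a grid of `pt` values and plotting the result against the "treat all" and "treat none" strategies gives the decision curve.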
Model explanation
SHAP was employed to interpret the prediction model. SHAP is a model interpretation method grounded in game theory. It provides both local and global explanations by calculating the average contribution of each variable to the model’s predictions, thus addressing the “black box” problem and enhancing the model’s transparency and interpretability. Moreover, the PDP visually illustrates the marginal effect of a single variable on the model’s predictions, aiding in the identification of complex nonlinear relationships between variables and prediction outcomes, thereby enhancing the transparency of the model’s decision-making process. SHAP and PDP analyses were performed in R (version 4.5.0) and were calculated and visualized using the shapviz (version 0.9.0) and pdp (version 0.8.1) packages.
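To make "the average contribution of each variable" concrete, the sketch below computes exact Shapley values for one patient by enumerating every coalition of features. It is a didactic Python illustration, not what shapviz does internally. It assumes that features outside a coalition are replaced by a fixed `background` value such as the training mean, and its cost grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of `predict` at instance `x`.

    Features absent from a coalition take their `background` value
    (a simplifying assumption). The values sum to
    predict(x) - predict(background).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[j] if j in coalition else background[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

For a linear predictor, such as the log-odds of an LR model, each value reduces to the coefficient multiplied by the feature's distance from its background value, which is why SHAP summaries of LR models are easy to check by hand.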
Network calculator
To facilitate the application of the model in a clinical setting, the final prediction model was integrated into a web platform built as a Shiny application. When the values of the relevant variables in the final model are provided, the application returns the predicted probability of advanced DKD, identifies the important features, and generates SHAP beeswarm and waterfall plots.
Determination of minimum sample size
The sample size required for this study was determined based on the clinical prediction model sample size calculation method proposed by Riley et al. (16). The first step determined the number of samples needed to accurately estimate the average risk or mean. The second step determined the number of samples required to keep the error between the model’s predicted and true values small. The third step determined the sample size needed to prevent overfitting. The fourth step determined the sample size needed to keep the difference between the apparent and actual performance of the model small. Finally, the maximum value from these four calculations was selected as the required sample size.
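Two of the four criteria have simple closed forms for a binary outcome and are sketched below in Python, following the formulas published by Riley et al. The defaults (a 0.05 margin of error and a target shrinkage factor of 0.9) are the values commonly recommended in that work. They are assumed here, because the text does not report the inputs the authors used, and the function names are illustrative.

```python
from math import ceil, log


def n_for_overall_risk(prevalence, margin=0.05):
    """Step 1: sample size to estimate the overall outcome risk within +/- margin."""
    return ceil((1.96 / margin) ** 2 * prevalence * (1 - prevalence))


def n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Step 3: sample size that keeps expected shrinkage at or above `shrinkage`.

    `r2_cs` is the anticipated Cox-Snell R^2 of the model and `n_params`
    the number of candidate predictor parameters.
    """
    return ceil(n_params / ((shrinkage - 1) * log(1 - r2_cs / shrinkage)))
```

With 24 candidate parameters and an anticipated Cox-Snell R² of 0.288, `n_for_shrinkage` returns 623, and a prevalence of 0.5 gives 385 from `n_for_overall_risk`. The required sample size is the maximum across all four criteria.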
Statistics
Statistical analyses were performed using R (version 4.5.0, R Foundation). Continuous variables with a normal distribution are presented as the mean ± standard deviation and were compared using the t-test. Continuous variables with skewed distributions are presented as medians with interquartile ranges and were compared using the Mann–Whitney U test or the Kruskal–Wallis H test. Categorical variables are presented as counts with percentages and were compared using the chi-square test. The AUC was used to evaluate predictive power, and the optimal cutoff value was determined by maximizing the Youden index. A two-tailed P value < 0.05 was considered statistically significant.
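The AUC and the Youden-index cutoff mentioned above can both be computed directly from the predicted scores. The following Python sketch uses the rank interpretation of the AUC and scans every observed score as a candidate cutoff. The function names are illustrative, and the quadratic-time AUC is chosen for clarity rather than speed.

```python
def auc_rank(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(y_true, y_score):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A case is classified positive when its score is >= cutoff.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_cut, best_j = None, -1.0
    for cut in sorted(set(y_score)):
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= cut)
        tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

If several cutoffs tie on J, the lowest one is returned, because the scan runs in ascending order and only a strictly larger J replaces the current best.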
Results
Clinical characteristics
Table 1 presents the demographic and clinical characteristics of all patients in the internal and external datasets. In the internal dataset, compared with patients in the early stage of DKD, those in the advanced stage of DKD exhibited a higher incidence of hypertension, anemia, and heart failure. Additionally, patients with advanced DKD were older and had a lower BMI. The assessment of laboratory characteristics revealed that serum urea, serum UA, serum IP, serum creatinine, neutrophil count, serum ALP, serum osmolality, serum LDH, and serum globulin levels were all higher in patients with advanced DKD. In contrast, these patients exhibited decreased levels of serum calcium, serum albumin, serum bicarbonate, lymphocyte count, hemoglobin, hematocrit, platelet count, serum HDLC, serum LDLC, serum TBIL, serum ALT, serum TP, serum apoB, serum apoAI, serum TC, and serum ferritin compared to patients with early DKD.
Table 1. Comparison of the demographic and clinical characteristics of patients with early DKD and advanced DKD in both internal and external datasets.
DKD, diabetic kidney disease; BMI, body mass index; Serum UA, serum uric acid; Serum IP, serum inorganic phosphorus; WBC, white blood cell count; Serum HDLC, serum high-density lipoprotein cholesterol; Serum LDLC, serum low-density lipoprotein cholesterol; Serum ALP, serum alkaline phosphatase; Serum TBIL, serum total bilirubin; Serum GGT, serum gamma glutamyl transferase; Serum ALT, serum alanine aminotransferase; Serum AST, serum aspartate aminotransferase; Serum TP, serum total protein; Serum LDH, serum lactate dehydrogenase; Serum apoB, serum apolipoprotein B; Serum apoAI, serum apolipoprotein AI; Serum TC, serum total cholesterol; Serum TG, serum triglycerides; Serum CRP, serum C-reactive protein. Bold values indicate statistically significant differences.
In the external dataset, the incidences of hypertension, anemia, heart failure, and malignant tumors in patients with advanced DKD were significantly higher than those in patients with early DKD, and the age of patients with advanced DKD was also greater. The laboratory characteristic assessment revealed that, compared with patients with early DKD, patients with advanced DKD exhibited elevated levels of serum urea, serum UA, serum creatinine, neutrophil count, serum osmolality, serum LDH, serum apoB, serum TC, serum TG, plasma fibrinogen, and urinary albumin. In contrast, serum albumin, lymphocyte count, hemoglobin, hematocrit, platelet count, serum TBIL, serum GGT, serum ALT, and serum AST levels decreased significantly in patients with advanced DKD.
Selection of variables
As shown in Figures 2A–C, the LASSO CV curve indicates that the optimal λ value, corresponding to the minimum CV error, occurs when the number of variables is 10. This identifies the top 10 variables as the primary screening variable set, which exhibits significant predictive value. Based on the LR model, RFE was performed on this primary screening variable set. The RFE accuracy curve demonstrated that when the number of variables was 10, the model’s accuracy reached its maximum value and remained stable. In contrast, when the number of variables was reduced to fewer than 10, the model’s accuracy decreased (Figure 2D). Ultimately, the optimal variable set consists of: serum creatinine, age, serum urea, hemoglobin, serum osmolality, platelet count, serum bicarbonate, serum ALP, monocyte count, and serum UA (Figure 2E).

LASSO and RFE were combined for variable screening. (A) The LASSO coefficient path; (B) The LASSO cross-validation error curves; (C) The absolute importance of variables obtained through LASSO screening; (D) The relationship between the number of variables in RFE and the accuracy; (E) The absolute importance of variables obtained through RFE screening.
Model development and validation
On the basis of the training set, eight ML models were constructed using the optimal set of variables and further validated using the test set for internal validation and the NHANES dataset for external validation. As shown in Figures 3A and 3C, in the training set, the LR model showed excellent performance in terms of discrimination and net benefit, with an AUC of 0.941 (95% CI: 0.926-0.956) and net benefit across the full threshold range of 0-1.0. In addition to good calibration (Brier score = 0.05), the LR model also had high accuracy (0.931), sensitivity (0.977), PPV (0.948), NPV (0.707), and F1 score (0.962) (Figures 3B, D; Table 2).