The COPDGene Study is a 21-center, ongoing longitudinal observational study of non-Hispanic White (NHW) and African American (AA) individuals, most of whom had a > 10 pack-year smoking history [13]. Study subjects were initially enrolled from 2007 to 2011, and each subject received extensive lung phenotyping at five-year intervals, incorporating spirometry, quantitative computed tomography (qCT) imaging phenotypes, and questionnaires. The current study used the 5-year follow-up (Phase 2) data, which included more comprehensive features linked to dyspnea, such as the self-reported Hospital Anxiety and Depression Scale [14]. The data used for this analysis included 5016 individuals aged 45–80 with > 10 pack-years of smoking history. External validation used the ECLIPSE study, which included 2290 subjects from the United States (US) and Europe, the majority of whom were NHW, aged 40–75, with a smoking history of > 10 pack-years.
Feature Selection and Data Preprocessing
Response Variable
In this paper, the presence of dyspnea (yes/no) was the dependent variable for association analyses and predictive models. Former and current smokers with dyspnea were defined as those with a self-reported modified Medical Research Council (mMRC) dyspnea score of 2 or higher, a cutpoint identified in previous literature [15, 16]. The mMRC dyspnea scale ranges from 0 (dyspnea only on strenuous exercise) to 4 (dyspnea when dressing or too dyspneic to leave the house) [16, 17].
Feature Selection
Feature selection occurred in two steps. First, clinical domain knowledge and the literature guided the selection of features from three data types: clinical history, spirometry, and qCT imaging. Next, we excluded variables with over 20% missing values. Correlations between continuous variables (Pearson’s correlation) and categorical variables (Kendall’s rank correlation) were calculated (Supplemental Figure 1). For pairs of variables with correlations exceeding 0.8, only one variable, chosen on the basis of clinical relevance, was retained for analysis to minimize multicollinearity. Detailed information on the final set of variables and their definitions is listed in Supplemental Table 1.
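The two filtering rules described above (the 20% missingness threshold and the 0.8 correlation threshold) can be illustrated with a minimal Python sketch. This is not the authors' code, which was written in R. The variable names and toy values are hypothetical, missing values are represented as None, only Pearson's correlation is shown, and an explicit priority ordering stands in for the clinical-relevance judgment used to pick which member of a correlated pair to keep.

```python
import math

def missing_fraction(values):
    """Fraction of entries that are None (treated here as missing)."""
    return sum(v is None for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

def select_features(data, priority, max_missing=0.20, max_corr=0.80):
    """Keep variables in priority order (assumed proxy for clinical relevance),
    skipping any with too many missing values or any that correlate too
    strongly with a variable that has already been kept."""
    kept = []
    for name in priority:
        if missing_fraction(data[name]) > max_missing:
            continue
        if any(abs(pearson(data[name], data[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept

# Hypothetical toy data, for illustration only.
toy = {
    "fev1_pp": [80.0, 60.0, 45.0, 90.0, 70.0],
    "fev1_l": [2.9, 2.1, 1.5, 3.2, 2.5],           # nearly collinear with fev1_pp
    "emphysema_pct": [1.0, None, None, 2.0, 5.0],  # 40% missing
    "bmi": [24.0, 31.0, 27.0, 22.0, 35.0],
}
selected = select_features(toy, ["fev1_pp", "fev1_l", "emphysema_pct", "bmi"])
```

In this toy example, `fev1_l` is dropped for its high correlation with the higher-priority `fev1_pp`, and `emphysema_pct` is dropped for missingness.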
Data Preprocessing
The COPDGene study dataset was randomly divided into a training and a test set. Eighty percent of the subjects comprised the training cohort for dyspnea prediction model development, and the remaining 20% comprised the test cohort for internal validation. The ECLIPSE dataset was used for external validation.
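A random 80/20 split of this kind can be sketched as follows. The seed, the function name, and the use of integer subject identifiers are assumptions made for illustration; the source does not state how the randomization was seeded or whether it was stratified by outcome.

```python
import random

def train_test_split_ids(subject_ids, train_fraction=0.8, seed=2024):
    """Shuffle subject identifiers with a fixed (arbitrary) seed and cut the
    shuffled list at the requested training fraction."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 5016 is the analysis sample size reported above; the IDs are placeholders.
train_ids, test_ids = train_test_split_ids(range(5016))
```

Splitting on subject identifiers rather than on rows keeps every subject entirely in one cohort, which matters if a dataset holds more than one record per subject.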
Statistical Analysis
Bivariable and multivariable logistic regression analyses examined the association between the presence of dyspnea (yes/no), the dependent variable, and the clinical history, spirometry, and qCT imaging variables, which served as independent variables. Continuous variables were mean-centered and scaled (divided by their standard deviation) before regression analyses. Each variable was tested for association with dyspnea in a separate bivariable model (i.e., one model for each variable of interest) and in a separate multivariable model adjusting for GOLD spirometric stage [6], age, sex, and race. The variable “GOLD spirometric stage” was defined as spirometry-defined COPD severity as used by GOLD and coded into four groups for these analyses: normal spirometry (GOLD stage 0), preserved ratio impaired spirometry (PRISm), mild COPD (GOLD stage 1), and moderate to severe COPD (GOLD stages 2–4; collapsed into a single category). Smokers with normal spirometry were the reference group. Finally, multivariable regression analysis with interaction terms was conducted to test whether GOLD spirometric stage interacted with comorbidities, spirometry, and qCT imaging factors in their association with dyspnea.
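The two preprocessing steps in this paragraph, standardizing continuous variables and collapsing GOLD stage into four groups, can be sketched in Python. The string labels used as dictionary keys are hypothetical and do not reflect the actual coding in the COPDGene data, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source says only that variables were divided by the standard deviation.

```python
import statistics

def standardize(values):
    """Mean-center and divide by the sample standard deviation (assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical input labels mapped to the four analysis groups.
GOLD_GROUPS = {
    "GOLD 0": "normal spirometry",  # reference group
    "PRISm": "PRISm",
    "GOLD 1": "mild COPD",
    "GOLD 2": "moderate to severe COPD",
    "GOLD 3": "moderate to severe COPD",
    "GOLD 4": "moderate to severe COPD",
}

def collapse_gold(stage):
    """Map a spirometric stage label to its four-level analysis group."""
    return GOLD_GROUPS[stage]

z_scores = standardize([1.0, 2.0, 3.0])
```

After standardization, each regression coefficient for a continuous variable is expressed per one standard deviation increase, which makes effect sizes comparable across variables measured in different units.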
Prediction Model Training
Elastic net regression was used to build the dyspnea predictive model using the ‘glmnet’ R package [18]. Alpha and lambda values were optimized entirely within the training set using the cross-validation procedure implemented in ‘glmnet’. The final models were fit exclusively on the training data using the optimal alpha and lambda values derived from the training set. Subsequently, to test for interactions between GOLD spirometric stage (COPD severity) and clinical history, spirometry, and chest CT imaging in relation to dyspnea, we developed a linear hierarchical pairwise-interaction model, known as the Group-Lasso INTERaction-Net (GLINTERNET) model, using the ‘glinternet’ R package [19]. The GLINTERNET model automatically selects and estimates main effects and pairwise interactions with COPD severity using group lasso regularization, while enforcing strong hierarchy so that an interaction is included only when its main effects are also in the model. Additionally, to evaluate whether model performance varied by COPD status, we conducted stratified analyses by fitting three distinct elastic net prediction models: (1) all former and current smokers; (2) former and current smokers without COPD (GOLD stage 0); and (3) former and current smokers with moderate to very severe COPD (GOLD stages 2–4).
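For reference, the penalized objective that ‘glmnet’ minimizes for a binary outcome can be written as below. The symbols are introduced here for illustration and are not taken from the source: N is the number of training subjects, y_i is the dyspnea indicator, x_i is the vector of predictors, beta_0 and beta are the intercept and coefficients, and alpha and lambda are the two tuning parameters mentioned above.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
+ \lambda\Bigl[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
      + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

Setting alpha to 1 gives the lasso penalty and setting it to 0 gives the ridge penalty; intermediate values mix the two, while lambda controls the overall strength of shrinkage.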
Model Comparisons with Different Variable Sets
To develop clinically relevant prediction models for dyspnea, we developed eleven models using different combinations of variables and tested their performance. Supplemental Table 1 shows the variable sets used.
Prediction Model Evaluation
For each model, we calculated the area under the receiver operating characteristic curve (AUROC) to measure discrimination, and used the Brier score and the Spiegelhalter z test to measure calibration [20]. Pairwise comparison of model performance was performed using the DeLong test. Variable importance was assessed using the ‘varImp’ function in the ‘caret’ R package, which calculates variable importance scores based on the beta-coefficients from an elastic net model [21]. We then selected the top 10 continuous variables and the top 10 categorical variables and created variable importance plots. All analyses were conducted using R software (version 4.1.3). To ensure transparency and completeness in reporting, the study adhered to the TRIPOD guidelines [22]. The TRIPOD checklist (Supplemental Table 2) was used to guide the presentation of study objectives, data sources, statistical methods, and results.
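The three performance measures named above can be computed from observed outcomes and predicted probabilities as in the following Python sketch. This is an illustration using the standard textbook definitions, not the authors' R code; the toy outcomes and probabilities are hypothetical, and the DeLong test is not shown.

```python
import math

def auroc(y_true, y_prob):
    """Probability that a randomly chosen case receives a higher predicted
    risk than a randomly chosen non-case; ties count as one half.
    Pairwise and therefore O(n^2), which is fine for a small illustration."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def spiegelhalter_z(y_true, y_prob):
    """Spiegelhalter z statistic; approximately standard normal when the
    predicted probabilities are well calibrated."""
    num = sum((y - p) * (1 - 2 * p) for y, p in zip(y_true, y_prob))
    var = sum((1 - 2 * p) ** 2 * p * (1 - p) for p in y_prob)
    return num / math.sqrt(var)

# Hypothetical toy outcomes and predicted probabilities.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
auc = auroc(y, p)
brier = brier_score(y, p)
z = spiegelhalter_z(y, p)
```

A lower Brier score indicates better overall accuracy of the predicted probabilities, and a Spiegelhalter z statistic far from zero suggests miscalibration.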