Analysis of deep learning-based segmentation of lymph nodes on full-dose and reduced-dose body CT

Patient sample

This retrospective study utilized two publicly available Cancer Imaging Archive (TCIA) datasets: (1) the NIH CT Lymph Node dataset containing 176 contrast-enhanced CT volumes [22, 23] and (2) the mediastinal lymph node quantification (LNQ) dataset [24]. The study is Health Insurance Portability and Accountability Act compliant and approved by the Institutional Review Board at Institution-A. The requirement for signed informed consent from the patients was waived. Table 1 describes the patient characteristics and the CT acquisition parameters. A full description of the diverse patient disease etiologies for both datasets is provided in Supplementary Tables 1 and 2. Figure 1 summarizes the data collection procedure using a Standards for Reporting Diagnostic Accuracy (STARD) chart.

Table 1 Patient characteristics and CT acquisition technique for the two public TCIA datasets

Fig. 1 Standards for Reporting Diagnostic Accuracy (STARD) chart describing the patient sample used to evaluate two fully automated deep learning-based approaches for lymph node segmentation

Supplementary Table 1 describes the disease etiologies for the patients in the public TCIA NIH CT Lymph Node dataset. Briefly, the patients had lymphadenopathy arising from various diseases, such as melanoma, chronic lymphocytic leukemia (CLL), lymphoma, mesothelioma, and cancers of the thyroid, lung, thymus, breast, esophagus, liver, pancreas, kidneys, colon, and ovaries, among others. Identification and removal of 25 duplicate volumes resulted in a total of 151 full-dose CT volumes for this study. Of these, 90 volumes (filenames starting with “MED”) were used for training, including 87 chest-abdomen-pelvis scans, 1 chest-abdomen scan, 1 abdomen-pelvis scan, and 1 chest scan. The remaining 61 volumes (filenames starting with “ABD”) were reserved for testing and consisted of 56 chest-abdomen-pelvis scans and 5 abdomen-pelvis scans.

The external public TCIA mediastinal LNQ dataset contained 513 patients and came from multiple institutions (Massachusetts General Hospital, Dana-Farber Cancer Institute, Brigham and Women’s Hospital). Partial annotations of lymph nodes were provided for 393 patients, while 120 patients had all mediastinal lymph nodes fully annotated. The fully annotated data subset was selected for analysis in this work. Supplementary Table 2 provides full details on the diverse patient disease characteristics for this data subset. Briefly, it included patients with non-small cell lung cancer (NSCLC), CLL, lymphoma, and various other cancer types.

Reference standard

Segmentation masks were provided with the TCIA NIH CT Lymph Node dataset for lymph nodes (LNs) that were considered enlarged and clinically significant (short axis diameter [SAD] ≥ 10 mm) [22, 23]. However, a thorough review showed that several LNs with SAD ≥ 10 mm had not been annotated in this dataset. Additionally, in many instances, masses were annotated instead of nodes.

To obtain a reliable reference standard, two board-certified radiologists (30+ years of experience) and four radiology residents (2+ years of experience) were recruited to identify LNs of all sizes in the 151 full-dose CT volumes. The residents initially annotated 8973 LNs in the 90 full-dose CTs (average of 99 LNs per volume). The testing data subset of 61 CTs was verified entirely by the two board-certified radiologists. The first radiologist reviewed the resident annotations and identified the center point (3D-coordinate) of nodes that were either missed or incorrectly annotated. Next, a research fellow (2+ years of experience) manually corrected the annotations. Finally, the second radiologist independently reviewed the nodes identified by the first radiologist and annotations performed by the fellow. If the second radiologist spotted annotation errors or any nodes that were missed, their centers were marked, and the same research fellow corrected them. For example, annotations made incorrectly in the pericardial recess were erased. A total of 6368 LNs were identified in the 61 volumes by the senior radiologists.

The full dataset contained 15,341 LNs of all sizes. To account for variability in measurements by radiologists, this work considered LNs with SAD ≥ 8 mm to be large and nodes with SAD ≥ 3 mm but below 8 mm to be small. Nodes with SAD < 3 mm were not considered as they were less likely to be of clinical significance. The training dataset contained 2139 large nodes and 5966 small nodes (868 nodes with SAD < 3 mm were ignored). The test dataset contained 800 large nodes and 4078 small nodes (1490 nodes with SAD < 3 mm were ignored).
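
The size grouping described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names and the example SAD values are hypothetical, while the thresholds and the node counts are taken from the text.

```python
from collections import Counter

LARGE_MIN_MM = 8.0  # SAD >= 8 mm is treated as "large" in this study
SMALL_MIN_MM = 3.0  # SAD >= 3 mm and < 8 mm is treated as "small"


def size_group(sad_mm: float) -> str:
    """Assign a node to a size group from its short axis diameter (mm)."""
    if sad_mm >= LARGE_MIN_MM:
        return "large"
    if sad_mm >= SMALL_MIN_MM:
        return "small"
    return "ignored"  # SAD < 3 mm: excluded from the analysis


def count_groups(sads_mm):
    """Count how many nodes fall into each size group."""
    return Counter(size_group(s) for s in sads_mm)


# Made-up SAD values (mm), for illustration only.
example = [2.1, 3.0, 7.9, 8.0, 17.0]
print(count_groups(example))

# Consistency check on the counts reported in the text.
assert 2139 + 5966 + 868 == 8973   # training: large + small + ignored
assert 800 + 4078 + 1490 == 6368   # test: large + small + ignored
assert 8973 + 6368 == 15341        # full dataset
```

Note that the boundaries are inclusive at the lower end, so a node with SAD of exactly 8 mm is large and one with exactly 3 mm is small.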

Generation of reduced-dose CT

Reduced-dose CTs were simulated with a previously validated image-domain noise-addition tool [25]. The tool injected synthetic noise into the full-dose CT by estimating and replicating the magnitude, texture, and non-stationary properties of noise observed in reduced-dose CT acquisitions. Specifically, it performed forward projection of the CT image data, white noise generation in the sinogram domain, filtering based on the image-specific noise power spectrum, followed by back-projection and dose-dependent scaling of the noise prior to its addition to the image. The reduced-dose CTs were used to evaluate how a 3D lymph node segmentation model trained only on full-dose CT performs when the image quality has been degraded. Figure 2 shows an example of the degraded quality of the reduced-dose CT in comparison to the original full-dose CT.

Fig. 2 Visual example of degraded image quality and lymph node (red) visualization on a full-dose CT (left) vs. 10% dose CT (right). The yellow arrows highlight two enlarged lymph nodes in the retroperitoneum of a male patient (age 61 years) with short axis diameter (SAD) of 2.3 cm and 1.7 cm, respectively. Two smaller lymph nodes are not clearly visible in the reduced-dose CT

For each of the 61 full-dose testing volumes, five corresponding reduced-dose CTs were generated at 75%, 50%, 25%, 10%, and 5% dose settings. As a result, including the full-dose CT, a total of 366 CT volumes were available for testing (61 CTs at each dose). As the anatomy did not change when simulating a reduced-dose CT, the segmentation mask from the full-dose CT was directly transferred to the reduced-dose CT.
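
To give a feel for the dose-dependent scaling step, the following is a deliberately simplified sketch and not the validated tool from [25]. It assumes that noise variance is inversely proportional to dose and that the added noise is independent of the noise already present, so the extra standard deviation needed at dose fraction f is sigma_full * sqrt(1/f - 1). It adds white Gaussian noise directly in the image domain and omits the forward projection, noise power spectrum filtering, and non-stationarity described above. The function names and the `sigma_full` value are hypothetical.

```python
import numpy as np

DOSE_FRACTIONS = (0.75, 0.50, 0.25, 0.10, 0.05)  # dose settings from the text


def added_noise_sigma(sigma_full: float, dose_fraction: float) -> float:
    """Std. dev. of the noise to add so that total noise matches the lower dose.

    Assumes variance scales as 1/dose and the added noise is independent.
    """
    if not 0.0 < dose_fraction <= 1.0:
        raise ValueError("dose_fraction must be in (0, 1]")
    return float(sigma_full * np.sqrt(1.0 / dose_fraction - 1.0))


def simulate_reduced_dose(volume_hu, sigma_full, dose_fraction, seed=0):
    """Add white Gaussian noise to a CT volume (simplified; no texture model)."""
    rng = np.random.default_rng(seed)
    sigma_add = added_noise_sigma(sigma_full, dose_fraction)
    noise = rng.normal(0.0, sigma_add, size=volume_hu.shape)
    return volume_hu + noise


# Toy volume and a hypothetical full-dose noise level of 10 HU.
volume = np.zeros((4, 16, 16), dtype=np.float64)
reduced = {f: simulate_reduced_dose(volume, 10.0, f) for f in DOSE_FRACTIONS}

# One full-dose plus five reduced-dose versions per test volume.
assert 61 * (1 + len(DOSE_FRACTIONS)) == 366
```

Because the anatomy is unchanged, each simulated volume keeps the shape and voxel grid of its source, which is why the full-dose mask can be reused as described above.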

Deep learning model

Figure 3 shows the overall framework. A previously validated deep learning-based 3D model [26] trained with the 90 full-dose CTs was used to segment LNs. We refer to this model as the “NoAugmentation” model henceforth. The model was built with the nnU-Net framework [27], which is regarded as the de facto standard for segmentation tasks because of its strong performance on many of them, such as multi-organ segmentation in CT and MRI [27]. It has often outperformed other architectures, such as transformer-based approaches [28]. The 3D full-resolution nnU-Net model segmented 29 different structures, which included LNs and other anatomical structures (e.g., liver, pancreas, skeleton). As LNs lie adjacent to major organs, the anatomical priors were used to reduce the number of false positives by distinguishing nodes from neighboring structures. At test time, only the predicted LNs were retained while the remaining 28 classes were discarded.

Fig. 3 Framework for the detection and segmentation of lymph nodes on full- and reduced-dose body CT. The input CT volume (full or reduced-dose) was fed to a 3D full-resolution nnU-Net that segmented the lymph nodes (green) along with other anatomical structures (various colors). The reference annotation in 2D as well as in 3D is shown for comparison
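
The test-time step of keeping only the lymph node class from the multi-class prediction could look like the sketch below. The label index assigned to lymph nodes is not given in the text, so `LN_LABEL = 1` is a placeholder assumption, as is the function name.

```python
import numpy as np

LN_LABEL = 1  # hypothetical class index for lymph nodes; not stated in the text


def keep_lymph_nodes(label_map: np.ndarray, ln_label: int = LN_LABEL) -> np.ndarray:
    """Return a binary mask with 1 where the model predicted a lymph node.

    All other predicted classes (the remaining anatomical structures)
    are mapped to background.
    """
    return (label_map == ln_label).astype(np.uint8)


# Toy 2D label map with several classes, for illustration only.
prediction = np.array([[0, 1, 5],
                       [1, 28, 1]])
print(keep_lymph_nodes(prediction))
```

The other classes still matter during training, since they give the network anatomical context, but they play no part in the evaluation.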

Comparison

Another 3D nnU-Net model was trained on an augmented dataset comprising both the full-dose CTs and the simulated reduced-dose CTs. The augmented dataset had 540 CT volumes, including the full-dose CTs and the reduced-dose CTs generated at 75%, 50%, 25%, 10%, and 5% dose settings (90 CTs at each dose). We refer to this model as the “Augmentation” model henceforth. The performance of the Augmentation model was compared against the NoAugmentation model described above.

Statistical analysis

Detection performance was evaluated using precision (or positive predictive value), sensitivity (or recall), and F1-score (harmonic mean of precision and sensitivity). Segmentation performance was quantified with Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) error (in mm). Dice score measures the overlap between the reference annotation and the prediction, while HD error estimates the distance between the two. Higher Dice scores and lower HD errors indicate better performance. Metrics were calculated for three groups: (1) all LNs, (2) large nodes with SAD ≥ 8 mm, and (3) small nodes with SAD ≥ 3 mm but below 8 mm. Model performance at different doses was statistically evaluated using one-way repeated measures Analysis of Variance (ANOVA) and post-hoc tests that adjusted (Bonferroni) for multiple comparisons (“anova_test” and “wilcox_test” functions, “rstatix” package, RStudio v. 2024.04.2+764). A p-value < 0.05 was considered statistically significant.
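
For readers unfamiliar with these metrics, a minimal sketch of their standard definitions follows. It is not the evaluation code used in the study. The rule for matching predicted nodes to reference nodes (i.e., how true positives are counted) is not specified in this section, so the detection function simply takes the counts as inputs. The Hausdorff function is a brute-force version over all foreground voxels that is only practical for small masks; dedicated toolkits typically work on surface voxels instead. All function names are hypothetical.

```python
import numpy as np


def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, sensitivity, and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    f1 = 2.0 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


def dice(ref, pred) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    ref = np.asarray(ref, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    total = ref.sum() + pred.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(ref, pred).sum() / total)


def hausdorff_mm(ref, pred, spacing_mm) -> float:
    """Symmetric Hausdorff distance (mm) between two non-empty binary masks."""
    spacing = np.asarray(spacing_mm, dtype=float)
    a = np.argwhere(np.asarray(ref, dtype=bool)) * spacing
    b = np.argwhere(np.asarray(pred, dtype=bool)) * spacing
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both masks must contain at least one voxel")
    # Pairwise Euclidean distances between all foreground voxels.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Toy example: two 1 x 3 masks that overlap in one voxel.
ref = np.array([[1, 1, 0]])
pred = np.array([[0, 1, 1]])
print(detection_metrics(tp=8, fp=2, fn=2))
print(dice(ref, pred))
print(hausdorff_mm(ref, pred, spacing_mm=(1.0, 2.0)))
```

Passing the voxel spacing explicitly keeps the HD error in millimeters rather than voxel units, which matters when in-plane and through-plane spacing differ.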
