Preliminary analysis of AI-based thyroid nodule evaluation in a non-subspecialist endocrinology setting

The present study evaluated the impact of an AI-based diagnostic support system on the ultrasound assessment of thyroid nodules in a screening setting, performed by specialists without specific training in thyroid imaging and in a context of very low malignancy prevalence. To date, studies on these systems have focused on subspecialist environments or have been conducted by experts, either in training or with experience [10, 12]. However, their utility in low-complexity settings, where thyroid nodules are predominantly benign, has not been explored.

Our results indicate that using AI-DSS independently of clinician evaluation does not significantly improve diagnostic accuracy, optimize risk stratification, or reduce the number of referrals for additional studies in subspecialized units. In fact, this study observed low agreement between the evaluations performed by GE and the AI-DSS alone, particularly in key risk descriptors such as composition, echogenic foci, and echogenicity. Overall, the AI-DSS adopted a conservative approach when classifying nodules, leading to a higher number of follow-up or FNA recommendations compared to the clinical evaluation by the GE.
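Agreement between two raters on a categorical descriptor, such as composition or echogenicity, is commonly summarized with a chance-corrected statistic such as Cohen's kappa. The sketch below is illustrative only: this section does not state which agreement statistic was used, and the function name `cohen_kappa` and the toy labels are assumptions, not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined, report 1.0.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for four nodules (invented for illustration, not study data).
ge_labels = ["spongiform", "spongiform", "solid", "solid"]
ai_labels = ["solid", "spongiform", "solid", "solid"]
kappa = cohen_kappa(ge_labels, ai_labels)  # 0.5: observed 0.75, chance 0.5
```

Because kappa subtracts the agreement expected by chance, two raters who differ sharply in how often they assign a category (for example, spongiform) can show low kappa even when raw agreement looks acceptable.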

In recent years, various AI models have demonstrated diagnostic performance comparable to that of expert endocrinologists and radiologists in thyroid nodule ultrasound assessment, achieving similar sensitivity and specificity in risk classification according to scales such as ACR TI-RADS [13]. The AI algorithm analyzed in this study has been shown to reduce interobserver variability and optimize decision-making, decreasing unnecessary FNAs without compromising the detection of malignant nodules [10, 12, 14]. However, these studies were conducted in highly subspecialized centers, with the participation of clinicians with advanced training in thyroid imaging, where malignancy prevalence was relatively high [10, 12, 14]. This raises questions about the applicability of this AI-DSS in a setting with lower thyroid ultrasound expertise and low malignancy prevalence, where the main goal is screening and referral rather than immediate risk stratification.

In this scenario, AI could serve as a potential support tool to improve workflow and optimize referrals to subspecialized units. To our knowledge, this is the first study to evaluate the performance of an AI-DSS in a real-world low-complexity clinical environment, providing autonomous performance data outside of subspecialized reference centers or validation studies.

Significant differences were observed in thyroid nodule classification between GE, AI-DSS, and TNC evaluation, particularly in nodule composition, echogenicity, and echogenic foci. While GE classified 38.7% of nodules as spongiform, AI-DSS assigned this category to only 1.3%, even though spongiform nodules are virtually always benign [1]. Furthermore, AI-DSS classified 52% of nodules as hypoechoic or very hypoechoic, compared to 16% reported by GE. Regarding echogenic foci, AI-DSS detected comet tail artifacts in only 1.3% of cases, whereas GE reported them in 21.3%, a difference with a substantial impact on final risk categorization. Similarly, when classified using ATA guidelines, AI-DSS reduced the proportion of nodules categorized as benign (1.6% vs. 22.7% by GE) and increased high-suspicion classifications (39.7% vs. 4%). These results further support that AI-DSS follows a more conservative approach, tending toward risk overestimation.

Several plausible explanations exist for the observed discrepancies. These include technical characteristics of ultrasound imaging, the ultrasound equipment used, non-standardized image settings, the static nature of analyzed images, and the clinical environment for which AI-DSS was trained. Although Koios DS is designed to process DICOM-format images and is compatible with various ultrasound devices, its performance may be affected by image quality and settings [15]. In this study, images were obtained by endocrinologists without specific imaging expertise, though they routinely used ultrasound as a screening tool in daily practice. Additionally, pre-study training was conducted to ensure correct image acquisition. This scenario is comparable, if not superior, to what would be encountered in general medicine settings. Furthermore, the static nature of the images may have hindered echogenic foci assessment, particularly in differentiating microcalcifications from comet tail artifacts. This limitation is common to most commercially available AI systems, regardless of the environment in which they are tested. Despite previous studies demonstrating that AI-DSS performance is comparable to that of a subspecialist in thyroid nodule evaluation, guidelines recommend its use as an adjunctive tool. Our findings reinforce the importance of considering AI-DSS as a complementary tool within clinical evaluation, especially in low-complexity environments dominated by benign nodules, rather than as a replacement for physician assessment.

The Koios DS regulatory framework emphasizes that its use should serve as an additional support to clinical evaluation by trained endocrinologists, without replacing human diagnostic interpretation [10, 12]. However, the risk of fully delegating ultrasound assessment to AI systems remains a concern, particularly in settings with limited training in thyroid nodule pathology. Similarly, prioritization of nodules requiring further investigation in subspecialized units is a key factor that AI systems must address to ensure optimal performance.

Thyroid ultrasound is primarily a screening tool, designed to rule out malignancy and minimize the need for invasive procedures, ensuring a high negative predictive value. Indeed, new guidelines and classification systems aim to reinforce the reduction of procedures, minimizing overdiagnosis and overtreatment of differentiated thyroid carcinoma [1, 16]. Previous studies conducted in high-expertise settings with greater malignancy prevalence have demonstrated that AI-DSS can improve diagnostic accuracy and reduce interobserver variability [10, 12]. However, in our study, conducted in a real-world clinical environment with very low malignancy prevalence, AI-DSS tended to overclassify malignancy risk, leading to more FNA recommendations (37.3% with AI vs. 30.7% without AI). This suggests that transferring AI models trained in high-risk settings to lower malignancy prevalence environments may compromise their performance, leading to excessive referrals without clear clinical benefits. Therefore, it is crucial to adapt and train AI algorithms specifically for these clinical settings to optimize their utility and prevent over-referrals and associated risks. As shown in Table 3, agreement analysis demonstrated low concordance in all ACR TI-RADS and ATA characteristics between the GE and AI-DSS. However, when comparing TNC subspecialist evaluations with AI-DSS, moderate-to-high agreement was observed for most ultrasound features. This level of agreement is comparable to that observed between GE and TNC subspecialist and is consistent with previous studies [6, 7, 17]. From our perspective, the increased agreement when analyzing nodules of intermediate-to-high suspicion that were considered FNA candidates by GE supports the notion of an AI-DSS adaptability issue in low malignancy prevalence settings.
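The prevalence argument above can be made concrete with Bayes' rule: for a fixed sensitivity and specificity, the positive predictive value falls as malignancy prevalence falls, so a larger share of positive calls become unnecessary referrals. The sketch below is a minimal illustration under stated assumptions: the sensitivity, specificity, and prevalence values are hypothetical, the function name `predictive_values` is invented for this example, and none of the numbers are estimates from this study (which included no malignant nodules).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule (hypothetical helper)."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Expected fractions of the whole population in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    # Guard against division by zero when no positive (or negative) calls occur.
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else float("nan")
    npv = true_neg / (true_neg + false_neg) if (true_neg + false_neg) > 0 else float("nan")
    return ppv, npv


# Hypothetical operating point, held constant across both settings.
SENSITIVITY = 0.90
SPECIFICITY = 0.80

# Hypothetical prevalences: a referral-center-like setting vs. a screening-like setting.
ppv_high, npv_high = predictive_values(SENSITIVITY, SPECIFICITY, prevalence=0.10)
ppv_low, npv_low = predictive_values(SENSITIVITY, SPECIFICITY, prevalence=0.01)

print(round(ppv_high, 3), round(ppv_low, 3))  # 0.333 0.043
```

With these assumed values, the same classifier goes from roughly one true malignancy per three positive calls to fewer than one per twenty, while the negative predictive value stays high in both settings.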

Finally, the use of AI-DSSs is inherently influenced by the risk stratification scale applied. In our study, 31.9% of thyroid nodules were reclassified depending on whether the ATA 2015 or ACR TI-RADS system was used, as shown in Fig. 1. These discrepancies between classification systems are well documented in the literature [18], and they can significantly affect both diagnostic performance and clinical recommendations, even when such recommendations are generated by the AI itself.

This study represents the first analysis of AI-DSS use in a real-world clinical setting characterized by low malignancy prevalence. Despite the limited sample size, the statistical power calculation supports the robustness of the findings, which reflect thyroid nodule evaluation in general practice. The random selection of thyroid nodules and the participation of non-thyroid subspecialist endocrinologists helped minimize bias and enhance the study’s external validity. Furthermore, our findings align with published literature, reinforcing their reliability. Finally, the absence of malignant thyroid pathology in the analyzed sample may be considered a limitation, as it does not allow for the evaluation of the diagnostic performance of the AI-DSS. However, this reflects the real-world scenario of thyroid nodule assessment in low-malignancy settings, where the prevalence of thyroid nodules is high and ultrasound is used exclusively as a screening tool.

In conclusion, non-adjunct AI-DSS use did not significantly improve risk stratification or reduce hypothetical referrals for additional studies to subspecialized units in a low-complexity thyroid nodule setting. The system tended to overestimate risk, potentially leading to unnecessary procedures. Additionally, this study found low agreement between AI-DSS and general endocrinologists in various ultrasound descriptors, highlighting the need for further optimization of AI tools in low-prevalence environments.
