This study evaluated interobserver agreement in TIL scoring using two distinct approaches: WS and TMAs. The findings offer important insights into the reproducibility of TIL assessment and the challenges affecting observer consistency. The interobserver agreement for WS (Cohen’s kappa between 0.199 and 0.288) suggests only fair agreement, which is consistent with some reports about BC and other tumor types. For example, in the Danish study conducted by Tramm et al., kappa values were in the range of 0.38–0.46, corresponding to a fair to moderate agreement.[23] Similarly, studies evaluating TILs in colorectal cancer (CRC) show moderate to fair interobserver agreement using different grading systems.[24-27] A likely contributor to variability in our cohort is the difference in experience among observers, with senior pathologists typically demonstrating more consistent scoring.
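For readers less familiar with the statistic, the pairwise agreement values quoted above can be illustrated with a minimal sketch of unweighted Cohen's kappa. The category labels and the ten example scores are hypothetical and are not data from this study. The function name `cohen_kappa` is likewise an illustrative choice, not the software used by the authors.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two observers scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both observers must score the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases placed in the same category.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each observer's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both observers used one identical category throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical TIL categories assigned by two observers to ten cases.
observer_1 = ["low", "low", "int", "high", "int", "low", "high", "int", "low", "high"]
observer_2 = ["low", "int", "int", "high", "low", "low", "int", "int", "low", "high"]

kappa = cohen_kappa(observer_1, observer_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this toy example the observers agree on 7 of 10 cases, yet kappa is noticeably lower than 0.7 because part of that agreement is expected by chance.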
TIL assessment on TMAs yielded higher consistency among observers. Cohen’s kappa values ranged from 0.489 to 0.539, suggesting moderate agreement. This trend aligns with previous findings by Kilmartin et al., where reproducibility improved when using limited and defined areas for scoring, akin to the constrained fields in TMAs.[28] However, other reports differ. For example, Reznitsky et al. evaluated the reproducibility of TIL scoring in HER2-positive BC using both WSs and TMAs, reporting an ICC of 0.93 for WSs versus 0.73 for TMAs, and kappa values of 0.75 and 0.33, respectively. These results showed significantly better reproducibility for WSs than for TMAs, contrary to our findings in TNBC. This discrepancy may stem from differences in tumor biology, pathologist experience, or scoring intervals. Furthermore, Reznitsky et al. highlighted the impact of tumor heterogeneity and emphasized that TMAs must represent the full tumor architecture and microenvironment to be reliable. Their conclusion that WSs offer acceptable reproducibility while TMAs do not underscores the importance of contextualizing scoring platforms within tumor subtype and study design.[29]
The ICC in our study indicated moderate to good reliability. For WS, the ICC was 0.6035, and for TMAs, it was 0.686, indicating higher reliability for TMAs. These findings align with other studies in BC and CRC, where ICC values for TMA assessments generally fall within similar ranges. For instance, in a study by Koo and Li, ICC values for scoring TILs in BC were similarly moderate, reflecting the difficulty of accurately assessing TIL density across WSs.[30] The higher ICC for TMAs in our study supports the hypothesis that TMAs provide a more standardized and reliable method for evaluating TILs due to their reduced complexity and more controlled sampling process.
Likewise, various other studies report higher ICCs for TMAs due to their constrained and uniform sampling compared to WS, where variability in tumor heterogeneity can reduce consistency.[31,32]
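To make the ICC comparison concrete, the following is a minimal sketch of one common ICC form. The text does not state which form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater, after Shrout and Fleiss) is an assumption. The stromal TIL percentages are hypothetical and the function name `icc_2_1` is illustrative.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the continuous score given to case i by observer j.
    """
    n = len(scores)        # number of cases
    k = len(scores[0])     # number of observers
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 cases, each scored by the same >= 2 observers")
    grand = sum(sum(row) for row in scores) / (n * k)
    case_means = [sum(row) / k for row in scores]
    obs_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares (cases, observers, residual).
    ss_cases = k * sum((m - grand) ** 2 for m in case_means)
    ss_obs = n * sum((m - grand) ** 2 for m in obs_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_cases - ss_obs
    ms_cases = ss_cases / (n - 1)
    ms_obs = ss_obs / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_cases - ms_err) / (
        ms_cases + (k - 1) * ms_err + k * (ms_obs - ms_err) / n
    )


# Hypothetical stromal TIL percentages: 5 cases (rows) x 3 observers (columns).
til_percent = [
    [10, 15, 10],
    [40, 30, 50],
    [5, 5, 10],
    [60, 70, 55],
    [20, 30, 25],
]
icc = icc_2_1(til_percent)
print(f"ICC(2,1): {icc:.3f}")
```

Because this form penalizes systematic offsets between observers as well as random disagreement, it can be lower than consistency-type ICCs computed on the same data.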
The kappa values of TMAs (0.531, 0.489, and 0.539) represent moderate agreement, which aligns with reports suggesting better reproducibility for smaller, more focused tissue regions like TMAs.[33]
Our Fleiss’ kappa of 0.219 for WS and 0.394 for TMAs also aligns with the findings that TMAs typically provide more consistent results across observers.
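Fleiss' kappa extends chance-corrected agreement from two observers to several. A minimal sketch of the calculation follows, assuming every case is scored by the same number of observers. The three-observer ratings and category names are hypothetical, and the function names are illustrative.

```python
def fleiss_kappa_from_counts(counts):
    """Fleiss' kappa from a cases x categories table of rating counts.

    counts[i][j] is the number of observers who placed case i in category j;
    every row must sum to the same number of observers.
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each case needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Overall proportion of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_cases * n_raters) for j in range(n_cats)]
    # Per-case agreement: share of observer pairs that agree on that case.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # only one category ever used: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa from per-case lists of category labels."""
    counts = [[case.count(c) for c in categories] for case in ratings]
    return fleiss_kappa_from_counts(counts)


# Hypothetical TIL categories from three observers for six cases.
categories = ["low", "intermediate", "high"]
ratings = [
    ["low", "low", "low"],
    ["low", "intermediate", "low"],
    ["intermediate", "high", "intermediate"],
    ["high", "high", "high"],
    ["low", "intermediate", "high"],
    ["intermediate", "intermediate", "low"],
]
kappa_all = fleiss_kappa(ratings, categories)
print(f"Fleiss' kappa: {kappa_all:.3f}")
```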
Studies of immunohistochemical scoring on TMAs showed better results, especially among younger, less experienced pathologists, perhaps because of the reduced complexity of TMA cores.[34] The limited tissue available on a TMA necessitates a simpler analysis and offers less intricate images for evaluation, thus enabling pathologists to obtain very similar results.
Our study highlights a notable discrepancy between TIL assessments in WSs and TMAs, with ICC for matched samples at 0.64 (95% CI: 0.55–0.71). This is comparable to results from other studies, which reported similar ICC values and emphasized the need for caution when extrapolating TIL data from TMAs to whole-section equivalents.[29]
In addition, another study comparing TMA to WS showed that the concordance rate between TMA and WS was only 0.26.[33] This is mainly because TMAs are limited in mirroring tumor heterogeneity. The larger surface and potential tumor heterogeneity present in WS assessment may account for this variability, as observers may interpret TIL density differently across regions of different tumor composition.[28]
Consistent with findings in the literature, our results reinforce the notion that TIL evaluation in TMAs is less concordant and may not fully align with data obtained from WSs. Notably, studies have shown that reducing categories (e.g., dichotomizing TIL scores into “high” and “low”) can improve interobserver agreement in TMAs, but this simplification may compromise clinical nuance.[23,35,36]
The reduced reproducibility in TMAs can be attributed to their limited sampling area, which may not adequately represent the full spectrum of TIL distribution across the tumor. Studies suggest that full-section evaluation is more reliable for capturing the complex interactions between tumor and immune cells, making it the preferred method in both research and clinical settings.[37-39]
Moreover, while immunohistochemical staining methods in TMA-based studies have demonstrated higher concordance,[34] the use of H&E staining in TMAs remains challenging, with lower ICCs and kappa values observed across multiple studies.[26,40]
Interclass agreement
TMA assessments yielded moderate agreement, with Cohen’s kappa values between 0.489 and 0.539. As mentioned above, this is consistent with earlier studies suggesting that TMA offers better reproducibility due to its reduced tissue heterogeneity and standardized tissue sampling.[41,42] TMAs allow for the evaluation of a limited, predefined portion of the tumor, reducing the subjectivity of selecting tumor regions to score.[28,43] These results are likely explained by the fact that tissue cores from TMAs are more likely to be homogeneous.
In line with this, the interclass agreement, as measured by Fleiss’ kappa, was fair for both WS (0.219) and TMAs (0.394), with the higher value for TMAs again pointing to more reliable TMA-based assessments. These results are similar to findings in the literature, where TMA has been shown to improve reproducibility compared to whole-section slide evaluations, although substantial interobserver agreement remains challenging across most tumor types.[27,44]
In BC, particularly TNBC, agreement was better for continuous measures of stromal TILs (ICC: 0.634, 95% CI: 0.539–0.735), but binary cut points resulted in poorer reliability. These findings are consistent with other studies reporting substantial interobserver variability, with ICCs ranging widely from 0.376 to 0.947. The variability observed emphasizes the need for standardization in TIL quantification to improve reproducibility and clinical utility.[28,44] In other reports, reducing complexity by dichotomizing TIL scores (e.g., “high” vs. “low”) improved agreement, suggesting potential pathways to enhance reliability.[36,45]
The discrepancies between WSs and TMAs, and between different staining methods, are critical considerations. WS provides a more comprehensive assessment of the tumor immune contexture, while TMAs, despite their efficiency, may fail to capture intratumoral heterogeneity.[28]
This study has several notable limitations. First, the sample size of 76 evaluable cases, although sufficient for preliminary analysis, may limit the generalizability of our results. Larger, multicenter studies are needed to validate these findings across broader populations and institutions. Second, observer training and variability in experience may have influenced scoring reproducibility. Structured calibration sessions or digital tools may help mitigate this limitation in future studies.