Highlights
• Two expert groups show strong agreement on proximal femur fracture classification of radiographs using AO/OTA criteria.
• The validated ground truth supports the reliable use of a large femur radiograph database for AI training.
• The majority of interrater disagreements occurred between 31A1 versus 31A2 classifications and between no fracture versus 31B classifications.
• Findings establish the quality of annotation needed for safe and clinically relevant AI decision-support tools.
Abstract
Purpose
This pilot study aims to validate the "ground truth" accuracy and consistency of proximal femur fracture classification using a large radiographic image database. The project, a collaboration between expert groups from the University of Turin and the AO Foundation, seeks to ensure that expert consensus-based annotations are reliable for future artificial intelligence (AI) model development.
Methods
A cross-sectional diagnostic accuracy study was conducted using a randomly selected subset of 300 anteroposterior pelvic radiographs from a single-center image repository created at the University of Turin within the AO Innovation Translation Center framework. Fracture classification annotations were independently provided by the local clinical expert group (LC-EG) and by an independent AO expert group of surgeons (AO-EG). To assess interrater reliability between the two groups, Cohen’s kappa coefficient was calculated for categorical agreement on the presence of a fracture and AO/OTA classification.
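As an illustration of the agreement statistics named in the Methods, the following is a minimal sketch of how percentage agreement and Cohen’s kappa can be computed for two sets of categorical labels. It is not the authors’ analysis code: the function name `cohens_kappa`, the label strings, and the ten-image toy data are hypothetical and chosen only to show the calculation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two raters.

    Hypothetical helper for illustration; labels are arbitrary hashable
    category names, one per image, in the same image order for both raters.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both raters must label the same non-empty image set.")

    n = len(labels_a)

    # Observed agreement: share of images given the same category by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement equals 1.")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy data (invented, NOT from the study): three categories as in the abstract.
group_a = ["none", "none", "31A", "31A", "31A", "31B", "31B", "31B", "none", "31A"]
group_b = ["none", "31B", "31A", "31A", "31A", "31B", "31B", "none", "none", "31A"]

observed, kappa = cohens_kappa(group_a, group_b)
print(f"Percentage agreement: {observed:.2%}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa discounts the agreement that two raters would reach by chance given how often each uses every category, which is why it is lower than the raw percentage agreement on the same labels.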
Results
The comparison of annotations from LC-EG and AO-EG yielded a Cohen’s kappa of 0.81 (95 % confidence interval: 0.75–0.87) and a percentage agreement of 87.67 % (95 % confidence interval: 87.63–87.70) for the classification of proximal femur fractures into three defined categories: no fracture, fracture type 31A, and fracture type 31B. These results confirm a high level of consistency between the two expert groups in annotating the image dataset.
Conclusion
The observed interrater reliability between the LC-EG and AO-EG supports the credibility of the reference annotations, establishing a validated ground truth for proximal femur fractures. This evidence justifies using the radiographic image database as a benchmark for future studies and as a foundation for transparent, reproducible AI development and evaluation, thereby facilitating safer integration of decision support tools into orthopedic trauma workflows.
Keywords
Hip fractures
Proximal femur fractures
Radiography
Diagnostic imaging
Artificial Intelligence
Fracture classification
Interrater reliability
Observer variation
© 2026 The Authors. Published by Elsevier Ltd.