Pilot validation study for a large image database of proximal femur fracture anteroposterior radiographs: Searching for the ground truth

Injury, Volume 57, Issue 3, March 2026, 113056

Highlights

Two expert groups show strong agreement on proximal femur fracture classification of radiographs using AO/OTA criteria.

The validated ground truth supports the reliable use of a large femur radiograph database for AI training.

The majority of interrater disagreements occurred between the 31A1 and 31A2 classifications and between the no-fracture and 31B classifications.

Findings establish the quality of annotation needed for safe and clinically relevant AI decision-support tools.

Abstract

Purpose

This pilot study aims to validate the "ground truth" accuracy and consistency of proximal femur fracture classification using a large radiographic image database. The project, a collaboration between expert groups from the University of Turin and the AO Foundation, seeks to ensure that expert consensus-based annotations are reliable for future artificial intelligence (AI) model development.

Methods

A cross-sectional, diagnostic accuracy study was conducted using a randomly selected subset of 300 anteroposterior pelvic radiographs from a single-center image repository created at the University of Turin within the AO Innovation Translation Center framework. Fracture classification annotations were independently provided by the local clinical expert group (LC-EG) and by an independent AO expert group of surgeons (AO-EG). To assess interrater reliability between the two groups, Cohen’s kappa coefficient was calculated for categorical agreement on the presence of a fracture and AO/OTA classification.
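The interrater analysis described above uses unweighted Cohen's kappa, which corrects raw percentage agreement for the agreement expected by chance. A minimal sketch of the computation is below, using hypothetical category labels (no fracture, 31A, 31B) rather than the study's actual annotation data; the `cohen_kappa` function name is illustrative, not taken from the paper.

```python
def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters over categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's marginal frequencies.
    """
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    # Observed agreement: fraction of cases where the raters match.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations for six radiographs (not study data):
lc_eg = ["no fracture", "31A", "31A", "31B", "no fracture", "31A"]
ao_eg = ["no fracture", "31A", "31B", "31B", "no fracture", "31A"]
print(cohen_kappa(lc_eg, ao_eg))  # → 0.75
```

In practice, confidence intervals like those reported in the Results are typically obtained from the kappa's asymptotic standard error or by bootstrapping over cases.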

Results

The comparison of annotations from LC-EG and AO-EG yielded a Cohen’s kappa of 0.81 (95% confidence interval: 0.75–0.87) and a percentage agreement of 87.67% (95% confidence interval: 87.63–87.70) for the classification of proximal femur fractures into three defined categories: no fracture, fracture type 31A, and fracture type 31B. These results confirm a high level of consistency between the two expert groups in annotating the image dataset.

Conclusion

The observed interrater reliability between the LC-EG and AO-EG supports the credibility of the reference annotations, establishing a validated ground truth for proximal femur fractures. This evidence justifies using the radiographic image database as a benchmark for future studies and as a foundation for transparent, reproducible AI development and evaluation, thereby facilitating safer integration of decision support tools into orthopedic trauma workflows.

Keywords

Hip fractures

Proximal femur fractures

Radiography

Diagnostic imaging

Artificial intelligence

Fracture classification

Interrater reliability

Observer variation

© 2026 The Authors. Published by Elsevier Ltd.
