Background Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents’ scholarly projects, and compared its performance with that of expert human evaluators.
Methods We evaluated whether AI-generated feedback achieves quality comparable to that of expert feedback. The tool ingests heterogeneous resident submissions (PDFs, scans, photographs) via optical character recognition (OCR) and produces section-by-section feedback aligned with programme rubrics. In a three-phase study, we evaluated 240 feedback reports (Short, Question and Timeline, and Final; n = 80 per phase). Within each phase, 40 reports were AI-generated and 40 were produced by research experts, spanning four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters scored each report with a 25-item survey covering five constructs: understanding & reasoning, trust & confidence, quality of information, expression style & persona, and safety & harm.
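For context, the sketch below illustrates the general shape of such an OCR-to-LLM feedback pipeline. It assumes a local Ollama server hosting llama3.1, the pdf2image/pytesseract libraries (with Poppler and Tesseract installed), and a placeholder rubric and file name; it is an illustration of the approach, not the authors' implementation.

```python
# Minimal sketch of an OCR -> open-weight LLM feedback pipeline (illustrative only).
# Assumptions: a local Ollama server at http://localhost:11434 serving "llama3.1",
# pdf2image/pytesseract available, and a hypothetical rubric and input file.
import requests
from pdf2image import convert_from_path
import pytesseract

RUBRIC = "1. Research question  2. Methods  3. Timeline  4. Feasibility"  # placeholder rubric

def extract_text(pdf_path: str) -> str:
    """OCR every page of a scanned or native PDF into plain text."""
    pages = convert_from_path(pdf_path)                     # render each page as an image
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

def generate_feedback(submission_text: str) -> str:
    """Request section-by-section, rubric-aligned formative feedback from the local model."""
    prompt = (
        "You are evaluating a Family Medicine resident scholarly project.\n"
        f"Rubric:\n{RUBRIC}\n\n"
        "Give formative, section-by-section feedback on the submission below.\n\n"
        f"Submission:\n{submission_text}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    text = extract_text("resident_project.pdf")             # illustrative file name
    print(generate_feedback(text))
```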
Results Survey reliability was acceptable to high across phases (Cronbach’s α = .71–.98). Human feedback generally outscored AI. In short reports, humans led on quality (mean ± SD: 4.14 ± 0.57 vs 3.09 ± 1.05) and trust (3.96 ± 0.71 vs 2.78 ± 1.15). In final reports, differences became smaller for quality (4.09 ± 0.65 vs 3.49 ± 0.68) and persona (4.16 ± 0.40 vs 3.91 ± 0.50), while AI was preferred for safety (4.50 ± 0.60 vs 4.36 ± 0.56). Performance varied by project type: in survey-based final reports the AI led on quality (4.28 ± 0.50 vs 3.98 ± 0.44) and safety (4.58 ± 0.40 vs 4.24 ± 0.67), whereas in quality-improvement short reports humans were markedly superior in reasoning (4.27 ± 0.68 vs 2.33 ± 1.00).
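For readers unfamiliar with the reliability metric, the sketch below shows the standard Cronbach’s alpha computation applied to synthetic rater data (one row per rated report, one column per survey item); it is not the study’s analysis code.

```python
# Illustrative Cronbach's alpha for a 25-item rater survey (synthetic data, not study data).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array, rows = rated reports, columns = survey items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item across reports
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the per-report total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
demo = rng.integers(1, 6, size=(40, 25))          # 40 reports x 25 Likert items (synthetic)
print(round(cronbach_alpha(demo), 2))
```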
Conclusions An open-weight LLM with curated prompts can generate rubric-aligned feedback at scale that approaches the quality of expert human feedback. While expert feedback remained superior overall, AI surpassed humans in selected contexts and in safety assessments. Performance of the tool is expected to improve as newer, more capable open-weight models are released. Our code and system prompts are open source.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: n/a
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This project falls under TCPS2 Article 2.5 and, as such, did not require Research Ethics Board (REB) review, as confirmed by the University of Ottawa REB during an initial review for exemption and reported on August 29, 2024.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability: Code will be made available following publication.