Large language model scoring of medical student reflection essays: Accuracy and reproducibility of prompt-model variations

Abstract

Introduction We aimed to evaluate large language models (LLMs) for scoring medical student essays and to compare prompting techniques and models.

Methods OpenAI GPT scored 51 medical student reflection essays (15 real, 36 fabricated) using a previously reported 6-point rubric (April-May 2025). We compared 29 prompt-model conditions by systematically varying the LLM prompts (including the persona, scoring rubric, few-shot learning [exemplars], chain-of-thought reasoning, and temperature), fine-tuning, and model (including GPT-4.1, GPT-4.1-mini, GPT-o4-mini, and GPT-4-Turbo). Outcomes were accuracy (compared with human raters, measured using single-score intraclass correlation coefficient [ICC] and mean absolute difference [MAD; zero indicates perfect agreement]), within-condition reproducibility, and cost.
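For readers unfamiliar with the two accuracy measures, the following is a minimal sketch of how MAD and a single-score ICC could be computed from paired LLM and human scores. The abstract does not state which ICC form was used; the two-way random-effects, absolute-agreement form, often written ICC(2,1), is an assumption here, and the function names are hypothetical.

```python
import numpy as np


def mean_absolute_difference(llm_scores, human_scores):
    """Mean absolute difference between paired scores (0 = perfect agreement)."""
    llm = np.asarray(llm_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    return float(np.mean(np.abs(llm - human)))


def icc_single_absolute(ratings):
    """Single-score ICC, two-way random effects, absolute agreement (ICC(2,1)).

    ratings: 2-D array, one row per essay and one column per rater.
    ASSUMPTION: this ICC form is chosen for illustration only; the study
    may have used a different single-score variant.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape  # n essays, k raters
    grand_mean = x.mean()

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)  # between essays
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)  # between raters
    ss_error = np.sum((x - grand_mean) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return float(
        (ms_rows - ms_error)
        / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
    )
```

In use, each row of `ratings` would hold one essay's human score and LLM score (two columns), and `mean_absolute_difference` would take the same two columns as separate sequences.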

Results Across all conditions, it took mean (SD) 3.73 (3.12) seconds to score 1 essay. The cost to score 100 essays was USD $0.04 for GPT-4.1-mini, $0.21 for GPT-4.1, $0.57 for GPT-4.1 with 3 exemplars, and $2.00 for fine-tuned GPT-4.1. When the one-time cost of fine-tuning was amortized across 10,000 essays, the cost for fine-tuned GPT-4.1 was $0.20 per 100. Accuracy was “almost perfect” (ICC >0.80) for 28/29 conditions (97%). Fine-tuned models were more accurate than non-fine-tuned models (MAD difference –0.24 [95% CI, –0.34, –0.14]). Conditions with exemplars were more accurate than those without (MAD difference –0.44 [CI, –0.57, –0.31]). Accuracy progressively decreased as 6, 3, 1, and 0 rubric levels were explicitly defined in the prompt (P<.001). Contrary to hypotheses, accuracies for chain-of-thought prompts and variations in temperature and persona were not significantly different from the baseline prompt. Reproducibility ICC was >0.80 for 28/29 conditions (97%).
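The amortization idea behind the fine-tuning cost comparison can be sketched as follows. This is an illustrative calculation only: the function name is hypothetical, and the dollar values in the usage lines are made-up placeholders, not the study's fine-tuning or per-essay costs (which the abstract does not itemize).

```python
def cost_per_100_essays(one_time_cost_usd, marginal_cost_per_essay_usd, n_essays):
    """Cost per 100 essays when a one-time cost is spread over n_essays.

    one_time_cost_usd: fixed cost paid once (e.g., a fine-tuning run).
    marginal_cost_per_essay_usd: inference cost for scoring one essay.
    n_essays: number of essays over which the fixed cost is amortized.
    """
    if n_essays <= 0:
        raise ValueError("n_essays must be positive")
    total_cost = one_time_cost_usd + marginal_cost_per_essay_usd * n_essays
    return 100 * total_cost / n_essays


# HYPOTHETICAL inputs, chosen only to show the trend: the per-100 cost
# falls toward 100 * marginal cost as the essay volume grows.
small_volume = cost_per_100_essays(5.0, 0.002, 500)
large_volume = cost_per_100_essays(5.0, 0.002, 50_000)
```

With a fixed one-time cost, the amortized figure is dominated by that cost at small volumes and by the marginal per-essay cost at large volumes, which is why a fine-tuned model can look expensive for a single cohort yet inexpensive at scale.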

Discussion Automated LLM essay scoring demonstrated near-perfect accuracy and reproducibility for most prompt-model conditions. Fine-tuned models and prompts with exemplars had higher accuracy but higher cost. Fine-tuned models had lower per-essay costs for larger essay volumes. For smaller volumes, non-fine-tuned GPT-4.1 provided excellent results at moderate cost. GPT-4.1-mini provided very good results at low cost.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Medrxiv - Medical Education
