Evaluating Large Language Models for Transparent Quality-of-Care Measurement in Children with ADHD

ABSTRACT

Importance Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement.

Objective To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review.

Design, Setting, and Participants This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and to rank model explanations.

Exposures Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence.

Main Outcomes and Measures Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework).
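The three performance measures named above follow from confusion-matrix counts against the chart-review reference standard. The following is a minimal sketch of those standard definitions, not the authors' analysis code; the `Counts` class, the function names, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts relative to expert chart review (hypothetical container)."""
    tp: int  # model flags a PTBM recommendation and the reviewer agrees
    fp: int  # model flags a PTBM recommendation the reviewer did not find
    fn: int  # model misses a PTBM recommendation the reviewer found


def sensitivity(c: Counts) -> float:
    # Share of reviewer-confirmed recommendations that the model found.
    return c.tp / (c.tp + c.fn)


def ppv(c: Counts) -> float:
    # Share of model-flagged recommendations that the reviewer confirmed.
    return c.tp / (c.tp + c.fp)


def f1_score(c: Counts) -> float:
    # Harmonic mean of sensitivity and PPV, equal to 2TP / (2TP + FP + FN).
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


# Hypothetical counts for illustration only; these are not study data.
example = Counts(tp=40, fp=5, fn=10)
print(sensitivity(example), ppv(example), f1_score(example))
```

Because the F1-score is a harmonic mean, it is pulled toward the lower of sensitivity and PPV, so a model cannot score well by maximizing only one of the two.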

Results All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but the lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit.
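As a quick arithmetic check, the reported F1-scores can be recomputed from the reported sensitivity and PPV, and the prevalence figure from its numerator and denominator. This is a sketch that uses only the rounded values quoted in the abstract, so it can confirm agreement to two decimals but cannot reproduce the study's exact figures; the helper name `f1_from` is hypothetical.

```python
def f1_from(sens: float, ppv: float) -> float:
    # F1 is the harmonic mean of sensitivity and PPV.
    return 2 * sens * ppv / (sens + ppv)


# (sensitivity, PPV, reported F1) as quoted in the abstract.
reported = {
    "Claude-3.5": (0.89, 0.95, 0.92),
    "LLaMA-3.3-70B": (0.91, 0.89, 0.90),
    "GPT-4o": (0.82, 0.97, 0.89),
}

for name, (sens, ppv, f1) in reported.items():
    recomputed = round(f1_from(sens, ppv), 2)
    # Compare at two decimals because the inputs are themselves rounded.
    print(name, recomputed, abs(recomputed - f1) < 0.005)

# Share of patients with a documented PTBM recommendation, in percent.
share = round(100 * 143 / 542, 1)
print(share)
```

All three recomputed F1-scores match the reported values at two decimals, and 143/542 rounds to the stated 26.4%.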

Conclusions and Relevance LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.

Competing Interest Statement

Dr. Bannett reported receiving personal fees from the National Opinion Research Center for consulting outside the submitted work. The authors have no other conflicts of interest to disclose.

Funding Statement

This work was supported by the Stanford Maternal and Child Health Research Institute and by grant number K23MH128455 from the National Institute of Mental Health of the National Institutes of Health (Dr. Bannett).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Institutional Review Board of Stanford University School of Medicine gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

Role of Funder/Sponsor: The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funders did not have any part in design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Data Availability Statement: The datasets generated and analyzed in the current study contain protected patient health information and are therefore not publicly available.

Medrxiv - Pediatrics
