Background Large language models (LLMs) perform well on standardized medical exam questions, but their reliability for complex hematology decision making is uncertain. We compared four general-purpose LLMs (GPT-4o, GPT-o3, Claude Sonnet 4, and DeepSeek-V3) with a Virtual MDS Panel (VMP), a coordinated multi-agent AI system in which domain-specialized, rule-bound software agents (WHO/ICC guidelines; IPSS-R/IPSS-M; NCCN) collaborate to generate tumor-board-level recommendations.
Methods Each model generated diagnostic, prognostic, and treatment recommendations for 30 myelodysplastic syndrome cases. Nine international MDS experts from five institutions, blinded to model identity, completed 3,000 structured ratings using 5-point Likert scales for diagnosis, prognosis, and therapy, and classified errors by severity.
Results General-purpose LLMs achieved modest expert ratings (overall mean scores: 3.7 for GPT-o3, 3.2 for GPT-4o, 3.1 for DeepSeek, and 3.0 for Claude) and contained major factual errors in at least 24% of responses. The VMP increased the proportion of outputs rated 4 or higher to 87% (vs. 34-66% for general-purpose models), improved mean scores to 4.3 overall (4.3 for diagnosis, 4.4 for prognosis, and 4.1 for therapy), and reduced major errors to 8%.
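The summary statistics reported in Results (mean Likert score, share of outputs rated 4 or higher, and major-error rate) can be computed along the following lines. This is a minimal sketch, not the study's analytic code: the function names, the threshold parameter, and the sample values are hypothetical placeholders chosen for illustration and are not study data.

```python
from statistics import mean


def summarize_ratings(ratings, threshold=4):
    """Summarize 5-point Likert ratings (hypothetical helper).

    Returns the mean score and the share of ratings at or above `threshold`.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")
    return {
        "mean": mean(ratings),
        "share_at_or_above": sum(r >= threshold for r in ratings) / len(ratings),
    }


def major_error_rate(has_major_error):
    """Fraction of responses flagged as containing a major error."""
    if not has_major_error:
        raise ValueError("has_major_error must not be empty")
    return sum(bool(flag) for flag in has_major_error) / len(has_major_error)


# Placeholder inputs for illustration only (NOT data from the study).
example_ratings = [5, 4, 4, 3, 5]
example_flags = [False, False, True, False]

summary = summarize_ratings(example_ratings)
error_rate = major_error_rate(example_flags)
```

Applying the same helpers separately to the diagnosis, prognosis, and therapy ratings would yield per-domain figures of the kind quoted above.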
Conclusions In this blinded evaluation of 30 complex MDS cases, general-purpose LLMs produced clinically important errors at rates that raise safety concerns for autonomous hematology decision making. The VMP, a rule-bound, multi-agent architecture, approached expert-level accuracy, supporting its potential future role as an effective decision-support tool for MDS.
Competing Interest Statement COI notes: BMA: Receives royalty payments related to venetoclax from the Walter and Eliza Hall Institute of Medical Research (Melbourne, Australia). RB: Research funding from Bristol Myers Squibb and Taiho; advisory board honoraria from Bristol Myers Squibb, Taiho, AbbVie, and Takeda; has served on trial steering committees for Takeda (formerly Keros) and Bristol Myers Squibb; and is a member of the Scientific Advisory Board of MDS-F. AMB: Reports consulting fees from Novartis, AbbVie, Agios, Bristol Myers Squibb, Geron, i-Mab, Keros Therapeutics, Lava Therapeutics, Rigel, Sanofi, Syndax, Servier, and Takeda. AC: No conflicts of interest to disclose. AED: Has served as a consultant and/or in advisory roles for Bristol Myers Squibb, Novartis, Geron, Taiho, Keros, Agios, Takeda, UpToDate, CVS, and DynaMed. JTE: Receives honoraria from Taiho, GSK, and Novartis. TH: No conflicts of interest to disclose. TK: No conflicts of interest to disclose. AN: No conflicts of interest to disclose. MGR: No conflicts of interest to disclose. GR: No conflicts of interest to disclose. VS: Has served on advisory boards for Ascentage, Bristol Myers Squibb, Geron, GSK, Jazz, Novartis, Servier, Pfizer, Alexion, Faron, and Takeda. MAS: Has served on advisory boards for Bristol Myers Squibb, Rigel, Geron, and Agios. MS: Served on advisory boards for Novartis, Kymera, Sierra Oncology, GSK, Rigel, Bristol Myers Squibb, Sobi, Syndax, Kura, and Servier; consulted for Boston Consulting Group, GLG, and The Dedham Group; participated in CME activities for Novartis, Curis Oncology, Haymarket Media, and Clinical Care Options; and is a member of the Medical Safety Monitoring Board for Takeda Pharmaceuticals. MDP: No conflicts of interest to disclose. DMS: Has served in consulting and/or advisory roles for Bristol Myers Squibb, Daiichi Sankyo, Geron, MorphoSys, and Syndax, and has participated in speakers bureaus for Bristol Myers Squibb, GSK, and Servier. SV: No conflicts of interest to disclose. JW: No conflicts of interest to disclose. AMZ: Has participated in advisory boards, consulted, served on clinical trial committees, and/or received honoraria from AbbVie, Akesobio, Agios, Amgen, Astellas, BioCryst, Beigene, Boehringer Ingelheim, Celgene/Bristol Myers Squibb, Chiesi/Cornerstone Biopharma, Daiichi Sankyo, Dr. Reddy's, Epizyme, Faron, FibroGen, GSK, GlycoMimetics, Genentech, Gilead, Geron, Janssen, Jasper, Karyopharm, Kyowa Kirin, Keros, Kura, Novartis, Notable, Orum, Otsuka, Pfizer, Regeneron, Rigel, Seattle Genetics, Shattuck Labs, Schroedinger, Syros, Syndax, Servier, Takeda, Treadwell, Taiho, Vincerx, and Zentalis.
Funding Statement This study did not receive any funding.
Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability Data Sharing Statement: Deidentified study data and analytic code will be shared upon reasonable request to the corresponding author; please contact dswoboda@tgh.org.