Prompt engineering shapes diagnostic accuracy and explanation quality of LLM in oral lesion diagnosis: a prospective, expert-blinded benchmark study

