Prompt engineering shapes diagnostic accuracy and explanation quality of LLM in oral lesion diagnosis: a prospective, expert-blinded benchmark study

Hassanein FEA, Hussein RR, Elgarhy MR, Maher SM, Hassen A, Heidar S, Ezz El Arab M, Edress A, Abou-Bakr A, Mekhemar M. Artificial intelligence versus human dental expertise in diagnosing periapical pathosis on periapical radiographs: a multicenter study. Bioengineering. 2026;13(2):232.

Hassanein F. Evaluating multimodal large language models for clinical diagnosis of oral lesions: a biomedical informatics perspective. 2025.

Almohareb T, Abou-Bakr A, Hassanein FEA, Ahmed Y, Hamza M, Aboheikal M, Nagi N. Clinical and patient comparison of AI and expert digital smile design: a prospective paired study. Dent J. 2026;14(3):166.

Ras AA, Kheir El Din NH, Talaat AM, Hussein RR, Khalil E. Mucocutaneous changes in end-stage renal disease under regular hemodialysis—a cross-sectional study. Indian J Dent Res. 2023;34(2):130–5.

Ghalwash D, Ammar A, Abou-Bakr A, Diab AH, El-Gawish A. Validation of salivary proteomic biomarkers for early detection of oral cancer in the Egyptian population. Future Sci OA. 2025;11(1):2432222.

Ghalwash D, El-Gawish A, Ammar A, Hamdy A, Ghanem R, Ghanem M, et al. Epidemiology of Sjogren’s syndrome in a sample of the Egyptian population: a cross-sectional study. J Int Med Res. 2024;52(10):3000605241289292.

Abou-Bakr A, Hassanein FEA. Comment on “Diagnostic Performance of Multimodal Large Language Models in the Analysis of Oral Pathology”. Oral Dis. 2026. https://doi.org/10.1111/odi.70216.

Schwendicke F, Samek W, Krois J. Artificial intelligence in dentistry: chances and challenges. J Dent Res. 2020;99(7):769–74.

Abou-Bakr A, Eissa AA, Alshikh B, Ahmed Y, AbuShady EF, Tassoker M, et al. Comparative diagnostic accuracy of ChatGPT models in salivary gland disease: a multimodal vignette-based evaluation. Eur Arch Otorhinolaryngol. 2025. https://doi.org/10.1007/s00405-025-09925-5.

Robaian A, Hassanein FEA, Hassan MT, Alqahtani AS, Abou-Bakr A. A multimodal large language model framework for clinical subtyping and malignant transformation risk prediction in oral lichen planus: a paired comparison with expert clinicians. Int Dent J. 2026;76(1):109357.

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al. Training language models to follow instructions with human feedback. 2022. arXiv:2203.02155.

McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94.

Hassanein FEA, Hussein RR, Almalahy HG, Sarhan S, Ahmed Y, Abou-Bakr A. Vision-based diagnostic gain of ChatGPT-5 and Gemini 2.5 Pro compared with human experts in oral lesion assessment. Sci Rep. 2025;15(1):43279.

AlFarabi Ali S, AlDehlawi H, Jazzar A, Ashi H, Esam Abuzinadah N, AlOtaibi M, et al. The diagnostic performance of large language models and oral medicine consultants for identifying oral lesions in text-based clinical scenarios: prospective comparative study. JMIR AI. 2025;4:e70566.

Grinberg N, Whitefield S, Kleinman S, Ianculovici C, Wasserman G, Peleg O. Artificial intelligence differential diagnosis of soft-tissue oral lesions using ChatGPT. Oral Surg Oral Med Oral Pathol Oral Radiol. 2025;139(2):e54–5.

Grinberg N, Whitefield S, Kleinman S, Ianculovici C, Wasserman G, Peleg O. Assessing the performance of an artificial intelligence based chatbot in the differential diagnosis of oral mucosal lesions: clinical validation study. Clin Oral Investig. 2025;29(4):188.

Abou-Bakr A, El Barbary A, Hassanein FEA. ChatGPT-5 vs oral medicine experts for rank-based differential diagnosis of oral lesions: a prospective, biopsy-validated comparison. Odontology. 2025.

Hassanein FEA, Hussein RR, Ahmed Y, El-Guindy J, Ahmed DE, Abou-Bakr A. Calibration of AI large language models with human subject matter experts for grading of clinical short-answer responses in dental education. BMC Oral Health. 2026;26(1):286.

Hassanein FEA, Ahmed Y, Maher S, Barbary AE, Abou-Bakr A. Prompt-dependent performance of multimodal AI model in oral diagnosis: a comprehensive analysis of accuracy, narrative quality, calibration, and latency versus human experts. Sci Rep. 2025;15(1):37932.

Hirosawa T, Kawamura R, Harada Y, Mizuta K, Tokumasu K, Kaji Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11:e48808.

Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi EH, Le QV, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems; 2022; New Orleans, LA, USA. Curran Associates Inc.

Vaira LA, Lechien JR, Maniaci A, De Vito A, Mayo-Yáñez M, Troise S, et al. Diagnostic performance of ChatGPT-4o in analyzing oral mucosal lesions: a comparative study with experts. Medicina (Kaunas). 2025;61(8):1379.

Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med Inform. 2024;12:e55318.

Zhou K, Liu Z, Qiao Y, Xiang T, Loy CC. Domain generalization: a survey. 2021.

Reise S. The rediscovery of bifactor measurement models (vol 47, pg 667, 2012). Multivar Behav Res. 2013;48:461–461.

Bishop C. Pattern recognition and machine learning. Vol 16. 2006. pp. 140–155.

Sounderajah V, Ashrafian H, Aggarwal R, De Fauw J, Denniston AK, Greaves F, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI steering group. Nat Med. 2020;26(6):807–8.

Glick M, Greenberg MS, Lockhart PB, Challacombe SJ. Burket’s Oral Medicine. Wiley; 2021.

Neville BW, Damm DD, Allen CM, Chi AC. Oral and maxillofacial pathology. Elsevier Health Sciences; 2015.

Hendrycks D, Dietterich T. Benchmarking neural network robustness to common corruptions and perturbations. 2019.

Zhou K, Liu Z, Qiao Y, Xiang T, Loy CC. Domain generalization: a survey. IEEE Trans Pattern Anal Mach Intell. 2023;45(4):4396–415.

Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.

Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. 2022;35:24824–37.

Jin Q, Chen F, Zhou Y, Xu Z, Cheung JM, Chen R, et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. NPJ Digit Med. 2024;7(1):190.

Chen P, Huang Z, Deng Z, Li T, Su Y, Wang H, Ye J, Qiao Y, He J. Enhancing medical task performance in GPT-4V: a comprehensive study on prompt engineering strategies. 2023. arXiv:2312.04344.

Vaira LA, Lechien JR, Abbate V, Gabriele G, Frosolini A, De Vito A, et al. Enhancing AI chatbot responses in health care: the SMART prompt structure in head and neck surgery. OTO Open. 2025;9(1):e70075.

Renze M, Guven E. Self-reflection in LLM agents: effects on problem-solving performance. 2024. arXiv:2405.06682.

Alam L, Mueller ST. Examining physicians’ explanatory reasoning in re-diagnosis scenarios for improving AI diagnostic systems. J Cogn Eng Decis Mak. 2022;16(2):63–78.

Nishida N, Yamakawa M, Shiina T, Mekada Y, Nishida M, Sakamoto N, et al. Artificial intelligence (AI) models for the ultrasonographic diagnosis of liver tumors and comparison of diagnostic accuracies between AI and human experts. J Gastroenterol. 2022;57(4):309–21.

Chan PZ, Ramli MAIB, Chew HSJ. Diagnostic test accuracy of artificial intelligence-assisted detection of acute coronary syndrome: a systematic review and meta-analysis. Comput Biol Med. 2023;167:107636.

Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J, Hooft L, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open. 2021;11(6):e047709.

ODONTOLOGY