Evaluating the Performance of Seven Large Language Models (ChatGPT-4.5, Gemini, Copilot, Claude, Perplexity, DeepSeek, and Manus) in Answering Healthcare Quality Management Inquiries
Keywords:
AI, artificial intelligence, LLMs, healthcare quality management, education tools, multiple-choice questions

Abstract
Large language models (LLMs) are increasingly used in education, healthcare, and decision support because of their advanced text processing capabilities. This study evaluated the performance of seven LLMs (ChatGPT-4.5, Gemini 2.5 Pro, Copilot, Claude 3.7, Perplexity, DeepSeek, and Manus) in answering multiple-choice questions on healthcare quality management. The assessment included 20 validated questions across four domains: organizational leadership (n = 5), health data analytics (n = 5), performance improvement (n = 5), and patient safety (n = 5). Accuracy rates ranged from 70% to 80%: ChatGPT-4.5, Gemini, and Claude achieved 80%; Perplexity and Manus, 75%; and Copilot and DeepSeek, 70%. All models met or exceeded the predefined accuracy threshold of 70%. Descriptive statistics showed a mean of 15.19 correct responses (SD = 0.83) and 5.00 incorrect responses (SD = 0.85) per model, with a combined average of 12.71 responses (SD = 4.46). A Pearson chi-square test indicated no statistically significant differences in accuracy among the models, χ²(6, N = 140) = 1.321, P = .971. A Monte Carlo simulation based on 10,000 sampled tables confirmed this result (P = .984, with a 95% CI for the simulated P value).
The findings indicated comparable performance across the evaluated models in the context of healthcare quality education. These results support the use of large language models as supplementary tools in this domain, while highlighting the need for further evaluation of performance within specific content domains and of applicability to real-world professional training contexts.
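For readers who want to reproduce the kind of comparison summarized above, the sketch below shows one way to run a Pearson chi-square test with a Monte Carlo check in Python. This is an illustration under stated assumptions, not the study's analysis code: the per-model correct/incorrect counts are reconstructed from the reported accuracy rates (80%, 75%, and 70% of 20 questions) rather than taken from the raw data, and the resampling scheme (binomial draws under a pooled accuracy) is an assumption; the original analysis may have conditioned on the table margins.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-model counts out of 20 questions, reconstructed from the
# reported accuracy rates (80%, 80%, 80%, 75%, 75%, 70%, 70%).
models = ["ChatGPT-4.5", "Gemini 2.5 Pro", "Claude 3.7",
          "Perplexity", "Manus", "Copilot", "DeepSeek"]
correct = np.array([16, 16, 16, 15, 15, 14, 14])
table = np.column_stack([correct, 20 - correct])  # 7 x 2 contingency table

# Asymptotic Pearson chi-square test of independence (model vs. correctness).
stat, p_asymptotic, dof, _ = chi2_contingency(table)
print(f"chi2({dof}, N = {table.sum()}) = {stat:.3f}, asymptotic P = {p_asymptotic:.3f}")

# Simple Monte Carlo check: resample 10,000 tables under the null hypothesis
# that every model shares the same pooled accuracy over 20 questions.
rng = np.random.default_rng(seed=0)
pooled = correct.sum() / table.sum()
n_sims = 10_000
exceed = 0
for _ in range(n_sims):
    sim_correct = rng.binomial(n=20, p=pooled, size=len(models))
    sim_table = np.column_stack([sim_correct, 20 - sim_correct])
    sim_stat, *_ = chi2_contingency(sim_table)
    exceed += sim_stat >= stat
print(f"Monte Carlo P ~ {exceed / n_sims:.3f} ({n_sims} simulated tables)")

With these reconstructed counts, the asymptotic statistic matches the reported value of 1.321, which suggests the illustration is consistent with the published figures, although the exact Monte Carlo P value will vary with the resampling scheme and random seed.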