Evaluating the Performance of Seven Large Language Models (GPT4.5, Gemini, Copilot, Claude, Perplexity, DeepSeek, and Manus) in Answering Healthcare Quality Management Inquiries

Authors

  • Dr. Mohammed Sallam, Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates; Department of Management, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates; Department of Management, School of Business, International American University, Los Angeles, CA 90010, United States of America; College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU), Dubai P.O. Box 505055, United Arab Emirates
  • Dr. Johan Snygg, Department of Management, Mediclinic City Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates; Department of Anesthesia and Intensive Care, University of Gothenburg, Sahlgrenska Academy, 41345 Gothenburg, Sweden
  • Dr. Ahmad Hamdan, Department of Management, School of Business, International American University, Los Angeles, CA 90010, United States of America; Department of Nursing, Mediclinic Welcare Hospital, Mediclinic Middle East, Dubai P.O. Box 31500, United Arab Emirates
  • Dr. Doaa Allam, Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates
  • Dr. Rana Kassem, Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates
  • Dr. Mais Damani, Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai P.O. Box 505004, United Arab Emirates

Keywords:

AI, artificial intelligence, LLMs, healthcare quality management, education tools, multiple-choice questions

Abstract

Large language models (LLMs) are increasingly used in education, healthcare, and decision support because of their advanced text-processing capabilities. This study evaluated the performance of seven LLMs (ChatGPT4.5, Gemini 2.5 Pro, Copilot, Claude 3.7, Perplexity, DeepSeek, and Manus) in answering multiple-choice questions on healthcare quality management. The assessment comprised 20 validated questions across four domains: organizational leadership (n = 5), health data analytics (n = 5), performance improvement (n = 5), and patient safety (n = 5). Accuracy rates ranged from 70% to 80%: ChatGPT4.5, Gemini, and Claude achieved 80%; Perplexity and Manus, 75%; and Copilot and DeepSeek, 70%. All models met or exceeded the predefined accuracy threshold of 70%. Descriptive statistics showed a mean of 15.19 correct responses (SD = 0.83) and 5.00 incorrect responses (SD = 0.85) per model, with a combined average of 12.71 responses (SD = 4.46). A Pearson chi-square test indicated no statistically significant difference in accuracy among the models, χ²(6, N = 140) = 1.321, P = .971. A Monte Carlo simulation with 10,000 sampled tables confirmed this result (P = .984, 95% CI).
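The chi-square comparison can be reconstructed from the figures reported above (20 questions per model, accuracies of 80/80/80/75/75/70/70%). The Python sketch below is illustrative only, not the authors' analysis code: it builds the implied 7 × 2 contingency table, runs a Pearson chi-square test, and adds a simple Monte Carlo check that resamples correct counts under an assumed equal-accuracy null, which may differ from the exact-table sampling used by the original statistical software.

```python
# Illustrative sketch (not the authors' code): counts are derived from the
# reported accuracies, with 20 questions answered by each of the 7 models.
import numpy as np
from scipy.stats import chi2_contingency

models = ["ChatGPT4.5", "Gemini 2.5 Pro", "Claude 3.7",
          "Perplexity", "Manus", "Copilot", "DeepSeek"]
correct = np.array([16, 16, 16, 15, 15, 14, 14])      # out of 20 each
table = np.column_stack([correct, 20 - correct])      # 7 x 2 contingency table

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}, N={table.sum()}) = {chi2:.3f}, P = {p:.3f}")

# Rough Monte Carlo check under a binomial null of equal accuracy (assumed
# procedure; the paper's software may instead sample tables with fixed margins).
rng = np.random.default_rng(0)
pooled = correct.sum() / table.sum()                  # pooled accuracy under the null
n_sims, exceed = 10_000, 0
for _ in range(n_sims):
    sim_correct = rng.binomial(20, pooled, size=len(models))
    sim_table = np.column_stack([sim_correct, 20 - sim_correct])
    if chi2_contingency(sim_table)[0] >= chi2:
        exceed += 1
print(f"Monte Carlo P \u2248 {exceed / n_sims:.3f}")
```

With the counts above, the test statistic matches the reported χ²(6, N = 140) = 1.321, and both p-values are far above .05, consistent with the conclusion of no significant accuracy differences among the models.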

The findings indicated comparable performance across the evaluated AI models in the context of healthcare quality education. These results support the use of large language models as supplementary tools in this domain, while highlighting the need for further evaluation of performance across specific content domains and their applicability in real-world professional training contexts.

Published

2025-05-29

How to Cite

Dr. Mohammed Sallam, Dr. Johan Snygg, Dr. Ahmad Hamdan, Dr. Doaa Allam, Dr. Rana Kassem, & Dr. Mais Damani. (2025). Evaluating the Performance of Seven Large Language Models (GPT4.5, Gemini, Copilot, Claude, Perplexity, DeepSeek, and Manus) in Answering Healthcare Quality Management Inquiries. Research and Advances in Education, 4(4), 39–50. Retrieved from https://www.paradigmpress.org/rae/article/view/1642

Issue

Vol. 4 No. 4 (2025)

Section

Articles