Evaluating the Performance of Seven Large Language Models (ChatGPT-4.5, Gemini, Copilot, Claude, Perplexity, DeepSeek, and Manus) in Answering Healthcare Quality Management Inquiries
Keywords:
AI, artificial intelligence, LLMs, healthcare quality management, education tools, multiple-choice questions

Abstract
Large language models (LLMs) are increasingly used in education, healthcare, and decision support because of their advanced text processing capabilities. This study evaluated the performance of seven LLMs (ChatGPT-4.5, Gemini 2.5 Pro, Copilot, Claude 3.7, Perplexity, DeepSeek, and Manus) in answering multiple-choice questions on healthcare quality management. The assessment included 20 validated questions across four domains: organizational leadership (n = 5), health data analytics (n = 5), performance improvement (n = 5), and patient safety (n = 5). Accuracy rates ranged from 70% to 80%: ChatGPT-4.5, Gemini, and Claude achieved 80%; Perplexity and Manus, 75%; and Copilot and DeepSeek, 70%. All models met or exceeded the predefined accuracy threshold of 70%. Descriptive statistics showed a mean of 15.19 correct responses (SD = 0.83) and 5.00 incorrect responses (SD = 0.85) per model, with a combined average of 12.71 responses (SD = 4.46). A Pearson chi-square test indicated no statistically significant differences in accuracy among the models, χ²(6, N = 140) = 1.321, P = .971. A Monte Carlo simulation based on 10,000 sampled tables confirmed this result (P = .984, with a 95% CI for the simulated P value).
The findings indicated comparable performance across the evaluated models in the context of healthcare quality education. These results support the use of large language models as supplementary tools in this domain, while highlighting the need for further evaluation of performance within specific content domains and of applicability to real-world professional training contexts.
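For readers who want to reproduce the kind of comparison summarized above, the sketch below shows one way to run a Pearson chi-square test with a Monte Carlo check in Python. This is an illustration under stated assumptions, not the study's analysis code: the per-model correct/incorrect counts are reconstructed from the reported accuracy rates (80%, 75%, and 70% of 20 questions) rather than taken from the raw data, and the resampling scheme (binomial draws under a pooled accuracy) is an assumption; the original analysis may have conditioned on the table margins.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-model counts out of 20 questions, reconstructed from the
# reported accuracy rates (80%, 80%, 80%, 75%, 75%, 70%, 70%).
models = ["ChatGPT-4.5", "Gemini 2.5 Pro", "Claude 3.7",
          "Perplexity", "Manus", "Copilot", "DeepSeek"]
correct = np.array([16, 16, 16, 15, 15, 14, 14])
table = np.column_stack([correct, 20 - correct])  # 7 x 2 contingency table

# Asymptotic Pearson chi-square test of independence (model vs. correctness).
stat, p_asymptotic, dof, _ = chi2_contingency(table)
print(f"chi2({dof}, N = {table.sum()}) = {stat:.3f}, asymptotic P = {p_asymptotic:.3f}")

# Simple Monte Carlo check: resample 10,000 tables under the null hypothesis
# that every model shares the same pooled accuracy over 20 questions.
rng = np.random.default_rng(seed=0)
pooled = correct.sum() / table.sum()
n_sims = 10_000
exceed = 0
for _ in range(n_sims):
    sim_correct = rng.binomial(n=20, p=pooled, size=len(models))
    sim_table = np.column_stack([sim_correct, 20 - sim_correct])
    sim_stat, *_ = chi2_contingency(sim_table)
    exceed += sim_stat >= stat
print(f"Monte Carlo P ~ {exceed / n_sims:.3f} ({n_sims} simulated tables)")

With these reconstructed counts, the asymptotic statistic matches the reported value of 1.321, which suggests the illustration is consistent with the published figures, although the exact Monte Carlo P value will vary with the resampling scheme and random seed.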