Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 27, 2024
Open Peer Review Period: Jun 28, 2024 - Aug 23, 2024
Date Accepted: Dec 20, 2024
Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: A Retrospective Cross-sectional Analysis
ABSTRACT
Background:
Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical exams. However, their potential application in the CNNLE remains unexplored.
Objective:
This study aims to evaluate the accuracy of seven LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. It also explores whether combining their outputs using machine learning techniques can improve their overall accuracy.
Methods:
This retrospective cross-sectional study analyzed all 1,200 multiple-choice questions (MCQs) from the CNNLE administered between 2019 and 2023. Seven LLMs were evaluated on these MCQs, and nine machine learning models (Logistic Regression, Support Vector Machine [SVM], Multilayer Perceptron [MLP], k-Nearest Neighbors [KNN], Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost) were employed to optimize overall performance through ensemble techniques.
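The abstract does not specify how the LLM outputs were encoded as inputs to the nine classifiers. Purely as an illustration of the kind of stacking this describes, the sketch below feeds each LLM's chosen option per question into XGBoost (the best-performing classifier reported here) and trains it to predict the correct option; the synthetic arrays, train/test split, and hyperparameters are assumptions, not details taken from the study.

```python
# Illustrative stacking sketch (not the study's code): each LLM's chosen
# option per MCQ becomes one feature column, and XGBoost is trained to
# predict the correct option. All data below is synthetic placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_questions, n_llms = 1200, 7  # 1,200 CNNLE MCQs, 7 LLMs (as in the abstract)

# Assumed encoding: option index (0-3) chosen by each LLM for each question.
llm_choices = rng.integers(0, 4, size=(n_questions, n_llms))
correct_option = rng.integers(0, 4, size=n_questions)  # placeholder answer key

X_train, X_test, y_train, y_test = train_test_split(
    llm_choices, correct_option, test_size=0.2, random_state=0
)

model = XGBClassifier(n_estimators=200, max_depth=4)  # hyperparameters assumed
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

With real answer keys and genuine LLM responses, the same pipeline could be repeated for the other eight classifiers the paper lists; at n = 1,200, cross-validation would be a more robust evaluation than a single split.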
Results:
Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared to the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the seven LLMs were combined using nine machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an Area Under the Curve (AUC) of 0.961, sensitivity of 0.905, specificity of 0.978, F1 score of 0.901, Positive Predictive Value (PPV) of 0.901, and Negative Predictive Value (NPV) of 0.977.
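For readers unfamiliar with these metrics, all of them follow from the binary confusion matrix (plus the predicted scores, in the case of the AUC). A minimal sketch with randomly generated stand-in predictions rather than the study's data:

```python
# How the reported metrics relate to a binary confusion matrix; y_true and
# y_score below are randomly generated stand-ins, not the study's data.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1200)           # ground truth (0/1)
y_score = 0.3 * y_true + 0.6 * rng.random(1200)  # classifier scores
y_pred = (y_score >= 0.5).astype(int)            # thresholded labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))  # recall / true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("PPV:", tp / (tp + fp))          # positive predictive value (precision)
print("NPV:", tn / (tn + fn))          # negative predictive value
print("F1:", f1_score(y_true, y_pred))        # harmonic mean of PPV and sensitivity
print("AUC:", roc_auc_score(y_true, y_score)) # uses scores, not thresholded labels
```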
Conclusions:
This study is the first to evaluate the performance of seven LLMs on the CNNLE, and it shows that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the potential of LLMs to transform healthcare education and call for further research to refine their capabilities and expand their impact on exam preparation and professional training. Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.