Abstract:
Since 2022, generative artificial intelligence (AI) chatbots have been rapidly integrated into various professional domains, including healthcare. Medical specialties such as pulmonology have also adopted these technologies, with generative AI demonstrating potential in interpreting imaging, explaining spirometry results, and supporting clinical decision-making and medical education. However, it remains debatable whether generative AI models can match the performance of human physicians on official medical licensing examinations.

Objective: To evaluate the performance of generative AI chatbots in answering pulmonology certification examination questions.

Materials and Methods: In December 2024, we presented examination tests from the question database for the certification of pulmonologists to the free generative AI chatbots most widely used in Ukraine: ChatGPT (version 3.5), Microsoft Copilot, and Gemini. The chatbots were instructed to answer 1095 test questions from the general database, after which the answers to questions on bronchial asthma (92 questions) and allergies (35 questions) were analysed separately.

Results: The accuracy of ChatGPT on the pulmonology tests was 95 % (n = 1037 correct answers), of Microsoft Copilot 92 % (n = 1008), and of Gemini 81 % (n = 890). On questions about the diagnosis and treatment of allergies, Microsoft Copilot showed the best accuracy with 100 % correct answers (n = 35); ChatGPT scored 94.3 % (n = 33) and Gemini 85.7 % (n = 30). On questions about bronchial asthma, ChatGPT answered correctly in 91.3 % of cases (n = 84), Gemini in 79.4 % (n = 73), and Copilot in 89.1 % (n = 82). All chatbots performed better on questions with a single correct answer than on those with multiple correct answers: ChatGPT 92.9 % vs. 75 %, Gemini 83.3 % vs. 37.5 %, and Copilot 94 % vs. 37.5 % correct answers.
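As a quick sanity check, the reported accuracy percentages can be reproduced from the stated counts of correct answers. The snippet below is an illustrative sketch only; the counts are taken from the abstract, and rounding to one decimal place is an assumption about the reporting convention:

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100 * correct / total, 1)

# Overall database: 1095 questions (counts as reported in the abstract)
overall = {
    "ChatGPT": accuracy(1037, 1095),           # 94.7, reported as 95 %
    "Microsoft Copilot": accuracy(1008, 1095), # 92.1, reported as 92 %
    "Gemini": accuracy(890, 1095),             # 81.3, reported as 81 %
}

# Allergy subset: 35 questions
allergy = {
    "Microsoft Copilot": accuracy(35, 35),  # 100.0
    "ChatGPT": accuracy(33, 35),            # 94.3
    "Gemini": accuracy(30, 35),             # 85.7
}

for model, pct in overall.items():
    print(f"{model}: {pct} % overall")
```

Note that the overall figures in the abstract are rounded to whole percentages, while the subset figures are given to one decimal place.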
Conclusions: Generative AI chatbots demonstrated high performance on the pulmonology certification examination test, at a level that would be considered a passing grade for a physician in the respiratory field. This applies in particular to the questions on bronchial asthma and allergies. ChatGPT showed the best accuracy, answering 95 % of all questions correctly. Generative AI was significantly better at answering questions with a single correct answer than questions with multiple correct answers.