The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty

Abstract

Background

Artificial intelligence (AI) chatbots have demonstrated proficiency in structured knowledge assessments; however, there is limited research on their performance in scenarios involving diagnostic uncertainty, which requires careful interpretation and complex decision-making. This study aims to evaluate the efficacy of AI chatbots, GPT-4o and Claude-3, in addressing medical scenarios characterized by diagnostic uncertainty relative to Family Medicine residents.

Methods

Questions with diagnostic uncertainty were extracted from the Progress Tests administered by the Department of Family and Community Medicine at the University of Toronto between 2022 and 2023. Diagnostic uncertainty questions were defined as those presenting clinical scenarios where symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced diagnostic reasoning and differential diagnosis. These questions were administered to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years and inputted into GPT-4o and Claude-3. Errors were categorized into statistical, information, and logical errors. Statistical analyses were conducted using a binomial generalized estimating equation model, paired t-tests, and chi-squared tests.

Results

Compared to the residents, both chatbots scored lower on diagnostic uncertainty questions (p < 0.01). PGY-1 residents achieved a correctness rate of 61.1% (95% CI: 58.4–63.7), and PGY-2 residents achieved 63.3% (95% CI: 60.7–66.1). In contrast, Claude-3 correctly answered 57.7% (n = 52/90) of questions, and GPT-4o correctly answered 53.3% (n = 48/90). Claude-3 had a longer mean response time (24.0 s, 95% CI: 21.0–32.5 vs. 12.4 s, 95% CI: 9.3–15.3; p < 0.01) and produced longer answers (2001 characters, 95% CI: 1845–2212 vs. 1596 characters, 95% CI: 1395–1705; p < 0.01) compared to GPT-4o. Most errors by GPT-4o were logical errors (62.5%).

Conclusions

While AI chatbots like GPT-4o and Claude-3 demonstrate potential in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents.

Introduction

In recent years, the potential benefits of artificial intelligence (AI) in healthcare have been extensively explored [1, 2]. More than half of outpatients at specialist care centers report barriers related to information availability and healthcare communication [3]. The advent of rapidly developing chatbots, such as ChatGPT, has highlighted the utility of AI in medical information dissemination and early patient education. These chatbots, with their advanced fluency and technical linguistic capabilities, offer the general patient population a wealth of easily accessible and accurate information [4,5,6]. They deliver context with careful consideration, potentially mitigating the occasionally alarming nature of highlighted internet search results [7, 8]. AI has already demonstrated benefits in triage, providing diagnostic results comparable to those of clinicians and offering safer recommendations on average [9, 10]. Furthermore, the rise of telemedicine as a medium for patient management presents an additional dimension suitable for language models [11].

Nonetheless, the intricacies of real-world medical practice go beyond static knowledge and involve domains fraught with diagnostic uncertainty. Diagnostic uncertainty arises when symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced interpretation, differential diagnosis, and often, iterative patient evaluation [12, 13]. This aspect of medical practice poses challenges even for seasoned clinicians, demanding a synthesis of experience, intuition, and continuous learning [14]. Previous studies have demonstrated that ChatGPT performs well on structured medical knowledge assessments, including the United States Medical Licensing Exam (USMLE) [15,16,17,18,19]. However, there is a paucity of research evaluating the performance of AI chatbots in scenarios involving diagnostic uncertainty.

In addition, it is crucial to consider the distinct ethical frameworks and training methodologies that different AI chatbots employ, as these factors can significantly influence their responses. For instance, ChatGPT is programmed with several moral principles, including privacy, non-maleficence, non-discrimination, and transparency, while Claude is trained within a virtue ethics framework, which emphasizes honesty and a context-sensitive approach [20,21,22]. This latter framework could potentially allow for more nuanced and empathetic responses, particularly in complex scenarios such as those involving diagnostic uncertainty. This study aims to assess the efficacy of AI chatbots in addressing medical scenarios characterized by diagnostic uncertainty and to compare the responses of chatbots trained on different ethical frameworks. Understanding the constraints and capabilities of AI chatbots in managing diagnostic uncertainty is crucial for their effective integration into clinical practice.

Methods

Study design

The Progress Test, conducted by the Department of Family and Community Medicine (DFCM) at the University of Toronto, functions as a formative tool to evaluate residents’ development towards becoming Family Medicine Experts and supports their preparation for Board Certification. This biannual examination is structured as a closed, four-hour multiple-choice test curated by subject matter experts in Family Medicine. Each item presents four response options, labeled A through D. For this study, we extracted all questions tagged with the “diagnostic uncertainty” assessment objective, as defined by The College of Family Physicians of Canada [23], from the four Progress Tests administered between 2022 and 2023 to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years. Diagnostic uncertainty questions were defined as those presenting clinical scenarios where symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced interpretation and differential diagnosis. The performance of the residents (N = 320) on these questions was then compared against that of the AI models GPT-4o and Claude-3. Ethical approval for this study was granted by the University of Toronto Research Ethics Board.
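
For readers who wish to mirror the extraction step, a minimal sketch follows, assuming the item bank can be exported to a spreadsheet with an assessment-objective tag per question; the file name and column names are hypothetical and not the DFCM’s actual export format.

```python
# Hypothetical sketch of selecting "diagnostic uncertainty" items from an
# exported question bank. File and column names are illustrative only.
import pandas as pd

items = pd.read_csv("progress_test_items_2022_2023.csv")  # assumed export

# Keep only items tagged with the CFPC "diagnostic uncertainty" objective.
uncertainty_items = items[
    items["assessment_objective"].str.contains(
        "diagnostic uncertainty", case=False, na=False
    )
]

print(f"{len(uncertainty_items)} of {len(items)} questions retained")
```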

Data collection

To maintain study integrity, each question was input into both GPT-4o and Claude-3 in the same format as presented in the official examination, with multiple-choice answers labeled A through D, without any alterations or additional cues. Prior to entering each question, the chatbots’ conversation history was reset, and memory cleared to avoid any influence from previous interactions. The chatbots’ responses were reviewed by two independent reviewers (R.S.H., A.B.) to identify the chosen multiple-choice options. Each LLM was queried with the same question three times to assess for variability. Collected data included the date of question input, response length in characters, response time in seconds, the presence of a rationale for excluding other options, and the root cause of any incorrect responses. If the AI chatbot selected “all of the above” or “none of the above,” the answer was marked incorrect since these were not valid choices.
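
The chatbots were queried through their standard web interfaces; for illustration only, the sketch below shows how the same protocol (a fresh context for every prompt, three trials per question, and response time and character count recorded) could be scripted against the public OpenAI and Anthropic APIs. The model identifiers and helper names are assumptions, not part of the study.

```python
# Hedged sketch: scripting the querying protocol via the public APIs.
import time
from openai import OpenAI        # pip install openai
from anthropic import Anthropic  # pip install anthropic

openai_client = OpenAI()      # reads OPENAI_API_KEY from the environment
claude_client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def ask_gpt4o(question: str) -> str:
    # A single-turn message list gives a fresh context for every question.
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def ask_claude3(question: str) -> str:
    resp = claude_client.messages.create(
        model="claude-3-opus-20240229",  # assumed Claude-3 variant
        max_tokens=2048,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

def run_trials(question: str, n_trials: int = 3) -> list[dict]:
    """Query both models n_trials times, recording time and length."""
    records = []
    for trial in range(n_trials):
        for name, ask in (("gpt-4o", ask_gpt4o), ("claude-3", ask_claude3)):
            start = time.perf_counter()
            answer = ask(question)
            records.append({
                "model": name,
                "trial": trial + 1,
                "seconds": time.perf_counter() - start,
                "characters": len(answer),
                "text": answer,
            })
    return records
```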

For each question, it was documented whether the response provided reasons for excluding incorrect options. Incorrect responses were classified into three mutually exclusive types by the reviewers (R.S.H., A.B.): statistical errors, information errors, and logical errors. Statistical errors were defined as mistakes in arithmetic calculations. Information errors occurred when the chatbot gathered incorrect information either from the question itself or external sources, resulting in an incorrect answer. Logical errors were identified when the AI chatbot had access to the correct information but failed to apply it accurately to arrive at the correct answer.
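
Error labels were assigned manually by the two reviewers; as a rough sketch of how such mutually exclusive categories might be recorded and tallied, the snippet below uses hypothetical annotations and field names.

```python
# Illustrative tally of reviewer-assigned error categories (names hypothetical).
from collections import Counter
from enum import Enum

class ErrorType(Enum):
    STATISTICAL = "statistical"   # arithmetic/calculation mistake
    INFORMATION = "information"   # wrong facts pulled from the stem or elsewhere
    LOGICAL = "logical"           # correct facts, misapplied reasoning

# Hypothetical annotations: (question_id, model, error type or None if correct)
annotations = [
    ("Q07", "gpt-4o", ErrorType.LOGICAL),
    ("Q12", "gpt-4o", None),
    ("Q12", "claude-3", ErrorType.INFORMATION),
]

def error_breakdown(model: str) -> Counter:
    """Proportion of each error type among a model's incorrect answers."""
    errors = [e for _, m, e in annotations if m == model and e is not None]
    counts = Counter(e.value for e in errors)
    total = sum(counts.values())
    return Counter({k: v / total for k, v in counts.items()}) if total else counts

print(error_breakdown("gpt-4o"))
```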

Statistical analysis

The primary outcome of this study was to compare the performance of AI chatbots and PGY-1 and PGY-2 residents in answering questions involving diagnostic uncertainty. Secondary outcomes included comparing GPT-4o and Claude-3 performance, response length, response time, and the proportion of questions for which a rationale for excluding other options was provided. Resident performance was calculated as an aggregate of the performance statistics on diagnostic uncertainty questions from the Family Medicine Progress Tests administered between 2022 and 2023, with 95% confidence intervals (CIs) derived using a binomial generalized estimating equation model. Chatbot performance was calculated based on the percentage of correct responses to the extracted questions. Analyses were stratified across each of the nine priority question areas. Paired t-tests were employed to compare means, and chi-squared tests were applied to compare proportions. We applied the Bonferroni correction method to control the family-wise error rate, ensuring that the significance level was appropriately maintained across the multiple comparisons [24]. A p-value threshold of 0.05 was set to determine statistical significance. Statistical analyses were conducted using Stata version 17.0 (StataCorp LLC, College Station, Texas).
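
A Python analogue of these steps is sketched below for orientation; the study itself used Stata 17, and the file names and column names ("resident_id", "pgy", "correct", "claude_seconds", etc.) are illustrative assumptions only.

```python
# Sketch of the analysis pipeline in Python (the study used Stata 17).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.multitest import multipletests

# One row per resident x diagnostic-uncertainty question (hypothetical export).
residents = pd.read_csv("resident_item_responses.csv")

# Binomial GEE clustered on resident (each resident answers many items),
# giving correctness estimates with 95% CIs by training year.
gee = smf.gee(
    "correct ~ C(pgy)",
    groups="resident_id",
    data=residents,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()
print(gee.summary())

# Paired t-tests on per-question chatbot metrics (hypothetical wide-format file
# with one row per question).
chatbots = pd.read_csv("chatbot_per_question.csv")
_, p_time = stats.ttest_rel(chatbots["claude_seconds"], chatbots["gpt_seconds"])
_, p_len = stats.ttest_rel(chatbots["claude_chars"], chatbots["gpt_chars"])

# Bonferroni correction across the family of comparisons.
reject, p_adj, _, _ = multipletests([p_time, p_len], alpha=0.05, method="bonferroni")
print(dict(zip(["response_time", "response_length"], p_adj)))
```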

Results

A total of ninety questions involving diagnostic uncertainty across nine categories within Family Medicine were included in the study, selected from a total of 440 questions across four Progress Tests administered between 2022 and 2023 (Table 1). Overall, Claude-3 correctly answered 57.7% (n = 52/90) of the questions, while GPT-4o correctly answered 53.3% (n = 48/90) (Fig. 1). Both chatbots provided the same multiple-choice answer across all three trials for each question. The performance difference between the two chatbots was not statistically significant (p = 0.55). When comparing the performance of GPT-4o and Claude-3 to Family Medicine residents on diagnostic uncertainty questions, both chatbots underperformed relative to the residents. PGY-1 residents achieved an average correctness rate of 61.1% (95% CI: 58.4–63.7), and PGY-2 residents scored 63.3% (95% CI: 60.7–66.1), both significantly higher than the chatbots (p < 0.01). In specific categories, GPT-4o outperformed the residents in cardiovascular and gastrointestinal questions, with scores of 80% and 70%, respectively, compared to 64.5% and 65.7% among PGY-1 and PGY-2 residents (p < 0.01). Claude-3 excelled in geriatric care, mental health, and women’s health, scoring 70%, 80%, and 70%, respectively, outperforming the residents’ scores of 59.6%, 52.4%, and 56.4% (p < 0.01). Conversely, residents outperformed both chatbots in the endocrine, musculoskeletal, pediatric, and respiratory categories (p < 0.01).
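
As a quick consistency check, the head-to-head chatbot comparison can be reproduced from the counts reported above, assuming a chi-squared test without continuity correction; this is an illustrative calculation, not the authors’ Stata output.

```python
# Check of the Claude-3 vs. GPT-4o comparison from the reported counts.
from scipy.stats import chi2_contingency

table = [[52, 90 - 52],   # Claude-3: correct, incorrect
         [48, 90 - 48]]   # GPT-4o:   correct, incorrect
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")  # approximately chi2 = 0.36, p = 0.55
```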

Table 1 Comparison of GPT-4o and Claude-3 performance to Family Medicine residents on questions with diagnostic uncertainty
Fig. 1 Resident and AI chatbot performance on diagnostic uncertainty questions

Claude-3 had a longer mean response time of 24.0 s (95% CI: 21.0–32.5) compared to GPT-4o, which had a mean response time of 12.4 s (95% CI: 9.3–15.3) (p < 0.01) (Table 2). In terms of response length, Claude-3 also produced longer answers, with a mean of 2001 characters (95% CI: 1845–2212) compared to GPT-4o’s 1596 characters (95% CI: 1395–1705) (p < 0.01). Both chatbots frequently provided rationales for other answer options, with Claude-3 doing so slightly more often than GPT-4o, although the difference was not statistically significant (86.7% vs. 78.9%; p = 0.17). Regarding the types of errors made, GPT-4o predominantly made logical errors, accounting for 62.5% of its mistakes, followed by information errors (18.8%) and statistical errors (18.8%). In contrast, Claude-3 had a lower proportion of logical errors at 44.7%, but a higher rate of information errors (31.6%). An example of the output from GPT-4o and Claude-3 is provided in Table 3, and examples of errors are provided in Table 4.

Table 2 Comparison of response characteristics between GPT-4o and Claude-3
Table 3 Example GPT-4o and Claude-3 response to a diagnostic uncertainty question
Table 4 Example of logical, statistical, and information errors

Discussion

Our study compared the performance of AI chatbots, GPT-4o and Claude-3, against Family Medicine residents in addressing diagnostic uncertainty using questions from official progress examinations at the University of Toronto DFCM. Overall, both chatbots underperformed relative to the residents. Although Claude-3 generated longer and more rationale-rich responses, it was more prone to information errors compared to GPT-4o.

In a previous study examining chatbot performance on a Family Medicine Progress Test, ChatGPT demonstrated superior performance compared to the best-performing resident, highlighting its capability in handling well-defined medical knowledge assessments [16]. However, the results from our novel study, focusing solely on questions involving diagnostic uncertainty, reveal a significant shift in performance dynamics. Both GPT-4o and Claude-3 performed worse than first-year Family Medicine residents. This discrepancy underscores the heightened complexity and nuanced judgment required in scenarios characterized by diagnostic uncertainty, which current AI systems struggle to navigate effectively [25].

There are several plausible explanations for why AI systems struggle with this dimension of healthcare provision. Primarily, AI systems lack the contextual understanding required to appreciate the intricacies of modern medicine [26]. Their algorithms, trained on statistical patterns within limited data sets, are ill-suited to handle rare disease presentations, compounding illnesses, and conflicting clinical data [26, 27]. This bias toward their training data leads AI systems to fill gaps in information with assumptions, resulting in incomplete and incorrect diagnoses [28]. For instance, AI systems like GPT-4o have been found to prefer clinical diagnoses over pathological causes, such as selecting frontotemporal dementia over frontotemporal lobar degeneration, possibly influenced by the available training data [29]. The authenticity and quality of the training data used by these systems are of great consequence [30, 31]. The validity, diversity, and representativeness of the included datasets reinforce the system’s decision-making capacity when approaching rare and complex cases. Conversely, human physicians possess a wealth of experience regarding disease presentation, allowing them to consider individual circumstances, history, prevalence, and additional investigations as part of a holistic diagnostic process [32]. This level of nuanced understanding is challenging to encode into an AI system.

Another critical consideration is that ChatGPT has been found incapable of recognizing and expressing uncertainty [33]. A cornerstone of modern medical practice and the training of medical practitioners is the risk assessment process, which involves calculating the probabilities of failure or complications while considering the patient’s comorbidities and weighing these against the potential benefits of the intervention [33]. Salihu et al. (2024) describe seven cases where AI selected invasive treatments, whereas human physicians determined that medication would suffice [34]. These decisions were based on a complex array of considerations involving frailty, comorbidities, and life expectancy [34]. A similar finding emerged in our study, with ChatGPT recommending investigations when none were required. AI systems tend to answer decisively and confidently, often overestimating their confidence level regardless of the validity of their responses. ChatGPT has also been shown to be unable to use low confidence as a cue to leave questions unanswered in a sample exam designed to challenge its strategic capabilities [33]. This overconfidence may be considered a linguistic trait essential to the marketability of the system, but it underscores a significant concern regarding its integration into healthcare delivery.

The observed performance differences between GPT-4o and Claude-3 in specific medical domains can potentially be attributed to the distinct ethical frameworks and training methodologies employed for each AI system. GPT-4o performed better in areas such as cardiovascular and gastrointestinal health, possibly due to its programming with a predetermined set of moral principles, including privacy, non-maleficence, non-discrimination, and transparency [21]. These principles may guide GPT-4o towards clear, decisive answers in well-defined medical scenarios where established protocols and concrete data are available, as is often the case in cardiovascular and gastrointestinal health. Conversely, Claude excelled in mental health, women’s care, and geriatric care, which may be attributed to its training based on virtue ethics, emphasizing honesty and intention within a flexible, context-sensitive framework [20]. The nuanced and individualized nature of these domains likely benefits from the virtue ethics approach, which allows for more empathetic and contextually appropriate responses. Mental health, women’s care, and geriatric care often involve complex, subjective factors and require a deep understanding of the patient’s unique circumstances. Claude’s ethical framework may better equip it to navigate these complexities, providing more thoughtful and tailored responses. Consistent with the literature, the majority of ChatGPT’s errors were also in logical reasoning [16]. Given that diagnostic uncertainty questions often arise from incomplete or highly nuanced information that escapes common medical databases, ChatGPT may simply overlook steps in logical reasoning [35]. Claude-3, in contrast, committed fewer logical errors. These findings suggest that the ethical training heuristics embedded in AI systems may influence their performance across different medical domains, especially in scenarios involving diagnostic uncertainty.

Beyond accuracy, Claude-3 responded to the prompts more slowly than GPT-4o and produced longer answers on average. Longer response times may suggest that the LLMs are engaging in more detailed analysis, which could correlate with higher accuracy in scenarios requiring nuanced decision-making. This is partially supported by our findings, where Claude-3, with longer response times, performed slightly better than GPT-4o, although the difference was not statistically significant. However, it is important to recognize that response times are also subject to server latency and other external factors, which could introduce variability unrelated to the LLM’s cognitive processing. Therefore, while response time provides some insight into the LLM’s functioning, its interpretation should be approached with caution.

Our investigation is subject to several limitations. Given that ChatGPT is updated regularly, incorporating user feedback, its responses to identical queries might vary over time. We attempted to control for these variations by having the models respond to all multiple-choice questions on the same day, and we confirmed the consistency of responses across two different web browsers and three trials per question. It is essential to consider that the findings of this study are relevant to the specific period when they were collected, as the capabilities of both GPT-4o and Claude-3 are expected to evolve. Moreover, these models depend on cookies for optimal functionality and their responses can be affected by prior inputs. To counteract this, we regularly cleared conversation histories and memory before entering new prompts. Another consideration is that our questions were multiple-choice; the models’ performance might differ with open-ended questions or tasks requiring prioritization.

Conclusions

In conclusion, while AI chatbots like GPT-4o and Claude-3 show promise in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents. The influence of ethical rule sets on AI performance warrants further investigation, as a virtue ethics framework may offer some advantages in managing complex clinical decisions. Future studies should focus on exploring the capabilities of AI in authentic healthcare contexts, particularly in its role as a clinical decision support tool intended to augment, not replace, physician clinical reasoning.

Data availability

The data that support the findings of this study may be requested at ry.huang@mail.utoronto.ca with support from the principal investigator Fok-Han Leung.

Abbreviations

AI: Artificial Intelligence

PGY-1: Postgraduate Year 1

PGY-2: Postgraduate Year 2

DFCM: Department of Family and Community Medicine

CI: Confidence Interval

USMLE: United States Medical Licensing Exam

LLM: Large Language Model

References

  1. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94–8. https://doi.org/10.7861/futurehosp.6-2-94.

  2. Felfeli T, Huang RS, Lee T-SJ, et al. Assessment of predictive value of artificial intelligence for ophthalmic diseases using electronic health records: a systematic review and meta-analysis. JFO Open Ophthalmol. 2024;7:100124. https://doi.org/10.1016/j.jfop.2024.100124.

  3. Fradgley EA, Paul CL, Bryant J. A systematic review of barriers to optimal outpatient specialist services for individuals with prevalent chronic diseases: what are the unique and common barriers experienced by patients in high income countries? Int J Equity Health. 2015;14(1):52. https://doi.org/10.1186/s12939-015-0179-6.

  4. Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr. 2023;7(2):pkad010. https://doi.org/10.1093/jncics/pkad010.

  5. Mihalache A, Huang RS, Popovic MM, Muni RH. Artificial intelligence chatbot and Academy Preferred Practice Pattern® Guidelines on cataract and glaucoma. J Cataract Refract Surg. 2024;50(5).

  6. Patil NS, Huang R, Mihalache A. The ability of artificial intelligence chatbots ChatGPT and Google Bard to accurately convey preoperative information for patients undergoing ophthalmic surgeries. Retina. 2024;44(6).

  7. Patil NS, Huang RS, van der Pol CB, Larocque N. Using artificial intelligence chatbots as a radiologic decision-making tool for liver imaging: do ChatGPT and Bard communicate information consistent with the ACR Appropriateness Criteria? J Am Coll Radiol. 2023;20(10):1010–3. https://doi.org/10.1016/j.jacr.2023.07.010.

  8. Patil NS, Huang R, Caterine S, Varma V, Mammen T, Stubbs E. Comparison of artificial intelligence chatbots for musculoskeletal radiology procedure patient education. J Vasc Interv Radiol. 2024;35(4):625–e62726. https://doi.org/10.1016/j.jvir.2023.12.017.

  9. Mihalache A, Huang RS, Patil NS, et al. Chatbot and Academy Preferred Practice Pattern guidelines on retinal diseases. Ophthalmol Retina. 2024. https://doi.org/10.1016/j.oret.2024.03.013.

  10. Baker A, Perov Y, Middleton K, et al. A comparison of artificial intelligence and human doctors for the purpose of triage and diagnosis. Front Artif Intell. 2020;3:543405. https://doi.org/10.3389/frai.2020.543405.

  11. Howard A, Hope W, Gerada A. ChatGPT and antimicrobial advice: the end of the consulting infection doctor? Lancet Infect Dis. 2023;23(4):405–6. https://doi.org/10.1016/s1473-3099(23)00113-5.

  12. Bhise V, Rajan SS, Sittig DF, Morgan RO, Chaudhary P, Singh H. Defining and measuring diagnostic uncertainty in medicine: a systematic review. J Gen Intern Med. 2018;33(1):103–15. https://doi.org/10.1007/s11606-017-4164-1.

  13. Huang RS, Mihalache A, Popovic MM, Kertes PJ, Wong DT, Muni RH. Ocular comorbidities contributing to death in the US. JAMA Netw Open. 2023;6(8):e2331018. https://doi.org/10.1001/jamanetworkopen.2023.31018.

  14. Alam R, Cheraghi-Sohi S, Panagioti M, Esmail A, Campbell S, Panagopoulou E. Managing diagnostic uncertainty in primary care: a systematic critical review. BMC Fam Pract. 2017;18(1):79. https://doi.org/10.1186/s12875-017-0650-0.

  15. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2024;46(3):366–72. https://doi.org/10.1080/0142159x.2023.2249588.

  16. Huang RS, Lu KJQ, Meaney C, Kemppainen J, Punnett A, Leung FH. Assessment of resident and AI chatbot performance on the University of Toronto Family Medicine Residency Progress Test: comparative study. JMIR Med Educ. 2023;9:e50514. https://doi.org/10.2196/50514.

  17. Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative performance of ChatGPT and Bard in a text-based radiology knowledge assessment. Can Assoc Radiol J. 2024;75(2):344–50. https://doi.org/10.1177/08465371231193716.

  18. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. https://doi.org/10.2196/45312.

  19. Mihalache A, Grad J, Patil NS, et al. Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment. Eye. 2024. https://doi.org/10.1038/s41433-024-03067-4.

  20. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 Model Card. 2024.

  21. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems. 2023;3:121–54. https://doi.org/10.1016/j.iotcps.2023.04.003.

  22. Patil NS, Huang RS, van der Pol CB, Larocque N. Reply to: Can ChatGPT truly overcome other LLMs? Can Assoc Radiol J. 2024;75(2):430. https://doi.org/10.1177/08465371231201379.

  23. Crichton TSK, Lawrence K, Donoff M, Laughlin T, Brailovsky C, Bethune C, van der Goes T, Dhillon K, Pélissier-Simard L, Ross S, Hawrylyshyn S, Potter M. Assessment objectives for certification in family medicine. College of Family Physicians of Canada; 2020.

  24. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310(6973):170. https://doi.org/10.1136/bmj.310.6973.170.

  25. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. https://doi.org/10.1148/radiol.230582.

  26. Ullah E, Parwani A, Baig MM, Singh R. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review. Diagn Pathol. 2024;19(1):43. https://doi.org/10.1186/s13000-024-01464-7.

  27. Pattathil N, Lee TJ, Huang RS, Lena ER, Felfeli T. Adherence of studies involving artificial intelligence in the analysis of ophthalmology electronic medical records to AI-specific items from the CONSORT-AI guideline: a systematic review. Graefes Arch Clin Exp Ophthalmol. 2024. https://doi.org/10.1007/s00417-024-06553-3.

  28. Ma Y. The potential application of ChatGPT in gastrointestinal pathology. Gastroenterology & Endoscopy. 2023;1(3):130–1. https://doi.org/10.1016/j.gande.2023.05.002.

  29. Koga S, Martin NB, Dickson DW. Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders. Brain Pathol. 2024;34(3):e13207. https://doi.org/10.1111/bpa.13207.

  30. Huang RS, Mihalache A, Popovic MM, et al. Artificial intelligence-based extraction of quantitative ultra-widefield fluorescein angiography parameters in retinal vein occlusion. Can J Ophthalmol. 2024. https://doi.org/10.1016/j.jcjo.2024.08.002.

  31. Huang RS, Mihalache A, Popovic MM, et al. Artificial intelligence-enhanced analysis of retinal vasculature in age-related macular degeneration. Retina. 2024;44(9).

  32. Huang RS, Kam A. Humanism in Canadian medicine: from the Rockies to the Atlantic. Can Med Educ J. 2024;15(2):97–8. https://doi.org/10.36834/cmej.78391.

  33. Tsai C-Y, Hsieh S-J, Huang H-H, Deng J-H, Huang Y-Y, Cheng P-Y. Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings. World J Urol. 2024;42(1):250. https://doi.org/10.1007/s00345-024-04957-8.

  34. Salihu A, Meier D, Noirclerc N, et al. A study of ChatGPT in facilitating Heart Team decisions on severe aortic stenosis. EuroIntervention. 2024;20(8):e496–e503. https://doi.org/10.4244/eij-d-23-00643.

  35. Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The comparative diagnostic capability of large language models in otolaryngology. Laryngoscope. 2024. https://doi.org/10.1002/lary.31434.

Acknowledgements

None.

Funding

None.

Author information

Contributions

All authors contributed to Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Roles/Writing - original draft; and Writing - review & editing.

Corresponding author

Correspondence to Ryan S. Huang.

Ethics declarations

Ethics approval and consent to participate

Ethical approval was obtained from the University of Toronto Research Ethics Board (#00044429). Informed consent was acquired from all participants.

Consent for publication

Not Applicable.

Clinical trial number

N/A. This study is not a clinical trial.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Huang, R.S., Benour, A., Kemppainen, J. et al. The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty. BMC Med Educ 24, 1133 (2024). https://doi.org/10.1186/s12909-024-06115-5
