ChatGPT-assisted diagnosis: Is the future suddenly here?

I have been framing the future of health care as face-to-face interactions for making a diagnosis (and getting the patient to buy in to the treatment plan), with ongoing maintenance or surveillance handled through telemedicine and supporting technologies. This article has forced me to reconsider how we might deploy artificial intelligence to support patients and clinicians during the triage and diagnosis stages of a medical journey.

“Symptom checkers serve two main functions: they facilitate self-diagnosis and assist with self-triage. They typically provide the user with a list of potential diagnoses and a recommendation of how quickly they should seek care, such as “see a doctor right now” versus “you can treat this at home.” [..]

Our team once tested the performance of 23 symptom checkers using 45 clinical vignettes across a range of clinical severity. The results raised substantial concerns. On average, symptom checkers listed the correct diagnosis within the top three options just 51% of the time and advised seeking care two-thirds of the time.

When the same vignettes were given to physicians, they — reassuringly — did much better and were much more likely to list the correct diagnosis within the top three options (84%). Though physicians outperformed symptom checkers, misdiagnosis was still common, consistent with prior research. [..]

We gave ChatGPT the same 45 vignettes previously tested with symptom checkers and physicians. It listed the correct diagnosis within the top three options in 39 of the 45 vignettes (87%, beating symptom checkers’ 51%) and provided appropriate triage recommendations for 30 vignettes (67%). Its performance in diagnosis already appears to be improving with updates. When we tested the same vignettes with an older version of ChatGPT, its accuracy was 82%. [..]
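The headline numbers above (39 of 45 vignettes = 87% top-3 accuracy, 30 of 45 = 67% appropriate triage) are simple proportions over the vignette set. A minimal sketch of how such scoring might be computed — the data records and function names here are hypothetical, not the authors' actual code or results:

```python
# Hypothetical scoring sketch for a vignette-based evaluation.
# Each record holds the gold-standard answer and the model's output.

def top3_accuracy(results):
    """Fraction of vignettes where the correct diagnosis appears
    in the model's top three suggestions."""
    hits = sum(1 for r in results if r["correct_dx"] in r["top3"])
    return hits / len(results)

def triage_agreement(results):
    """Fraction of vignettes where the model's triage recommendation
    matches the gold-standard level."""
    matches = sum(1 for r in results if r["triage"] == r["gold_triage"])
    return matches / len(results)

# Illustrative (made-up) vignette results:
results = [
    {"correct_dx": "appendicitis",
     "top3": ["gastroenteritis", "appendicitis", "urinary tract infection"],
     "triage": "emergent", "gold_triage": "emergent"},
    {"correct_dx": "migraine",
     "top3": ["tension headache", "sinusitis", "cluster headache"],
     "triage": "self-care", "gold_triage": "non-emergent"},
]

print(f"top-3 accuracy: {top3_accuracy(results):.0%}")     # 1 of 2 vignettes
print(f"triage agreement: {triage_agreement(results):.0%}")  # 1 of 2 vignettes
```

With 45 real vignettes in place of the two dummy records, 39 top-3 hits would yield the 87% figure and 30 triage matches the 67% figure reported above.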

Some caveats first: We tested a small sample, just 45 cases, and used the kind of clinical vignettes that are used to test medical students and residents, which may not reflect how the average person might describe their symptoms in the real world. So we are cautious about the generalizability of our results. In addition, we have noticed that ChatGPT’s results are sensitive to how information is presented and what questions are being asked. In other words, more rigorous testing is needed.

That said, our results show that ChatGPT’s performance is a substantial step forward from using Google search or online symptom checkers. Indeed, we are seeing a computer come close to the performance of physicians in terms of diagnosis, a critical milestone in the development of AI tools. ChatGPT is only the start. Google has recently announced its own AI chatbot and many other companies are likely to follow suit.

[..] AI tools could become a standard part of clinical care to reduce misdiagnosis, which unfortunately remains much too common in health care: An estimated 10% to 15% of diagnoses are wrong. There are many underlying reasons for misdiagnoses, ranging from physicians anchoring too quickly on a diagnosis to overconfidence. Tools like ChatGPT could be used as an adjunct, just as adjunctive AI tools are being used for other clinical applications. Radiologists who read CT images, for example, now use AI algorithms to flag those showing an intracranial hemorrhage or a blood clot in the lungs.

While this future is exciting, unknowns and pitfalls exist. A key one is how a patient’s history, physical exam findings, and test results would be fed into an algorithm in a clinic’s workflow. Another is that while AI algorithms are prone to errors — as are humans — people sometimes place undue trust in AI output. If a physician disagrees with the AI’s output, how will this affect patient and physician interactions, and will such disagreements need to be adjudicated?

AI models are also prone to bias. The sheer size of the internet-based source material does not ensure that the AI will show diversity in the responses it provides. Instead, it runs the risk of amplifying harmful biases and stereotypes embedded in that material. The scale of the training material may also make it difficult, or even impossible, for these algorithms to adapt to changing social views and clinical norms.

Despite these unknowns, it appears the future of computer-assisted diagnosis is suddenly here and the health care system will now need to respond and address these challenges.”

Full article: R. Hailu, A. Beam, and A. Mehrotra, STAT First Opinion, February 13, 2023.