A new study, published in Nature Medicine, from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, carried out in partnership with MLCommons and researchers from Bangor University's North Wales Medical School and other institutions, reveals a major gap between the promise of large language models (LLMs) and their usefulness for people seeking medical advice. While these models now excel at standardised tests of medical knowledge, they pose risks to real users seeking help with their own medical symptoms.
Key findings:
- No better than traditional methods
Participants used LLMs to identify health conditions and decide on an appropriate course of action, such as seeing a GP or going to the hospital, based on information provided in a series of specific medical scenarios developed by doctors. Those using LLMs did not make better decisions than participants who relied on traditional methods like online searches or their own judgment.
- Communication breakdown
The study revealed a two-way communication breakdown. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.
- Existing tests fall short
Current evaluation methods for LLMs do not reflect the complexity of interacting with human users. Like new medications in clinical trials, LLM systems should be tested in the real world before being deployed.
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” said Dr Rebecca Payne, GP, lead medical practitioner on the study, Clinical Senior Lecturer, Bangor University and Clarendon-Reuben Scholar, Nuffield Department of Primary Care Health Sciences. “Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.”
Real users, real challenges
In the study, researchers conducted a randomised trial involving nearly 1,300 online participants, who were asked to identify potential health conditions and a recommended course of action based on personal medical scenarios. The detailed scenarios, developed by doctors, ranged from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.
One group used an LLM to assist their decision-making, while a control group used other traditional sources of information. The researchers then evaluated how accurately participants identified the likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. They also compared these outcomes to the results of standard LLM testing strategies, which do not involve real human users. The contrast was striking; models that performed well on benchmark tests faltered when interacting with people.
They found evidence of three types of challenge:
- Users often didn’t know what information they should provide to the LLM
- LLMs provided very different answers based on slight variations in the questions asked
- LLMs often provided a mix of good and bad information which users struggled to distinguish
“Designing robust testing for large language models is key to understanding how we can make use of this new technology,” said lead author Andrew Bean, a doctoral researcher at the Oxford Internet Institute.
“In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.”