Some doctors see LLMs as a boon for medical literacy. The typical patient may struggle to navigate the vast landscape of online medical information, and in particular to distinguish high-quality sources from polished but factually dubious websites, but LLMs can do that job for them, at least in theory. Treating patients who had searched for their symptoms on Google required "a lot of attacking patient anxiety [and] decreasing misinformation," says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, "you see patients with a college education, a high school education, asking questions at the level of something an early med student might ask."
The release of ChatGPT Health, and Anthropic's subsequent announcement of new health integrations for Claude, indicate that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs' well-documented tendencies to agree with users and to make up information rather than admit ignorance.
But these risks also need to be weighed against potential benefits. There's an analogy here to autonomous vehicles: When policymakers consider whether to allow Waymo in their city, the key metric is not whether its cars are ever involved in accidents but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google (and early evidence suggests it may be), it could potentially reduce the enormous burden of medical misinformation and unnecessary health anxiety that the internet has created.
Pinning down the effectiveness of a chatbot such as ChatGPT or Claude for consumer health, however, is difficult. "It's exceedingly difficult to evaluate an open-ended chatbot," says Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing exams, but those exams use multiple-choice questions that don't reflect how people use chatbots to look up medical information.
Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, tried to close that gap by evaluating how GPT-4o responded to licensing exam questions when it didn't have access to a list of possible answers. Medical experts who evaluated the responses scored only about half of them as entirely correct. But multiple-choice exam questions are designed to be tricky enough that the answer options don't entirely give them away, and they're still a fairly distant approximation of the kind of thing a user would type into ChatGPT.
A different study, which tested GPT-4o on more realistic prompts submitted by human volunteers, found that it answered medical questions correctly about 85% of the time. When I spoke with Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, he made it clear that he wasn't personally a fan of patient-facing medical LLMs. But he freely admits that, technically speaking, they seem up to the task; after all, he says, human doctors misdiagnose patients 10% to 15% of the time. "If I look at it dispassionately, it seems that the world is gonna change, whether I like it or not," he says.
For people seeking medical information online, Yadav says, LLMs do seem to be a better choice than Google. Succi, the radiologist, also concluded that LLMs can be a better alternative to web search when he compared GPT-4's responses to questions about common chronic medical conditions with the information presented in Google's knowledge panel, the information box that sometimes appears on the right side of the search results.
Since Yadav's and Succi's studies appeared online, in the first half of 2025, OpenAI has released several new versions of GPT, and it's reasonable to expect that GPT-5.2 would perform even better than its predecessors. But the studies do have important limitations: They focus on simple, factual questions, and they examine only brief interactions between users and chatbots or web search tools. Some of the weaknesses of LLMs, most notably their sycophancy and tendency to hallucinate, might be more likely to rear their heads in more extensive conversations and with people who are dealing with more complex problems. Reeva Lederman, a professor at the University of Melbourne who studies technology and health, notes that patients who don't like the diagnosis or treatment recommendations they receive from a doctor might seek out another opinion from an LLM, and the LLM, if it's sycophantic, might encourage them to reject their doctor's advice.
Some studies have found that LLMs will hallucinate and exhibit sycophancy in response to health-related prompts. For example, one study showed that GPT-4 and GPT-4o will happily accept and run with incorrect drug information included in a user's question. In another, GPT-4o frequently concocted definitions for fake syndromes and lab tests mentioned in the user's prompt. Given the abundance of medically dubious diagnoses and treatments floating around the internet, these patterns of LLM behavior could contribute to the spread of medical misinformation, particularly if people see LLMs as trustworthy.
OpenAI has reported that the GPT-5 series of models is markedly less sycophantic and less prone to hallucination than their predecessors, so the results of those studies might not apply to ChatGPT Health. The company also evaluated the model that powers ChatGPT Health on its responses to health-specific questions, using its publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and refrain from causing users unnecessary stress by telling them their condition is more serious than it really is. It's reasonable to assume that the model underlying ChatGPT Health exhibited these behaviors in testing, though Bitterman notes that some of the prompts in HealthBench were generated by LLMs, not users, which could limit how well the benchmark translates to the real world.
