14 comments

  • Fripplebubby 16 hours ago
    Interesting quote from the venturebeat article linked:

    > “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.

    In order for an LLM to really do this task the right way (comparable to a physician), it needs to not only use what the human gives it but also be effective at extracting the right information from the human. The human might not know what is important, or they might be disinclined to share, and physicians learn to overcome this. However, that isn't actually what happened in this study: the participants were diagnosing a made-up scenario in which the symptoms were clearly presented to them, and they had no incentive to lie or withhold embarrassing symptoms, since the symptoms weren't actually happening to them. And yet the participants still failed to communicate all the necessary information effectively.

    • littlestymaar 11 hours ago
      > In order for an LLM to really do this task the right way (comparable to a physician), they need to not only use what the human gives them but be effective at extracting the right information from the human

      That's true for most use cases, especially for coding.

    • TZubiri 16 hours ago
      As a patient, I am responsible for sharing information with my doctor. I wouldn't hold it against them if they didn't extract information from me.
      • smogcutter 16 hours ago
        Sure, but think of a good help desk tech: if they waited for users to accurately report useful information, nothing would ever get fixed.
        • aleph_minus_one 5 hours ago
          > Sure, but think of a good help desk tech: if they waited for users to accurately report useful information, nothing would ever get fixed.

          I sometimes also have to do "help-desk-like" duties for the applications I am responsible for (think third-level technical support):

          I can tell you that you can train your users to give more helpful information (though of course they sometimes don't know themselves what is important and what is not).

      • SecretDreams 16 hours ago
        Sure. But as a patient, you are also not expected to know what is or isn't important. Omitting information that seems unimportant to you because your brain applies a low-pass filter is partly what the doctor is trying to bypass.
        • mlinhares 14 hours ago
          It's as if every single person had to be an expert in every field to be able to function. That's really not a thing; we expect the actual experts to know how to extract the needed information.

          That's one of the main differences between mediocre and incredible engineers: being able to figure out what problem actually needs to be solved, rather than working on whatever a stakeholder asks them to build.

      • numpad0 14 hours ago
        Okay, so your code has been segfaulting at line 123 in complicated_func.cpp, and you want to know which version of libc you have to roll back to, as well as any related packages.

        What's the current processor temperature, EPS12V voltage, and ripple peaks if you have an oscilloscope? Could you paste cpuinfo? Have you added or removed any RAM or PCIe devices recently? Does the chassis smell and look normal: no billowing smoke, screeching noise, fire?

        Good LLMs might start asking these questions soon, but you wouldn't supply this information at the beginning of the interaction (and it's always the PSU).

      • mumbisChungo 14 hours ago
        Yeah, there's a lot of agency on both sides of the equation when it comes to any kind of consultant. You're less likely to have bad experiences with doctors if you're self-aware and thoughtful about how you interact with them.
      • BriggyDwiggs42 14 hours ago
        You don’t want to make systems that require people to be as diligent as you because those systems will have bad outcomes.
  • hackitup7 14 hours ago
    This is just a random anecdote, but ChatGPT (when given many, many details with 100% honesty) has essentially matched what doctors told me in every case where I've tested it. This was across several non-serious situations (what's this rash) and one quite serious situation, although the last is a decently common condition.

    The two times that ChatGPT got a situation even somewhat wrong, were:

    - My kid had a rash and ChatGPT thought it was one thing. His symptoms changed slightly the next day, I typed in the new symptoms, and it got it immediately. We had to go to urgent care to get confirmation, but in hindsight ChatGPT had already solved it.

    - In another situation my kid had a rash with somewhat random symptoms and the AI essentially said "I don't know what this is but it's not a big deal as far as the data shows." It disappeared the next day.

    It has never gotten anything wrong other than these rashes. Including issues related to ENT, ophthalmology, head trauma, skincare, and more. Afaict it is basically really good at matching symptoms to known conditions and then describing standard of care (and variations).

    I now use it as my frontline triage tool for assessing risk. Specifically, if ChatGPT says "see a doctor soon/ASAP" I do it; if it doesn't say to go see a doctor, I use my own judgment, i.e. I won't skip a doctor trip if I'm nervous just because the AI said so. This is all 100% anecdotes and I'm not disagreeing with the study, but I've been incredibly impressed by its ability to rapidly distill the medical standard of care.

    • extr 12 hours ago
      I've had an identical experience of ChatGPT misidentifying my kid's rash. In my case I would say it got points for being in the same ballpark - it guessed HFM, and the real answer was "an unnamed similar-ish virus to HFM, but not HFM proper". The treatment was the same, just let it run its course, and our kid was fine. But I think it also made me realize that our pediatrician is still quite important in the sense that she has local, contextual, geography-based knowledge of what other kids in the area are experiencing too. She recognized it immediately because she had already seen two dozen other kids with it in the last month. That's going to be hard for any AI system to replicate until some distant time when all healthcare data is fed into The Matrix.
    • brundolf 12 hours ago
      I wonder if the software developer mindset plays into this. We're really good at over-reporting all possibly-relevant information for "debugging" purposes
    • forgetfreeman 14 hours ago
      I sincerely hope your credulity doesn't swing around to bite you in the ass with this.
  • zora_goron 15 hours ago
    This difference between medical board examinations and real world practice is something that mirrors my real-world experience too, having finished med school and started residency a year ago.

    I’ve heard others say before that real clinical education starts after medical school and once residency starts.

    • keeptrying 14 hours ago
      Could you elaborate on what you mean?

      That 80% of medical issues could be categorized as "standard medicine" with some personalization to the person?

      In residency you obviously see a lot of complicated real-life cases, but aren't the majority of cases something a non-resident could guide, if not diagnose?

  • bryant 17 hours ago
    For anyone keen on dissecting this further, they uploaded enough to github for people to dive into their approach in depth.

    https://github.com/am-bean/HELPMed (also linked in the paper)

  • dosinga 17 hours ago
    Really what it seems to say is that LLMs are pretty good at identifying underlying causes and recommending medical actions, but if you let humans use LLMs to self-diagnose, the whole thing falls apart - if I read this correctly.
    • majormajor 17 hours ago
      Yeah it sounds like "LLMs are bad at interacting with lay humans compared to being prompted by experts or being given well-formed questions like from licensing exams."

      Feels to me like how two years ago "prompt engineering" got a bunch of hype in tech companies, and now is nonexistent because the models began being trained and prompted specifically to mimic "reasoning" for the sorts of questions tech company users had. Seems like that has not translated to reasoning their way through the sort of health conversations a non-medical-professional would initiate.

    • wongarsu 16 hours ago
      And there seem to be concrete results that would allow you to improve the LLM prompt to make these interactions more successful. Apparently giving the human 2-3 possible options and letting the human have the final choice was a big contributor to the bad results. Their recommendations go the route of "the model should explain it better" but maybe the best results would be achieved if the model was prompted to narrow it down until there is only one likely diagnosis left. This is more or less how doctors operate after all.
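
      A rough sketch of what such a narrowing prompt could look like (the system prompt wording and model name here are just illustrative assumptions, not taken from the paper):

          # Hypothetical sketch: a system prompt that pushes the model to ask
          # one question at a time and narrow down to a single likely diagnosis.
          from openai import OpenAI

          client = OpenAI()

          SYSTEM_PROMPT = (
              "You are a medical triage assistant. Ask one focused follow-up "
              "question at a time and do not list multiple possible conditions. "
              "Only when a single condition remains most likely, name it and "
              "recommend a level of care: self-care, see a GP, or emergency."
          )

          def triage_turn(history):
              # history is a list of {"role": "user" | "assistant", "content": ...}
              response = client.chat.completions.create(
                  model="gpt-4o",  # placeholder model name
                  messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
              )
              return response.choices[0].message.content
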
  • cckolon 2 hours ago
    My wife is a doctor. When I showed her this paper she said “that’s true for humans as well”.
  • twotwotwo 16 hours ago
    At work, one of the prompt nudges that didn't work was asking it to ask for clarifications or missing info rather than charge forward with a guess. "Sometimes do X" instructions don't do well generally when the trigger conditions are fuzzy. (Or complex but stated in few words like "ask for missing info.") I can believe part of the miss here would be not asking the right questions--that seems to come up in some of their sample transcripts.

    In general at work nudging them towards finding the information they need--first search for the library to be called, etc.--has been spotty. I think tool makers are putting effort into this from their end: newer versions of IDEs seemed to do better than older ones and model makers have added things like mid-reasoning tool use that could help. The raw Internet is not full of folks transparently walking through info-gathering or introspecting about what they know or don't, so it probably falls on post-training to explicitly focus on these kinds of capabilities.

    I don't know what you really do. You can lean on instruction-following and give a lot of examples and descriptions of specific times to ask specific kinds of questions. You could use prompt distillation to try to turn that into better model tendencies. You could train on lots of transcripts (these days they'd probably include synthetic). You could do some kind of RL for skill at navigating situations where more info may be needed. You could treat "what info is needed and what behavior gets it?" as a type of problem to train on like math problems.
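
    As a very crude illustration of that last idea, the training signal could be as simple as rewarding replies that ask about the information the scenario deliberately withheld (entirely hypothetical, just to make the shape concrete):

        # Hypothetical reward sketch for "ask when key info is missing":
        # score a model reply higher if it asks a question that touches any
        # of the facts the scenario withheld, instead of committing to a guess.
        def info_seeking_reward(reply: str, missing_facts: list[str]) -> float:
            reply_lower = reply.lower()
            asks_question = "?" in reply
            touches_missing = any(fact.lower() in reply_lower for fact in missing_facts)
            if asks_question and touches_missing:
                return 1.0  # asked about something it actually needed
            if asks_question:
                return 0.3  # asked, but not about the withheld information
            return 0.0      # charged forward with a guess

        # e.g. info_seeking_reward("How long has the rash been there?", ["rash", "duration"]) -> 1.0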

  • keeptrying 14 hours ago
    I've seen that LLMs hallucinate in very subtle ways when guiding you through a course of treatment.

    Once, when I had to administer eyedrops to a parent, I saw redness and was being conservative; it told me the wrong drop to stop. The doctor saw my parent the next day so it was all fixed, but it did lead to me freaking out.

    Doctors behave very differently from how we normal humans behave. They go through testing that not many of us would be able to sit through, let alone pass. And they are taught a multitude of subjects so far removed from the subjects everyone else learns that we have no way to truly communicate with them.

    And this massive chasm is the problem, not that the LLM is the wrong tool.

    Thinking probabilistically (mainly Bayesian) and understanding the first two years of med school will help you use an LLM much more effectively for your health.

  • callc 12 hours ago
    My immediate reaction is “absolutely not”. Unless the healthcare provider is willing to accept liability for the output and recommendations of their LLM. Are they willing to put their money where their mouth is? Or are they just trying to reduce cost and increase profit?

    Then I think: if you don’t have access to good healthcare, need to wait weeks or months to get anywhere, or healthcare is extremely expensive, then an LLM may be a good option, even with the chance of bad (possibly deadly) advice.

    If there are any doctors here, would love to hear your opinion.

  • wongarsu 17 hours ago
    That's an interesting result. I would love to see a follow-up with three arms: humans with assistance from an LLM, humans with assistance from a doctor, and humans with no assistance.

    This study tells us that LLM assistance is as good as no assistance, but any investigation of the cause feels tainted by the fact that we don't know how much a human would have helped.

    If we believe the assertion that LLMs are on a similar level as doctors at finding the conditions on their own, does the issue appear in the description the humans give the LLM, the way the LLM talks to the human, or the way the human receives the LLM's suggestions? When looking at chat transcripts they seem to identify issues with all three, but there isn't really a baseline for what we would consider "good" performance.

  • pyman 13 hours ago
    Interesting paper. LLMs have the knowledge but lack social skills: they fail when interacting with real patients. So maybe the real bottleneck isn't knowledge after all?
    • im3w1l 6 hours ago
      Not sure if bottleneck is the right word. It seems more like something that was overlooked but turned out to be important, and now that we know it matters, it might not be too hard to fix.
  • dhash 16 hours ago
    I love this kind of research, since it correctly identifies some issues with the way the public interacts with LLMs. Thank you for the evening reading!

    I’d love to see future work investigating:

    - how does this compare to expert users (doctors/LLM magicians using LLMs to self-diagnose)

    - LLMs often provide answers faster than doctors, and often with less hassle (what’s your insurance?); to what extent does latency impact healthcare outcomes

    - do study participants exhibit similar follow-on behavior (upcoding, seeking a second opinion, doctors) to others in the same professional discipline

    • dgfitz 16 hours ago
      > how does this compare to expert users

      You’re conflating a person trained in a craft (medicine) with a person good at asking a next-token-generator (anybody), and passing it off as if it were a given. It’s not.

  • ekianjo 17 hours ago
    > perform no better than the control group

    This is still impressive. Does it mean it can replace humans in the loop with no loss?

    • jdiff 17 hours ago
      No, the control group was instructed to "use any methods they would typically employ at home." So ChatGPT is no better than WebMD.
      • ekianjo 17 hours ago
        It's better in the sense that it's faster at giving you an answer than reading pages of WebMD.
        • majormajor 17 hours ago
          Where are you getting that from? (And again, there's no more "human in the loop" in "reading WebMD" than in "talk to a chatbot.")

          > Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at least one relevant condition in at most 34.5% of cases compared to 47.0% for the control.

          So good old "do your own research" (hardly a gold standard itself, at 47%) is doing about 35% better for people than "talk to the chatbot."

          The more interesting part is:

          > We found that the LLMs suggested at least one relevant condition in at least 65.7% of conversations with participants [...] with observed cases of participants providing incomplete information and LLMs misinterpreting prompts

          since this is nearly double the rate at which participants actually came away with a relevant condition identification, suggesting that the bots are way worse at the interactions than they are at the information. That's presumably trainable, but it also requires a certain patience and willingness on the part of the human, which seems like a bit of a black art for a machine to be able to learn how to coax out of everyone all the time.

          But it's not just a failure to convince, it's also a failure to elicit the right information and/or understand it - the LLM, when prompted in a controlled fashion rather than having to hold a conversation with the participant, found at least one relevant condition even more often still!

        • brianpan 17 hours ago
          You're wrong most of the time, but at least you get there quickly.
    • majormajor 17 hours ago
      What human? The control group was "instructed to instead use any methods they would typically employ at home." Most people don't have human-doctors-in-the-loop at home.