I see more and more research papers coming out these days about different uses of large language models (LLMs, a type of AI) in healthcare. There are papers evaluating them for supporting clinicians in decision-making, aiding in note-taking and improving clinical documentation, and enhancing patient education. But I see a trend in the titles and conclusions of these papers, exacerbated by media headlines, of making sweeping claims about the performance of one model versus another. I challenge everyone to pause and consider a critical fact that is less obvious: the prompt matters just as much as the model.
As an example of this, I will link to a recent pre-print of a research article I worked on with Liz Salmi (pre-print here).
Liz nerd-sniped me with an idea for a study in which a patient and a neuro-oncologist would evaluate LLM responses to patient-generated queries about a chart note (or visit note, open note, or clinical note, whatever you want to call it). I say nerd-sniped because I got very interested in designing the methods of the study, including making sure we used the APIs to run these ‘chat’ sessions so that the prompts were not influenced by custom instructions, ‘memory’ features within the account or chat session, etc. I also wanted to test something I’ve observed anecdotally from personal LLM use across other topics: with 2024-era models, the prompt matters a lot for what type of output you get. So that’s the study we designed and wrote with Jennifer Clarke, Zhiyong Dong, Rudy Fischmann, Emily McIntosh, Chethan Sarabu, and Catherine (Cait) DesRoches, and I encourage you to check out the pre-print and enjoy the methods section, which is critical for understanding the point I’m trying to make here.
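For readers who want a concrete picture of what “using the APIs” looks like in practice, here is a minimal sketch of a stateless chat-completion call in Python. It assumes an OpenAI-style API and a placeholder model name (not the exact setup from our study); the point is simply that each query is sent as a fresh request containing only the prompt we supply, so no account-level custom instructions, chat history, or ‘memory’ features can leak into the output.

```python
# Minimal sketch: a stateless LLM call via the API.
# Assumes the OpenAI Python SDK and a placeholder model name; illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single prompt with no prior history, system persona, or memory."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # only the prompt we control
        temperature=0,  # reduce run-to-run variation for evaluation
    )
    return response.choices[0].message.content

print(ask_once("Explain the treatment options for early-stage breast cancer."))
```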
In this study, the data showed that when LLM outputs were evaluated for a healthcare task, the results varied significantly depending not just on the model but also on how the task was presented (the prompt). Specifically, persona-based prompts—designed to reflect the perspectives of different end users like clinicians and patients—yielded better results, as independently graded by both an oncologist and a patient.
The Myth of the “Best Model for the Job”
Many research papers conclude with simplified takeaways: Model A is better than Model B for healthcare tasks. While performance benchmarking is important, this approach often oversimplifies reality. Healthcare tasks are rarely monolithic. There’s a difference between summarizing patient education materials, drafting clinical notes, and assisting with complex differential diagnoses.
But even within a single task, the way you frame the prompt makes a profound difference.
Consider these three prompts for the same task:
- “Explain the treatment options for early-stage breast cancer.”
- “You’re an oncologist. Explain the treatment options for early-stage breast cancer.”
- “You’re an oncologist. Explain the treatment options for early-stage breast cancer as you would to a newly diagnosed patient with no medical background.”
The second and third prompts likely result in more accessible and tailored responses. If a study only tests general prompts (e.g. prompt one), it may fail to capture how much more effective an LLM can be with task-specific guidance.
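To make this concrete, here is a small sketch that sends all three framings of the same question to the same model under identical settings, so the prompt is the only variable. As before, the model name is a placeholder and this is an illustration, not the study’s actual evaluation code.

```python
# Sketch: hold the model constant and vary only the prompt framing.
# Assumes the OpenAI Python SDK and a placeholder model name; illustrative only.
from openai import OpenAI

client = OpenAI()

prompt_variants = {
    "general": "Explain the treatment options for early-stage breast cancer.",
    "persona": "You're an oncologist. Explain the treatment options for early-stage breast cancer.",
    "persona_audience": (
        "You're an oncologist. Explain the treatment options for early-stage breast cancer "
        "as you would to a newly diagnosed patient with no medical background."
    ),
}

for label, prompt in prompt_variants.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # same model for every variant, so the prompt is the only variable
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(f"--- {label} ---\n{response.choices[0].message.content}\n")
```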
Why Prompting Matters in Healthcare Tasks
Prompting shapes how the model interprets the task and generates its output. Here’s why it matters:
- Precision and Clarity: A vague prompt may yield vague results. A precise prompt clarifies the goal and the speaker (e.g. in prompt 2), and also often the audience (e.g. in prompt 3).
- Task Alignment: Complex medical topics often require different approaches depending on the user—whether it’s a clinician, a patient, or a researcher.
- Bias and Quality Control: Poorly constructed prompts can inadvertently introduce biases into the output.
Selecting a Model for a Task? Test Multiple Prompts
When evaluating LLMs for healthcare tasks—or applying insights from a research paper—consider these principles:
- Prompt Variation Matters: If an LLM fails on a task, it may not be the model’s fault. Try adjusting your prompts before concluding the model is ineffective, and avoid sweeping claims about a field or topic that aren’t supported by the test you are actually running.
- Multiple Dimensions of Performance: Look beyond binary “good” vs. “bad” evaluations. When thinking about performance in healthcare, consider dimensions like readability, clinical accuracy, and alignment with user needs. In our paper, we saw some cases where the patient’s and provider’s ratings overlapped, and others where they differed.
- Reproducibility and Transparency: If a study doesn’t disclose how prompts were designed or varied, its conclusions may lack context. Reproducibility in AI studies depends not just on the model, but on the interaction between the task, the model, and the prompt design. You should be looking for these kinds of details when reading or peer reviewing papers, and take results and conclusions with a grain of salt if these methods are not detailed in the paper (a sketch of the kind of details worth capturing follows this list).
- Involve Stakeholders in Evaluation: As shown in the preprint mentioned earlier, involving both clinical experts and patients in evaluating LLM outputs adds critical perspectives often missing in standard evaluations, especially as research evolves to focus on supporting patient needs rather than simply on clinician and healthcare system usage of AI.
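As a rough illustration of the reproducibility point above, here is a sketch of the kind of metadata one could log alongside each model output so that someone else could rerun the test. The field names and the save_record helper are hypothetical, invented for illustration; they are not from the study or any specific library.

```python
# Sketch: capture enough metadata with each output that the test could be rerun.
# Field names and this helper are illustrative, not from the study.
import json
from datetime import datetime, timezone

def save_record(path: str, *, model: str, prompt_label: str, prompt_text: str,
                temperature: float, output: str) -> None:
    """Append one evaluation record as a JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,                # exact model/version string used
        "prompt_label": prompt_label,  # e.g. "general" vs. "persona_audience"
        "prompt_text": prompt_text,    # the full prompt, verbatim
        "temperature": temperature,    # sampling settings matter for reproducibility
        "output": output,              # the response that was graded
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```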
What This Means for Healthcare Providers, Researchers, and Patients
- For healthcare providers, understand that the way you frame a question can improve the usefulness of AI tools in practice. A carefully constructed prompt, such as one that adds a persona or requests information for a specific audience, can change the output.
- For researchers, especially those developing or evaluating AI models, it’s essential to test prompts across different task types and end-user needs. Transparent reporting on prompt strategies strengthens the reliability of your findings.
- For patients, recognize that AI-generated health information is shaped by both the model and the prompt. This awareness can support critical thinking when interpreting AI-driven health advice. Remember that LLMs can be biased, but so can humans in healthcare. The same approach used to assess bias and evaluate experiences in healthcare should be applied to LLM output as well as human output. Everyone (humans) and everything (LLMs) are capable of bias or errors in healthcare.
TLDR: Instead of asking “Which model is best?”, a better question might be:
“How do we design and evaluate prompts that lead to the most reliable, useful results for this specific task and audience?”
I’ve observed, and this study adds evidence, that the interaction between the prompt and the model matters.