Want signal? We need more noise (more examples to address the quiet bottleneck)

Note: I’m assuming that if you are reading this, you’ve first read my other post summarizing the concept. The below is a companion piece with some expanded details about the concepts and more examples of how I think addressing this bottleneck will help make a difference.

You might hold the perspective that in the growing era of AI, there’s too much noise already. Slopocalypse. (Side note: this entire post, and my other post, was written by me, a human.) That’s not the biggest problem in healthcare, though: in both research and clinical care, so much critical data is simply not collected. We are missing sooooo much important data. Some of this is an artifact of past clinical trial design and how hard it was to collect or analyze or store the data; there were fewer established norms around data re-use; and some of it is a “collect the bare minimum to study the endpoints because that’s all we are supposed to do” phenomenon.

Nowadays, we should be thinking about how to make data and insights from clinical work and research available to AI, too, because AI will increasingly be used by humans to sort through what is known and where the opportunities are, plus make cross-domain connections that humans have been missing. And we need to be thinking about whether the incentives are set up correctly (spoiler: they’re not) to make sure all available data is able to be collected and shared, or at least stored, for humans and AI to have access to for future insight. What is studied is also ‘what is able to be funded’, which is disproportionately things that come to the commercial market eventually after regulatory approval.

“But what’s the point?” you ask. “If no one is looking at the data from this study, why collect it?” Because the way we design studies – to answer a very limited-scope question, i.e. safety and efficacy for labeling claims and regulatory approval – is very different from the studies we need around optimizing and personalizing treatments. If you look at really large populations of people accessing a treatment – think GLP-1 RA injectables, for example – you eventually start to see follow-on studies around optimization and different titration recommendations. But most diseases and most treatments aren’t for a population even 1/10th the size of people accessing those meds. So those studies don’t get done. Doctors don’t necessarily pay attention to this; patients may or may not report relevant data about this back to the doctor (and again even if they do: doctors often don’t mentally or physically store this data); and the signal is missed because we don’t capture this ‘noise’.

This means people with rare diseases, undiagnosed diseases, atypical or seronegative diseases, unusual responses to treatments, multiple conditions (comorbidities), or any other scenario that results in them being part of a small population and where customizing and individualizing their care is VERY important…they have no evidence-based guidance. Clinicians do the best they can to interpret in the absence of evidence or guidelines, but no wonder patients often turn to LLMs (AI) for additional input and contextualization and discussion of the tradeoffs and pros and cons of different approaches.

We have to: the best available human (our clinicians) may not have the relevant expertise; or they may not be available at all; or they may be biased (even subconsciously) or forgetful of critical details or not up to date on the latest evidence; or they may be faced with a truly novel situation and they don’t have the skills to address it because they’re used to cookie-cutter standardized cases. This is not a dig at clinicians, but I recognize that we now have tools that can address some of the existing flaws in our human-based healthcare system. I keep talking about how we need to recognize that evaluations of AI in healthcare shouldn’t treat the status quo as the baseline to defend, because the status quo itself has problems, as I just described.

And AI has some of the same problems. (Except for the ‘not available’ part, unless you consider the lower-utility free access models to mean that the more advanced, thinking-based models are ‘not available’ because of cost access barriers). An AI may not have any training data on a rare disease… because nothing exists. It may drop information out of the context window, and we may not realize that this has happened (i.e., it ‘forgets’ something). Usually these are critiques of AI, juxtaposed against the implication that humans are better. But notice that these critiques are the same of humans, too! This happens all the time with human clinicians in healthcare. A human can’t make decisions on data that doesn’t exist in the world, either!

So…how do we “fix” AI? Or, how do we fix human healthcare? We should be asking BOTH questions. Maybe the answer is the same: increase the noise so we increase the signal. Argue about the ratio later (of signal:noise), but increase the amount of everything first. That is most important.

How might we do this? I have been thinking about this a lot, and Astera recently posted an essay contest asking for ideas that don’t fit the current infrastructure. (That’s why I’m finally writing this up – not because I think I’ll win the essay contest, but mostly because it’s an opportunity for people to consider whether or not this type of solution is a good idea as a separate analysis from “well, who is going to fund THAT?”. Let’s discuss and evaluate the idea, or riff on it, without being bogged down by the ‘how exactly it gets funded and managed over time’.)

I think we should create some kind of digitally-managed platform/ecosystem to do the following:

  1. Incentivize and facilitate written (AI-assisted allowed) output of everything. Case reports and scenarios of people and what they’re facing. All the background information that might have contributed. All the data we have access to that is related to the case, plus other passive, easy-to-collect data (that may already be collected, e.g., wearables and phone accelerometer data) even if it does not appear to be related. Patient-perspective narratives, clinical interpretations, clinical data, wearable data – all of this.

    A.) And, be willing to take variable types of data even from the same study population. For example, as a person with type 1 diabetes, I have 10+ years of CGM data. For a future study on exocrine pancreatic insufficiency, for example, not everyone will have CGM data, but because CGM data is increasingly common in people with T2D (a much larger population) and in the general population, a fraction of people will have CGM data and a fraction of people are willing to share it. We should enable this, even if it’s not for a primary endpoint analysis on the EPI study and even if only a fraction of people choose to share it – it might still be useful for subgroup analysis, determining power for a future study where it is part of the protocol, and identifying new research directions!

  2. Host storage somewhere. There can be some automated checking against identifiable information and sanity checking what’s submitted into the repository, plus consent (either self-consent or signed consent forms for this purpose.)
  3. Provide ‘rewards’ as incentives for inputting cases and data.

    A) Rewards might differ for patients self-submitting data and researchers and clinicians inputting data and cases. Maybe it’s one type of reward for patients and a batch reward of a free consult or service of some kind for researchers/clinicians (e.g. every 3 case reports submitted = 1 credit, credits can be used toward a new AI tool or token budget for LLMs or human expertise consult around study design, or all kinds of things.)

    B) Host data challenges where different types of funders incentivize different disease groups or families of conditions to be added, to round out what data is available in different areas.

  4. Enable AI and human access (including to citizen scientists/independent researchers who don’t have institutions) to these datasets, after self-credentialing and providing a documented use case for why the data is being accessed.

    A) Require anyone accessing the dataset to make their analysis code open source, or otherwise openly available for others to use, and to make the results available as well.

    B) Tag anytime an individual dataset is used by a project, that way individuals with self-donated data (or clinicians submitting a case) can revisit and see if there are any research insights or data analysis that applies to their case.

    C) Provide starter projects to show how the data can be used for novel insight generation. For example, develop sandboxes for different types of datasets with existing ‘lab notebooks’, so to speak, to onboard people to different datasets/groups of data and different types of analyses they might do, and to cut down on the environment setup barriers to getting started with analyzing this data.

    D) Facilitate outreach to institutions, disease-area patient nonprofits and advocacy groups both to solicit data inputs AND to use the data for generating outputs.
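To make point 4B above concrete, here is a minimal sketch in Python of the kind of usage tagging that would let a contributor look up which projects have touched their data. All names here (the class, the dataset and project IDs) are hypothetical illustrations, not a real platform API.

```python
# Hypothetical sketch of the provenance-tagging idea: every time a
# project accesses a contributed dataset, the platform records the link,
# so contributors can later look up what their data informed.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ProvenanceLedger:
    # dataset_id -> list of project ids that used it
    usage: dict = field(default_factory=lambda: defaultdict(list))

    def tag_use(self, dataset_id: str, project_id: str) -> None:
        """Record that a project used a contributed dataset."""
        self.usage[dataset_id].append(project_id)

    def projects_for_contributor(self, dataset_id: str) -> list:
        """Let a contributor see every project that touched their data."""
        return list(self.usage[dataset_id])

ledger = ProvenanceLedger()
ledger.tag_use("patient-A-cgm", "nmj-antibody-study")
ledger.tag_use("patient-A-cgm", "epi-screening-study")
print(ledger.projects_for_contributor("patient-A-cgm"))
# prints ['nmj-antibody-study', 'epi-screening-study']
```

A real implementation would need durable storage, access control, and notification hooks, but the core idea is just a queryable many-to-many link between contributed datasets and downstream projects.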

Why should we do this?

  • Clinical trials do not gather all of the possibly useful data, because it’s not for their primary/secondary endpoints. Plus, they would have to clean it upon collection. Plus storage cost. Plus management decisions. Etc. There are a lot of real barriers there, but the result is that clinical trials do not capture all of the potentially relevant data. So we need a different path.
  • Clinicians/clinical pathways don’t have ways to review data, so they avoid doing it. There’s no path to “submit this data to my record but I don’t expect you to look at it” in medical charts (but there should be). So we need a different path.
  • Disease groups/organizations sometimes host and capture datasets, but like clinical trials, this is limited data; it often comes through clinical pathways (further limiting participation and diversity of the data); it doesn’t include all the additional data that might be useful. So we need a different path.
  • Some patients do publish or co-author in traditional journals, but it’s a limited, self-selected group. Then there are the usual journal hurdles that filter this down further. Not everyone knows about pre-prints. And, not everyone knows how useful their data can be – either as an n=1 alone or as an n=1*many for it to be worth sharing. A lot of people’s data may be in video/social media formats inaccessible to LLMs (right now) or locked in private Facebook groups or other proprietary platforms. So we need a different path.

Thus, the answer to ‘why we should do this’ is recognizing that AI does not create magic out of nowhere. It relies on training data and extrapolating from that, plus web search. If it’s not searchable or findable, and it’s not in the training data, it’s not there. Even with newer models, prompting them to extrapolate doesn’t solve all of these problems. What AI can do is shaped disproportionately by formal literature, institutional documents, and whatever scraps of publicly available content remain. If we want to shape what’s possible in the future, we need to start now by shaping and collecting the inputs to make it happen. We can’t fix past trial designs, but we can start to fill in the gaps in the data!

Here is an example of how this might work and why it matters to build and incentivize this type of data sharing.

  1. If we build and incentivize data sharing, the following individuals might self-donate their data and write-ups, or a clinician might submit them. Later, someone might analyze the data and put the pieces together.
  • Someone with type 1 diabetes (an autoimmune condition) and a new-onset muscle-related issue. Glucose levels are well-managed; they don’t have neuropathy/sensory issues (which are common complications of decades of living with type 1 diabetes); and muscle damage and inflammation markers are normal. MRIs are normal. The person submits their data, which includes their lab tests, clinical chart notes (scrubbed for anonymity), a patient write-up of what their symptoms are like, and exported data from their phone with years of motion/activity data.
  • A different person with Sjogren’s disease (also an autoimmune condition) and a similar new-onset muscle-related issue. There are known neurological manifestations or associations with Sjogren’s, but more typically those are small-fiber neuropathy or similar. The symptoms here are not sensory or neuropathy. MRIs don’t show inflammation or muscle atrophy. Their clinician is stumped, but knows they don’t see a lot of patients with Sjogren’s and wonders if there is a cohort of people with Sjogren’s facing this. The clinician asks the patient and gains consent; scrubs the chart note of identifying details, and submits the chart notes, labs, and a clinician summary describing the situation.
  • Later, a researcher (either a traditional institutional-affiliated researcher OR a citizen scientist, such as a third person with a new-onset muscle-related issue) decides to investigate a new-onset muscle-related issue. They register their hypothesis: that there may be a novel autoimmune condition that results in this unique muscle-related issue as a neuromuscular disease (that’s not myasthenia gravis or another common NMJ disease) that shows up in people with polyautoimmunity (multiple autoimmune conditions), but we don’t know which antibodies are likely correlated with it. The research question is to find people in the platform with similar muscle-related conditions and explore the available lab data to help find what might be the connecting situation and classify this disease or better understand the mechanism.
  • They are granted access to the platform, start analyzing, and come across an interesting correlation with antibody X – considered a standard autoimmune marker that doesn’t differentiate by disease – which highly correlates with this possibly novel muscle condition: it is elevated in people with at least one existing autoimmune condition and these muscle symptoms who are seronegative for every other autoimmune and neurologic condition. Further, by reading the clinical summary related to patient B and the self-written narrative from patient A, it becomes clear that this is likely neuromuscular – stemming from transmission failure along the nerves – and that the muscles are the ‘symptom’ but not the root cause of disease. This provides an avenue for a future research protocol to 1) characterize these types of patients into a cohort so it can be determined whether this is a novel disease or a subgroup of an adjacent condition (e.g., a seronegative subgroup); 2) track whether the antibody levels are treatment-sensitive or stay elevated always; and 3) identify cohorts of treatments that can be trialed off-label because they work in similar NMJ diseases, even though the mechanism isn’t identical.
  • The mechanism for this novel disease isn’t proven yet, but in the face of all the previous negative lab data and neurological testing patient A and patient B have experienced, it narrows things down from “muscle or neuromuscular” to likely neuromuscular, which is a significant improvement from their previous situations. Plus, this provides pathways for additional characterization, research, and eventual treatment options to explore, versus the current dead ends both patients (and their clinical teams) are stuck in. And because there are no good clinical pathways for these types of undiagnosed cases, this type of insight development across multiple cases would not have occurred at all without this database of existing data.

  2. The above is a small-n example, but consider a large dataset where there are hundreds or thousands of people with CGM data submitted. Plus meal tracking data, because people can export and provide that from whatever meal-logging apps some of them happen to use.

  • By analyzing this big dataset, an individual researcher could hypothesize that they could build a predictor to identify the onset of exocrine pancreatic insufficiency, which can occur in up to 10% of the general population (more frequently in older adults, people with any type of diabetes, as well as other pancreas-related conditions like pancreatitis, different types of cancer, etc.), by comparing increases in glucose variability that correlate with a change in dietary consumption patterns, notably around decreasing meal size and eventually lowering the quantity of fat/protein consumed. (These are natural shifts people make when they notice they don’t feel good because their body is not effectively digesting what they are eating.) They analyze and exclude the effect of GLP-1 RAs and other medications in this class: the effect size persists outside of medication usage patterns. This can later be validated and tested in a prospective clinical trial, but this dataset can be used to identify what level of correlation between meal consumption change and glucose variability change happens over what period of time in order to power a high-quality subsequent clinical trial. This may lead to an eventual non-invasive method to diagnose exocrine pancreatic insufficiency through wearable and meal-tracking data. (None exists today: only a messy stool test that no one wants to do and that is hampered by other issues for accuracy.)
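To illustrate the shape of that hypothesis, here is a hedged Python sketch: compare rolling glucose variability (coefficient of variation, a standard CGM metric) against average meal size, and flag when variability rises while meals shrink. The thresholds, function names, and synthetic data are illustrative assumptions, not validated clinical parameters – a real analysis would need far more careful methods and medication-effect exclusion.

```python
# Sketch: flag when glucose variability rises while meal size falls.
# All thresholds and data below are made up for illustration only.
from statistics import mean, stdev

def coeff_variation(values):
    """CV = stdev / mean, a standard glucose-variability metric."""
    return stdev(values) / mean(values)

def epi_screen_flag(weekly_glucose, weekly_meal_grams,
                    cv_rise=0.15, meal_drop=0.15):
    """Compare first vs. last week: did glucose CV rise while mean
    meal size fell by more than the (assumed) thresholds?"""
    cv_first = coeff_variation(weekly_glucose[0])
    cv_last = coeff_variation(weekly_glucose[-1])
    meal_first = mean(weekly_meal_grams[0])
    meal_last = mean(weekly_meal_grams[-1])
    cv_change = (cv_last - cv_first) / cv_first
    meal_change = (meal_last - meal_first) / meal_first
    return cv_change > cv_rise and meal_change < -meal_drop

# Synthetic example: variability climbs as meals shrink over 4 weeks.
glucose = [
    [100, 110, 95, 105, 98],   # week 1: fairly stable
    [90, 120, 100, 115, 95],
    [85, 130, 95, 125, 90],
    [80, 140, 90, 135, 85],    # week 4: much more variable
]
meals = [
    [60, 55, 65, 58],          # week 1 meal sizes (g fat+protein)
    [55, 50, 58, 52],
    [48, 45, 50, 44],
    [40, 38, 42, 36],          # week 4: noticeably smaller meals
]
print(epi_screen_flag(glucose, meals))  # prints True
```

The point is not this particular heuristic – it’s that the combined signal (meal pattern change plus glucose variability change) only exists if both data streams were captured and stored somewhere analyzable.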

These two examples show how even small-n data, or a large dataset with additional subgroups and additional datasets, can be useful. In the past, it’s been a “we need X, Y, Z data from everyone. A, B, C would be nice to have, but it’s hard and not everyone will share it (or is willing to collect it due to burden), so we won’t enable it to be shared for the 30% of people who are willing”. Thus, we lose the gift and contributions from the people who are able and willing to share that data. Sometimes that 30% is a small n, but that small n is >0 and may be the ONLY data that will eventually answer an important future research question.

We are missing so much, because we don’t collect it. So, we should collect it. We need a platform to do this outside of any single disease group patient registry; we need to support clinician and patient entries into this platform; we need to support intake of a variety of types of data; and we need to have low (but sensible) barriers to access so individuals (citizen scientists, patients themselves) can leverage this data alongside traditional researchers. We need all hands on deck, and we need more data collected.

Want signal? We need more noise (looking at the quiet bottleneck)

We need more signal, which means we want more noise. A lot of current scientific infrastructure is designed to minimize messiness: define a narrow question, collect the minimum data required to answer it, standardize the dataset, exclude complicating variables, finish the analysis, publish the result. That approach is understandable. It is also one reason we repeatedly fail the patients who most need evidence: people with unusual responses, multiple conditions, atypical phenotypes, rare diseases, or combinations of features that do not fit cleanly into existing bins. Our systems today are structurally designed to discard or never capture the kinds of heterogeneous, partial, contextual, or longitudinal data that could eventually make critical insights available to us.

My hypothesis is that one important scientific bottleneck is not a lack of intelligence, or even a lack of data in the abstract, but a lack of infrastructure for accepting, preserving, and reusing the kinds of data that fall outside formal trial endpoints and standard clinical workflows. I am proposing a solution: a shared platform that accepts optional, heterogeneous, participant- and clinician-contributed data, which will generate clinically useful subgroup hypotheses that standard trial and registry structures fail to generate.

Here are the quiet bottlenecks in the current system that I run into and that this platform would address:

  1. Atypical, comorbid, and small-population patients are systematically left without usable evidence because current research systems are optimized to suppress heterogeneous data rather than preserve it. (Clinical trials are usually designed around narrow endpoint collection, fixed inclusion and exclusion criteria, and standardized datasets that are tractable for a specific immediate question. Patients are often told there is “no evidence,” when what is really missing is a system that could have captured and preserved evidence that can help with future translation for atypical or edge-case presentations.)
  2. Useful data is routinely discarded because no existing structure is responsible for capturing it. (Clinical trials generally do not want to collect data outside primary and secondary endpoints because of cleaning, storage, analysis, and governance costs. Clinical care systems do not want to ingest large volumes of patient-generated data that clinicians are not expected to review. Disease registries are usually narrow and disease-specific. Journals are poor infrastructure for surfacing data. There is no home for this data.)
  3. AI and humans are constrained by the same missingness problem: the knowledge base is biased and uneven because we do not capture enough of the right kinds of real-world data. (LLMs and other AI systems are often criticized for bias, holes, and nonrepresentative training data. But this overlooks that clinicians and researchers are also reasoning from the same incomplete literature, selective trial populations, under-collected real-world evidence, and publication-filtered case reports. The bottleneck is upstream of both.) AI now makes it possible to overcome these constraints, but only if the data is collected and made available in the first place.
  4. Current systems privilege uniform completeness over partial but valuable contribution, which causes preventable signal loss. (If not everyone has or is willing to provide the data, it’s not collected. Subsets of participants who are willing and able to contribute additional data, such as wearables, CGM, meal logs, phone sensor streams, or symptom data, are prevented from doing so.)

Why current structures produce these bottlenecks

No single existing institution is responsible for optional data intake, long-term stewardship, and broad downstream access. And different structures prioritize different data capture, with no harmonization between research and clinical care settings.

Clinical trials are funded, regulated, and staffed to answer bounded questions. Clinical trial incentives reward endpoint completion, analyzable datasets, and publication-ready results. Anything beyond the core clinical trial protocol creates additional costs: more data cleaning, more storage, more governance, and more analytic work that may not directly serve the trial’s primary purpose. Under those conditions, narrowness is rational, but it is also systematically lossy.

Clinical care systems have a different but related design problem. Electronic health records are built for documentation, care coordination, and billing, not for capturing or allowing participant-generated, future-use data that a clinician does not need for real-time decision making. This illustrates the infrastructure gap: patients may have useful longitudinal data, but there is often no legitimate pathway to store it in a way that supports future analysis.

Disease registries and advocacy-group datasets help in some cases, but they are typically narrow in scope, tied to a disease silo, and constrained by the same pressures toward standardization and limited datastreams. Publications are also poorly matched to this problem, because they are optimized for polished outputs, not for preserving heterogeneous data. Much of the infrastructure was shaped by prior underlying constraints (when data capture, storage, and analysis were more expensive).

Testable hypotheses that would address these challenges:

  1. Allowing partial, nonuniform data donation from participants will outperform “uniform minimum dataset only” models for early discovery in rare, atypical, and comorbid populations.
  2. Structured case narratives combined with quantitative data will enable more useful discovery than quantitative data alone for poorly characterized conditions, because the narrative context helps identify mechanistic or phenotypic patterns that are otherwise lost.
  3. Modern tools (LLMs etc) make it feasible to analyze and normalize real-world participant-contributed data at a scale and cost that was previously impractical.
  4. For some under-characterized conditions, a heterogeneous real-world repository can identify candidate biomarkers, phenotypic subgroups, or prospective-study designs and lead to a shorter timeline for protocol development and funding of eventual trials, compared to the status quo.

Here are a few example experiments we can use to validate these hypotheses:

  1. Use heterogeneous real-world data to test whether a shared repository can generate an earlier-detection hypothesis that existing structures are poorly positioned to generate. For example, exocrine pancreatic insufficiency. EPI is common but underdiagnosed, and its current diagnostic pathway is unpleasant (a stool test) and imperfect. As digestion becomes less effective, people often change what and how they eat before anyone labels it as a problem: meal sizes may shrink, fat and protein consumption may drift downward, symptom patterns develop, and glucose variability may shift in parallel, especially in people with diabetes or others who happen to have CGM data. A conventional clinical trial would rarely collect all of these data together (GI symptoms, meal logs, and CGM data). We have shown that a patient-developed symptom survey can effectively distinguish between EPI and non-EPI gastrointestinal symptom patterns in the general population. But this relies on people knowing they have a problem and filling out the symptom survey to assess their symptoms. By analyzing CGM and meal logging data, we may be able to create a diagnostic signal from CGM data that provides earlier, noninvasive detection of EPI or other digestive problems. If successful, the output would be a concrete, testable hypothesis for a future prospective study: for example, that a specific combination of changing meal patterns and glucose variability could serve as an early noninvasive screening signal for EPI.
  2. A second, smaller experiment would test the same infrastructure in a very different setting: under-characterized autoimmune or neuromuscular overlap syndromes. Here the problem is not underdiagnosis at scale, but invisibility in small-n edge cases. Patients with unusual muscle-related symptoms, normal or ambiguous standard workups, and backgrounds that include autoimmune disease often remain isolated within separate clinics and disease silos. One person may carry a type 1 diabetes diagnosis, another Sjögren’s, another a different autoimmune history, and yet their new symptoms may share an overlooked mechanism. In current systems, these cases rarely become legible as a group because the relevant data are scattered across chart notes, patient stories, normal imaging, lab panels, and passive longitudinal data that no one is collecting or comparing systematically. This tests the same core hypothesis under tougher conditions: smaller numbers, less uniform data, less obvious endpoints, and a heavier dependence on narrative context. If the EPI case shows that optional heterogeneous data can support tractable hypothesis generation in an underdiagnosed condition, the neuromuscular case shows why the same infrastructure could matter even more in sparse-data, high-ambiguity situations where it is even harder to capture data to generate evidence in support of and funding for subsequent trials.
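As a toy illustration of how scattered small-n cases could become “legible as a group,” here is a Python sketch that represents each case as a set of coded features and groups cases whose feature overlap (Jaccard similarity) crosses a threshold. The feature codes, case labels, and threshold are entirely made up for illustration; a real platform would use richer phenotype representations.

```python
# Illustrative sketch only: group heterogeneous small-n cases by
# feature-set overlap so edge cases across disease silos can surface
# together. Feature codes and threshold are hypothetical.
def jaccard(a, b):
    """Overlap between two feature sets, from 0 (disjoint) to 1 (equal)."""
    return len(a & b) / len(a | b)

def group_similar_cases(cases, threshold=0.5):
    """Greedy single-pass grouping: each case joins the first existing
    group whose representative case is similar enough, else starts a
    new group. `cases` maps case id -> set of coded features."""
    groups = []
    for case_id, features in cases.items():
        for group in groups:
            rep = cases[group[0]]  # first member acts as representative
            if jaccard(features, rep) >= threshold:
                group.append(case_id)
                break
        else:
            groups.append([case_id])
    return groups

# Cases loosely modeled on the patients described above.
cases = {
    "A": {"autoimmune", "new-muscle-symptoms", "normal-mri", "seronegative"},
    "B": {"autoimmune", "new-muscle-symptoms", "normal-mri", "sjogrens"},
    "C": {"type2-diabetes", "neuropathy", "elevated-ck"},
}
print(group_similar_cases(cases))  # prints [['A', 'B'], ['C']]
```

Patients A and B, sitting in different clinics with different diagnoses, only cluster together because both cases were captured in one comparable place – which is exactly the infrastructure gap this experiment probes.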

What we might learn if these fail

We will learn what the barrier is, whether it is still a problem of infrastructure; a lack of the right people (or AI) leveraging the data; whether partial nonuniform data donation is operationally feasible; and whether limiting factors are data availability, harmonization, governance, or analytic quality. Plus, we can determine whether certain diseases or use cases (e.g. developing novel diagnostics versus assessing medication responses) are better suited than others for this type of platform.

Why now?

Passive and participant-generated data collection is easier. Wearables, phone sensing, CGM data, meal logging apps, symptom trackers, and similar are now significantly more common. Technology makes it easier than ever to create custom apps to track n=1 data or study-specific data. Storage is cheaper. Technology improvements, most notably AI tools, have made it more tractable to collect data and for researchers to analyze this data. The remaining bottleneck is capturing, storing, and making the data available. It is less a technological bottleneck than a bottleneck of funding, governance, and incentives. This is addressable now that the capture and analysis barriers have been lowered!

I don’t think the question we should be asking is whether every piece of heterogeneous data will be useful, but whether we can afford to keep doing what we are doing (throwing away data and still expecting discovery for the populations our current systems and infrastructure already quietly fail).

Note: I am submitting this post to the Astera essay contest, which you can read about here. You should write up your ideas about the bottlenecks you see and submit as well! I also wrote an additional piece with more details and examples, which you can read here. 

The data we are leaving behind in clinical trials, what it costs us, and why we should talk about it

I talk a lot about the data we leave behind. A lot of the time, this is in the context of clinical trial design. But more recently, it’s also about the data loss that drives bias in AI. And actually…that same bias exists in humans. And seeking to fix one may also fix the other!

To clarify, it’s not that no one cares (as a reason for why data loss or missingness happens). It is an artifact of our systems, funding priorities, and a whole bunch of historical decision making. Clinical trials cost money and are usually designed to answer a question around safety or efficacy, often in pursuit of regulatory approval and eventual commercial distribution. We need that (I’m not saying we don’t!), but we also need funding sources to answer additional questions related to titration, individual variance, medication impacts, more diseases, etc.

A lot of trials exclude people with type 1 diabetes, for example, so it’s always a question when new medications or treatments roll out whether they are appropriate or useful for people with T1D, or if there’s a reason why they wouldn’t work. This is sometimes because T1D might be seen as a muddy variable in the study, but a lot of the time it’s just a copy-paste decision from a previous study where it did matter, propagated forward into the next study where it doesn’t. There is no resulting distinction between the two: there is simply a lack of representation of people with T1D in the study population, which makes it hard to interpret and make decisions about this data in the future.

A lot of decisions around what data is collected, or who is enrolled in a study population, are an artifact of this copy-paste! Sometimes it’s a literal copy-paste, and sometimes a mental copy-paste of “we do things this way” because previously there was a reason to do so. Often that reason was cost (financial or time or hardship or lack of expertise). But in the era of increasing technological capabilities and lower costs – everything from the cost of LLM tokens to the decreasing cost of long-term data storage, plus the reduced cost of data collection from passive wearables and phone sensors, reduced hardship for participants to contribute data, and reduced hardship for researchers to clean and study the data – we should be revisiting A LOT of these decisions about ‘the way things are done’, because the WHY behind those decisions has likely changed.

The TLDR is that I want – and think we all need – more data collected in clinical trials. And I don’t think we need to be as narrow-minded anymore about having a single data protocol that all participants consent to. Hold up: before you get the torches and pitchforks out, I’m not saying not to have consent. I am saying we could have MULTIPLE consents. One consent for the study protocol. And a second consent where participants can evaluate and decide what additional permissions they want to provide around data re-use from the study, once the study is completed. (And possibly opt in to additional data collection that they may want to passively provide in parallel.)

The reason I think we should shift in this direction: out of extreme respect for patient preferences around privacy, and a desire to protect patients (and participants in studies – I’ll use participants/patients interchangeably, because this happens in clinical care as well as research), researchers sometimes err on the side of saying “data not available” after studies, because they did not design the protocol to store the data and distribute it in the future. A lot of the time this is because they believe patients did not consent to data re-use (because researchers did not ask!). Other times it’s honestly – and I say this as a researcher who has collaborated with a bunch of people at a bunch of institutions globally – inadvertent laziness, the copy-and-paste phenomenon again: it’s hard to do the very first time, so people don’t do it, and the omission keeps getting passed on to the next project, because figuring it out and doing it the first time takes work. (And researchers may not have included data storage in their budgets, etc.) Thus it propagates forward. By contrast, researchers who do this on one project tend to keep doing it, because they’ve figured it out – and written it into their research budgets for subsequent projects.

And other times, the lack of data re-use is a protective instinct: researchers may worry that the data could be used in a way that is harmful or negative. Depending on the type of data (genetic data versus glucose data, as two examples), that may be a very legitimate concern of study participants. Other times, it may be at odds with the wishes of the broader community. And sometimes, individuals simply have different preferences!

We could – and should – be able to satisfy BOTH preferences, by having an explicit consent (either opt-in or opt-out, depending on the project) that is an ADDITIONAL consent specifically indicating preferences around data being re-used for additional studies.

To emphasize why this matters, I’ll circle back to AI/LLMs. A criticism of AI models is that they are trained on data that is unrepresentative, has holes, or is biased. This is legitimate, and it is an artifact of the past design of trials. I have been thinking through the opportunities this provides now, in the current era of trial design, to be proactive and strategic about generating and collecting data that will fill these holes for the future! We can’t fix the way past trials were done, but we can adjust how we do trials moving forward, fill the holes, and update the knowledge base. This actually fixes the HUMAN PROBLEM as well as the AI problem: these criticisms of AI are the same criticisms we should be making about humans (researchers and clinicians, etc.), who are also trained on the same missing, messy, possibly biased data!

More data; more opt-in/out; more strategery in plugging the holes in the knowledge base moving forward, in part by not making the same mistakes in future research that we’ve made in the past. We should be designing trials not only for narrow endpoints (which we can still do) for safety and efficacy and regulatory approval and commercial availability, but also for answering the questions patients need answered when evaluating these treatments in the future: everything from titration to co-conditions to co-medications to the arbitrary research protocol decisions around the timing of treatment or the timing of the course of treatment. A lot of this could be better answered by data re-use beyond the initial trial, as well as additional data collection alongside the primary and secondary endpoints.

All of this is to say…you may not agree with me. I support that! I’d love to know (constructively; as I said, no pitchforks!) what you disagree about, what you are violently nodding your head in agreement about, or what nuances or big-picture things are being overlooked. And I have an invitation for you: CTTI – the Clinical Trials Transformation Initiative – is hosting a 2-hour summit on April 28 (2026) for patients, caregivers, and patient advocates, designed to facilitate discussion around the range of perspectives on these topics as they apply to clinical trials. Specifically: 1) the use of AI in clinical trials and 2) perspectives on data re-use beyond the initial trial that data is collected for.

If you have strong opinions, this is a great place for you to share them and hear from others who may have varying or similar perspectives. I am expecting to learn a lot! This event is not designed around creating consensus, because we know there are varying perspectives on these topics – so especially if you disagree or have a nuanced take, your voice is needed! (I may be in the minority, for example, with my thoughts above.)

The data we leave behind in clinical trials, what it costs us, and why we should talk about it – a blog by Dana M. Lewis on DIYPS.org

Participation is focused (because these topics are scoped to clinical trials) on anyone who has ever been part of a clinical trial, is currently navigating one, made the decision to leave a trial early, or even those who looked into participating but ultimately decided it wasn’t the right fit.

You can register for the summit here (April 28, 2026, 10am–noon PT / 1–3pm ET). If you can’t attend, feel free to drop a comment here with your thoughts, and I’ll make sure they are shared and included in the notes for the event, which will also be shared out afterward, aggregating the perspectives we hear.