Want signal? We need more noise (looking at the quiet bottleneck)

We need more signal, which means we want more noise. A lot of current scientific infrastructure is designed to minimize messiness: define a narrow question, collect the minimum data required to answer it, standardize the dataset, exclude complicating variables, finish the analysis, publish the result. That approach is understandable. It is also one reason we repeatedly fail the patients who most need evidence: people with unusual responses, multiple conditions, atypical phenotypes, rare diseases, or combinations of features that do not fit cleanly into existing bins. Our systems today are structurally designed to discard or never capture the kinds of heterogeneous, partial, contextual, or longitudinal data that could eventually make critical insights available to us.

My hypothesis is that one important scientific bottleneck is not a lack of intelligence, or even a lack of data in the abstract, but a lack of infrastructure for accepting, preserving, and reusing the kinds of data that fall outside formal trial endpoints and standard clinical workflows. I am proposing a solution: a shared platform that accepts optional, heterogeneous, participant- and clinician-contributed data and uses it to generate clinically useful subgroup hypotheses that standard trial and registry structures fail to produce.

Here are the quiet bottlenecks in the current system that I run into and that this platform would address:

  1. Atypical, comorbid, and small-population patients are systematically left without usable evidence because current research systems are optimized to exclude or discard heterogeneous data rather than preserve it. (Clinical trials are usually designed around narrow endpoint collection, fixed inclusion and exclusion criteria, and standardized datasets that are tractable for a specific immediate question. Patients are often told there is “no evidence,” when what is really missing is a system that could have captured and preserved evidence relevant to atypical or edge-case presentations and to future translation.)
  2. Useful data is routinely discarded because no existing structure is responsible for capturing it. (Clinical trials generally do not want to collect data outside primary and secondary endpoints because of cleaning, storage, analysis, and governance costs. Clinical care systems do not want to ingest large volumes of patient-generated data that clinicians are not expected to review. Disease registries are usually narrow and disease-specific. Journals are poor infrastructure for surfacing data. There is no home for this data.)
  3. AI and humans are constrained by the same missingness problem: the knowledge base is biased and uneven because we do not capture enough of the right kinds of real-world data. (LLMs and other AI systems are often criticized for bias, holes, and nonrepresentative training data. But this overlooks that clinicians and researchers are also reasoning from the same incomplete literature, selective trial populations, under-collected real-world evidence, and publication-filtered case reports. The bottleneck is upstream of both.) AI now makes it possible to overcome these constraints, but only if the data is collected and made available in the first place.
  4. Current systems privilege uniform completeness over partial but valuable contribution, which causes preventable signal loss. (If not everyone has or is willing to provide the data, it’s not collected. Subsets of participants who are willing and able to contribute additional data, such as wearables, CGM, meal logs, phone sensor streams, or symptom data, are prevented from doing so.)

Why current structures produce these bottlenecks

No single existing institution is responsible for optional data intake, long-term stewardship, and broad downstream access, and each structure prioritizes different data capture, with little harmonization between research and clinical care settings.

Clinical trials are funded, regulated, and staffed to answer bounded questions. Trial incentives reward endpoint completion, analyzable datasets, and publication-ready results. Anything beyond the core protocol creates additional costs: more data cleaning, more storage, more governance, and more analytic work that may not directly serve the trial’s primary purpose. Under those conditions, narrowness is rational, but it is also systematically lossy.

Clinical care systems have a different but related design problem. Electronic health records are built for documentation, care coordination, and billing, not for capturing participant-generated, future-use data that a clinician does not need for real-time decision making. This is the infrastructure gap: patients may have useful longitudinal data, but there is often no legitimate pathway to store it in a way that supports future analysis.

Disease registries and advocacy-group datasets help in some cases, but they are typically narrow in scope, tied to a disease silo, and constrained by the same pressures toward standardization and limited data streams. Publications are also poorly matched to this problem, because they are optimized for polished outputs, not for preserving heterogeneous data. Much of this infrastructure was shaped by earlier constraints, when data capture, storage, and analysis were far more expensive.

Testable hypotheses that, if supported, would address these challenges:

  1. Allowing partial, nonuniform data donation from participants will outperform “uniform minimum dataset only” models for early discovery in rare, atypical, and comorbid populations.
  2. Structured case narratives combined with quantitative data will enable more useful discovery than quantitative data alone for poorly characterized conditions, because the narrative context helps identify mechanistic or phenotypic patterns that are otherwise lost.
  3. Modern tools (e.g., LLMs) make it feasible to analyze and normalize real-world participant-contributed data at a scale and cost that was previously impractical. (A minimal sketch of what such normalization could look like follows this list.)
  4. For some under-characterized conditions, a heterogeneous real-world repository can identify candidate biomarkers, phenotypic subgroups, or prospective-study designs and lead to a shorter timeline for protocol development and funding of eventual trials, compared to the status quo.
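
To make hypothesis 3 concrete, here is a minimal, illustrative Python sketch of what LLM-assisted normalization could look like: heterogeneous free-text log entries mapped into a shared schema while the original text is always preserved. The schema fields and the `call_llm` placeholder are assumptions for illustration, not part of any existing system or proposed standard.

```python
# Illustrative sketch only: one way participant-contributed free-text entries
# could be normalized into a shared schema with LLM assistance (hypothesis 3).
# call_llm is a placeholder for whatever model/API a real platform would use;
# the schema fields are illustrative assumptions, not a proposed standard.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedEntry:
    timestamp: Optional[str]   # ISO 8601 if recoverable from the raw text
    category: str              # e.g. "meal", "symptom", "glucose", "other"
    value: Optional[float]     # numeric value when one exists (e.g. grams of fat)
    unit: Optional[str]        # unit for the numeric value, if any
    free_text: str             # original entry, always preserved

PROMPT_TEMPLATE = """Convert this participant log entry into JSON with keys
timestamp, category, value, unit, free_text. Use null when a field is unknown.
Entry: {entry}"""

def call_llm(prompt: str) -> str:
    # Placeholder: wire this up to whichever model or API you actually use.
    raise NotImplementedError("no model configured in this sketch")

def normalize_entry(raw_entry: str) -> NormalizedEntry:
    """Map one heterogeneous log line into the shared schema, keeping the original text."""
    response = call_llm(PROMPT_TEMPLATE.format(entry=raw_entry))
    fields = json.loads(response)
    return NormalizedEntry(
        timestamp=fields.get("timestamp"),
        category=fields.get("category", "other"),
        value=fields.get("value"),
        unit=fields.get("unit"),
        free_text=raw_entry,
    )
```

The design choice worth noting is that the raw entry is stored alongside the normalized fields, so imperfect normalization never destroys the underlying contribution.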

Here are a few example experiments we can use to validate these hypotheses:

  1. Use heterogeneous real-world data to test whether a shared repository can generate an earlier-detection hypothesis that existing structures are poorly positioned to generate. For example, exocrine pancreatic insufficiency (EPI): EPI is common but underdiagnosed, and its current diagnostic pathway is unpleasant (a stool test) and imperfect. As digestion becomes less effective, people often change what and how they eat before anyone labels it as a problem: meal sizes may shrink, fat and protein consumption may drift downward, symptom patterns develop, and glucose variability may shift in parallel, especially in people with diabetes or others who happen to have CGM data. A conventional clinical trial would rarely collect all of these data together (GI symptoms, meal logs, and CGM data). We have shown that a patient-developed symptom survey can effectively distinguish between EPI and non-EPI gastrointestinal symptom patterns in the general population, but this relies on people knowing they have a problem and filling out the survey to assess their symptoms. By analyzing CGM and meal logging data together, we may be able to detect EPI or other digestive problems noninvasively and much earlier. If successful, the output would be a concrete, testable hypothesis for a future prospective study: for example, that a specific combination of changing meal patterns and glucose variability could serve as an early noninvasive screening signal for EPI. (A rough sketch of this kind of combined feature extraction follows this list.)
  2. A second, smaller experiment would test the same infrastructure in a very different setting: under-characterized autoimmune or neuromuscular overlap syndromes. Here the problem is not underdiagnosis at scale, but invisibility in small-n edge cases. Patients with unusual muscle-related symptoms, normal or ambiguous standard workups, and backgrounds that include autoimmune disease often remain isolated within separate clinics and disease silos. One person may carry a type 1 diabetes diagnosis, another Sjögren’s, another a different autoimmune history, and yet their new symptoms may share an overlooked mechanism. In current systems, these cases rarely become legible as a group because the relevant data are scattered across chart notes, patient stories, normal imaging, lab panels, and passive longitudinal data that no one is collecting or comparing systematically. This tests the same core hypothesis under tougher conditions: smaller numbers, less uniform data, less obvious endpoints, and a heavier dependence on narrative context. If the EPI case shows that optional heterogeneous data can support tractable hypothesis generation in an underdiagnosed condition, the neuromuscular case would show why the same infrastructure could matter even more in sparse-data, high-ambiguity situations, where it is even harder to capture the data needed to generate evidence for, and funding of, subsequent trials.
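
As a concrete illustration of the EPI example above, here is a minimal Python sketch of the kind of combined feature extraction such a repository could support: rolling glucose variability from CGM data alongside rolling fat and protein intake from meal logs, joined on a daily grid for downstream modeling. The column names, the 14-day window, and the features themselves are assumptions for illustration; nothing here is a validated screening algorithm.

```python
# Illustrative sketch only: combining CGM and meal-log streams into candidate
# early-detection features for EPI. Column names (timestamp, glucose_mg_dl,
# fat_g, protein_g) and the 14-day window are assumptions, not a validated method.
import pandas as pd

def glucose_variability(cgm: pd.DataFrame, window: str = "14D") -> pd.Series:
    """Rolling coefficient of variation of CGM glucose readings."""
    g = (cgm.assign(timestamp=pd.to_datetime(cgm["timestamp"]))
            .set_index("timestamp")["glucose_mg_dl"]
            .sort_index())
    rolling = g.rolling(window)
    return rolling.std() / rolling.mean()

def meal_macro_trend(meals: pd.DataFrame, window: str = "14D") -> pd.DataFrame:
    """Rolling mean of fat and protein per logged meal; downward drift is the signal of interest."""
    m = (meals.assign(timestamp=pd.to_datetime(meals["timestamp"]))
              .set_index("timestamp")[["fat_g", "protein_g"]]
              .sort_index())
    return m.rolling(window).mean()

def candidate_screening_features(cgm: pd.DataFrame, meals: pd.DataFrame) -> pd.DataFrame:
    """Join the two feature streams on a daily grid for downstream modeling."""
    features = pd.DataFrame({
        "glucose_cv": glucose_variability(cgm).resample("1D").last(),
    })
    macros = meal_macro_trend(meals).resample("1D").last()
    return features.join(macros, how="outer")
```

The point of the sketch is not any particular feature, but that neither stream alone carries the signal: the hypothesis is that the combination of meal-pattern drift and glucose variability, which no single existing system collects together, is what becomes testable once the data has a home.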

What we might learn if these fail

Even if these experiments fail, we will learn where the barrier actually lies: whether it is still a lack of infrastructure or a lack of the right people (or AI) leveraging the data; whether partial, nonuniform data donation is operationally feasible; and whether the limiting factors are data availability, harmonization, governance, or analytic quality. We can also determine whether certain diseases or use cases (e.g., developing novel diagnostics versus assessing medication responses) are better suited than others to this type of platform.

Why now?

Passive and participant-generated data collection is easier than ever. Wearables, phone sensing, CGM, meal logging apps, symptom trackers, and similar tools are now significantly more common, and technology makes it easier than ever to build custom apps that track n=1 or study-specific data. Storage is cheaper. Technology improvements, most notably AI tools, have made it more tractable both to collect this kind of data and for researchers to analyze it. The remaining bottleneck is capturing, storing, and making the data available, and it is less a technological bottleneck than one of funding, governance, and so on. This is addressable now that the capture and analysis barriers have been lowered!

I don’t think the question we should be asking is whether every piece of heterogeneous data will be useful, but whether we can afford to keep doing what we are doing: throwing away data and still expecting discovery for the populations our current systems and infrastructure already quietly fail.

Note: I am submitting this post to the Astera essay contest, which you can read about here. You should write up your ideas about the bottlenecks you see and submit as well! I also wrote an additional piece with more details and examples, which you can read here. 
