
The Literature Is a Debate, Not a Record

How to read healthcare AI like the argument it actually is


By the time the JAMA Internal Medicine paper appeared on a Monday morning in June of 2021, the Epic Sepsis Model had been running quietly in hundreds of American hospitals for years. It calculated a score every fifteen minutes for every patient admitted to a participating system, watching for the chemical and clinical signatures that precede a deadly cascade of organ failure. By the CDC’s current estimates, sepsis is responsible for at least 350,000 American adult deaths each year and roughly 1.7 million adult hospitalizations. Catching it early, before the cascade locks in, is one of the few interventions in critical care where minutes matter, where the right alert at the right time can keep a patient alive. The promise of the Epic model, built by the leading electronic health records vendor among large American health systems, was the promise of healthcare AI itself: a quiet machine, watching what humans cannot watch, catching what humans would miss.

The paper, by a team at the University of Michigan led by Andrew Wong and Karandeep Singh, told a different story. Looking at 38,455 hospitalizations across 27,697 patients over ten months at Michigan Medicine, the researchers found that the model, at the threshold Epic itself recommended, identified only thirty-three percent of the patients who went on to develop sepsis. It missed two thirds. It generated alerts on roughly eighteen percent of all hospitalized patients, the vast majority of whom never developed the condition. Of the patients with sepsis whose clinicians had not already caught the deterioration, the model flagged only seven percent. When the model did fire correctly, it did so a median of two and a half hours before sepsis onset, but subsequent reporting by STAT News would surface why this lead time was less impressive than it sounded: one of the model’s predictive variables was the administration of antibiotics, a thing clinicians order when they have already begun to suspect sepsis. The model was, in a sense, detecting the doctor’s suspicion. By the time the alert arrived, the bedside had already moved.

Epic’s claimed area under the receiver operating characteristic curve for the model, drawn from internal testing, ran between 0.76 and 0.83. The Michigan team’s measured value was 0.63. The difference is the difference between a clinically useful prediction and something only marginally better than a coin flip.
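To see how numbers like these combine at the bedside, here is a minimal sketch in Python. The cohort below is hypothetical, not the Wong team’s data; the counts are placeholders chosen only to echo the proportions reported above.

```python
# A minimal sketch, with made-up counts, of how the headline numbers above
# interact. The cohort is hypothetical -- it is not the Wong et al. data --
# but the proportions are chosen to echo the published figures: roughly one
# third of sepsis cases flagged, alerts on roughly 18 percent of admissions.

def alert_metrics(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float, float]:
    """Return (sensitivity, alert rate, positive predictive value)."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)      # share of true sepsis cases flagged
    alert_rate = (tp + fp) / total    # share of all admissions that alert
    ppv = tp / (tp + fp)              # share of alerts that are real
    return sensitivity, alert_rate, ppv

# Hypothetical cohort: 10,000 admissions, 700 of whom develop sepsis.
sens, alerts, ppv = alert_metrics(tp=231, fn=469, fp=1_569, tn=7_731)
print(f"sensitivity {sens:.0%}, alert rate {alerts:.0%}, PPV {ppv:.0%}")
# -> sensitivity 33%, alert rate 18%, PPV 13%
```

The third number is the one the bedside feels: at these illustrative proportions, roughly seven of every eight alerts fire on a patient who will never develop sepsis, which is the arithmetic behind the alert fatigue described below.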

None of this was news to clinicians who had been using the tool. Alert fatigue, the slow numbness that descends on bedside staff who see too many false alarms, had already set in at the hospitals where the model fired. What was new was that someone had finally written down what was happening, in a peer reviewed journal, with the data and the math attached.

How had a model with this profile come to be deployed in hundreds of American hospitals?

This is the question Healthcare Discovery AI wants to teach its readers how to answer. Not for Epic specifically, and not for sepsis prediction specifically, though both will recur in our pages, but for the wider field that calls itself healthcare AI and increasingly asks the public to defer to its claims. The question matters now because the field is growing faster than the apparatus that evaluates it. The FDA’s running list of authorized AI and machine learning enabled medical devices crossed a thousand entries in early 2025 and has continued to climb. A startup ecosystem of several hundred companies has raised significant venture capital on the strength of clinical validation that, in many cases, has not yet meaningfully occurred. Behind them sits a longevity science marketplace that borrows the language of clinical trials without always borrowing the structure of them, and an influencer economy that translates partial findings into definitive recommendations at the speed of a video edit.

The reader who wants to navigate this landscape with any clarity needs a different mental model of what a published paper, a press release, or a peer reviewed claim actually is. The dominant model, the one taught in school and reinforced by every science journalism convention, treats the published literature as a record of facts. A paper appears in a respectable journal. It has been peer reviewed. Therefore the thing it says has been established. Readers move on.

This is, to put it plainly, not how science actually works. It is not how it worked in the case of the Epic Sepsis Model, where the internal performance claim and the deployed reality were separated by years and by the apparent indifference of the publishing system to running the check. It is not how it works in longevity research, where positive trials are sought eagerly by high impact journals while negative trials more often vanish into the file drawer. It is not how it works in healthcare AI, where the gap between retrospective validation on training data and prospective performance in deployment is the gap most likely to determine whether the product harms patients.

There is a different way to read it, one this publication takes as its starting premise. The scientific literature, viewed honestly, is not a record of facts. It is a record of a debate under varying incentive structures. Read it that way, and the picture changes.

Three claims in one sentence

Three claims are doing work in that framing, and each one rewards close attention.

The first is that the literature is a debate. This sounds modest, but it overturns the way most readers approach a scientific paper. The paper is not the end of an inquiry but a turn in a long argument. The authors are making a case. Reviewers, the small handful who agreed to read the manuscript before publication, have raised objections; some of those objections were addressed, others were waved off, and the final published version reflects a negotiation whose details the reader almost never sees. Other research groups, working on adjacent questions, may already have reached different conclusions and be preparing their own papers, which will land in different journals on different timelines. The literature, taken as a whole, is the visible trace of an argument that is still in progress. The publication of any single paper is closer to a court filing than to a verdict.

The second is that the debate happens under incentive structures. Researchers do not pursue questions at random. They pursue questions for which funding is available, on timelines that align with their promotion clocks, using methods that pattern match to what other researchers in their subfield consider legitimate. Journals do not publish papers at random. They publish papers that will be cited, that will not later have to be retracted, and, increasingly, that fit the editorial direction they have chosen. Sponsors, when they fund clinical trials, often write the protocols, control the data, and in many therapeutic areas retain the right to delay or shape publication. None of these actors are necessarily acting in bad faith. They are responding to the incentives in front of them, and the published record reflects the sum of those responses, not the sum of the underlying truths.

The third claim is the one that gives the framing its bite. The incentive structures vary. They are not the same in basic biology and in pharmacoeconomics, not the same in academic AI papers and in industry sponsored validation studies, not the same in the New England Journal of Medicine and on a preprint server. A reader who treats all published claims as equally fact-shaped will be systematically misled, not because the system is corrupt, but because the system was never designed to deliver facts. It was designed to surface arguments. The reader’s job is to figure out which argument is being made, under which incentive structure, with which evidence held back, and against which counterarguments the authors have positioned themselves.

This is, in our view, the only realistic stance a serious reader of healthcare AI can take. The alternative, accepting published claims at face value because the journal name is respectable and the authors have credentials, has now been shown to fail, repeatedly, in cases where the cost of failure is measured in patients. The Epic Sepsis Model is one such case. The growing class of documented healthcare AI tools whose retrospective validation did not survive prospective deployment is another. The pattern, common in longevity science, of dramatic early effects that attenuate as larger and more rigorous trials run is a third. The pattern is consistent enough that it has become predictive: when a healthcare AI claim is most aggressively marketed, that is the moment to ask hardest what the argument actually is.

This is what we mean when we say the literature is a debate. We do not mean the science is corrupt. We mean the science is unfinished, the way arguments are unfinished, and that reading it as a record of finished facts is the single most common mistake the public and the press make about how knowledge in this field is actually produced.


Four structural reasons the literature behaves this way

Publication bias and the literature that does not exist

The most studied phenomenon in this whole picture has a name so familiar that readers usually nod past it: publication bias. The basic finding is that positive trials get published more often, faster, and in better journals than negative trials. The unpublished trials are not missing because no one ran them. They were run, they produced results, and those results were quietly absorbed into corporate decision making, regulatory filings, or academic file drawers without making it into the open literature. Estimates of the magnitude vary by field and cohort, but in clinical medicine, the share of completed trials that have not published their primary results within several years of completion has been measured at roughly a third in large samples of NIH funded work, and higher in some industry funded cohorts.

The implication for healthcare AI is direct. The validation studies we read are not a random sample of all validation work that has been done. They are a curated subset, biased toward results favorable to the developer, to the institution hosting the deployment, or to a journal’s appetite for positive findings. When the only published external validation of an AI model shows that it works well, the responsible reading is not “the model works well.” It is “this is the published version. What else has been run, and where did those results land?” A reader who does not ask this question will, by mathematical necessity, end up with an inflated picture of healthcare AI’s overall performance.
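The phrase “by mathematical necessity” is easy to check. Here is a toy simulation, with assumed numbers (the true AUROC, the noise, and the publication cutoff are all placeholders chosen for illustration), of how a literature that prints only the flattering validations ends up overstating a model’s real performance.

```python
# A toy simulation, under assumed numbers, of the selection effect described
# above. Suppose many groups externally validate the same model, whose true
# AUROC is 0.70, and each study's estimate varies with ordinary sampling
# noise. If only the more flattering estimates reach print, the published
# average overstates the truth, with no individual study being dishonest.
import random

random.seed(0)
TRUE_AUROC = 0.70
NOISE_SD = 0.05            # assumed study-to-study sampling variation
PUBLICATION_CUTOFF = 0.72  # assumed: only estimates above this get written up

estimates = [random.gauss(TRUE_AUROC, NOISE_SD) for _ in range(1_000)]
published = [e for e in estimates if e > PUBLICATION_CUTOFF]

print(f"mean of all validations run:   {sum(estimates) / len(estimates):.3f}")
print(f"mean of published validations: {sum(published) / len(published):.3f}")
# The second number lands reliably higher than the first.
```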

Within longevity science, the bias is amplified by an additional layer: the volume of cell line and animal studies that pattern match to compelling human findings but were never followed by the human trials that would actually test the claim. A compound that extends life in C. elegans, performs well in mice, and shows interesting biomarker effects in a small human pilot has not been established as a longevity intervention. It has been moved to the next stage of an argument. If the argument were going well, there would be larger trials. The reader’s question is always the same: where is the next study, and if it is not there, why not?

Funding induced effect sizes and the disclosure problem

It has been documented, repeatedly and across multiple therapeutic areas, that industry sponsored trials tend to produce results more favorable to the sponsor’s product than non-industry sponsored trials of the same intervention. The most rigorous pooled analysis, a 2017 Cochrane systematic review by Lundh and colleagues, found that industry sponsored studies were meaningfully more likely than non-industry sponsored studies to report favorable efficacy results, with the relative risk in the neighborhood of 1.3, and substantially more likely to report favorable safety conclusions, with the relative risk closer to 1.9. Subgroups within that analysis showed considerably larger differences.
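For readers unused to relative risks, a short worked example helps. The 70 percent baseline below is an assumed placeholder, not a figure from the Cochrane review; only the ratio comes from the text above.

```python
# A worked example, with an assumed baseline, of what a relative risk of
# about 1.3 means in practice. The 70 percent figure is a placeholder.
baseline_favorable = 0.70   # assumed favorable-efficacy rate, non-industry trials
relative_risk = 1.3         # ratio discussed above
industry_favorable = baseline_favorable * relative_risk

print(f"non-industry trials favorable: {baseline_favorable:.0%}")  # 70%
print(f"industry trials favorable:     {industry_favorable:.0%}")  # 91%
```

A ratio that sounds modest in the abstract translates, at plausible base rates, into a visibly different literature.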

The mechanisms behind this finding are not, mostly, fraud. They are subtler and more structural. Sponsor funded trials tend to choose comparators that maximize the likelihood of a positive showing for the new intervention. They run on patient populations selected to respond well. They are powered to detect effects on endpoints the sponsor cares about, and not always powered to detect harms the sponsor would rather not surface. They are sometimes terminated early when results look good, locking in a favorable estimate before regression to the mean has had time to operate. None of these moves is unethical in isolation. The sum of them produces a literature in which sponsor favorable findings are systematically over represented.

For the careful reader of healthcare AI, this is the disclosure section’s actual job. Funding statements are not bureaucratic boilerplate. They are the most important paragraph in many papers. The reader who reads them as such, and who notices when a validation study of an AI product is funded by the company that makes the product, has done most of the work of calibration before reading a single result. This does not mean industry sponsored work is wrong. It means industry sponsored work is positioned, and positioning has predictable effects on the results we see.

The harder problem is what is not disclosed. Equity stakes held by senior authors. Advisory relationships with the company under study. Co-authorships with consultants whose role in the work is opaque. These do not always appear in the disclosure section, and when they do, they appear in language so technical that most readers skip past it. A practical move: when reading a high stakes claim, look up the authors’ recent disclosures across multiple papers. Patterns appear quickly. The literature that emerges from a tight web of overlapping commercial relationships is not the same literature that emerges from independent groups.

What peer review actually does, and what it does not

Peer review is the most misunderstood institution in science. The lay model is that two or three expert reviewers carefully read a manuscript, check the methods, verify the analysis, and certify whether the conclusions follow from the evidence. The reality is closer to this: two or three reviewers, usually working unpaid in evenings stolen from their own research, read the manuscript over a couple of days, raise a handful of objections that the authors then negotiate with the editor, and recommend publication, revision, or rejection. They do not, in general, rerun the analysis. They do not, in general, see the raw data. They are not, in general, in a position to detect sophisticated p-hacking or selective reporting. They catch obvious errors and major methodological problems. They do not catch most of what readers think they catch.

This is not a criticism of peer reviewers, who are doing their best inside a system that does not adequately compensate them. It is a description of what the institution can and cannot deliver. The journal name on a paper tells you that two or three time pressed experts thought the work was worth publishing. It does not tell you the result will replicate. It does not tell you the analysis is reproducible. It does not tell you the conclusions survive contact with a new dataset. Those questions require additional work, often from groups outside the original team, and the work usually happens, when it happens at all, years after the original paper has shaped the field’s conventional wisdom.

The Epic Sepsis Model is a case in which the additional work happened. The Wong team’s external validation is the kind of paper that, in a healthier ecosystem, would happen routinely and before deployment. The fact that it was newsworthy is, itself, a finding about the system. We treat peer reviewed publication as if it were certification. It is closer to provisional licensing, with the real road test happening, or not happening, somewhere downstream of the journal.

The amplification layer

By the time a published finding reaches the reader, it has typically been amplified through three or four layers of summary. The original paper, with its hedges and caveats, becomes a press release written by the institution’s communications office. The press release, with its softer hedges, becomes a news article. The news article, often with the hedges removed entirely, becomes a social media post. The social media post becomes a podcast soundbite, an influencer recommendation, a Reddit thread, a YouTube short. At each stage, the language hardens. The hedges fall away. The numerical estimates become categorical claims. “May be associated with” becomes “causes.” “In a small pilot study” becomes “shown to.” By the time the claim reaches the reader, the original argument has been compressed into something the original authors might not recognize.

This is not, in most cases, anyone’s fault. It is the result of a chain in which each link is optimizing for clarity, attention, or virality, and the cumulative effect is that the public version of a finding is reliably stronger than the underlying evidence supports. The careful reader’s job, when encountering a healthcare AI or longevity claim in any popular venue, is to walk back up that chain. What was the news article? What was the press release? What did the original paper actually say? What was the sample size, the study design, the comparator, the effect size, the confidence interval? Most of the work of reading healthcare science responsibly is the work of resisting the compression that the amplification layer naturally produces.

This is also where the influencer economy does the most damage. A claim that has been laundered through enough intermediate stages no longer carries any trace of the uncertainty that the original investigators almost certainly built into their conclusions. The recommendation arrives as advice. The reader, having no easy way to walk back up the chain, accepts it or ignores it on something close to vibes. The result is a public discourse about healthcare AI and longevity that bears only an occasional resemblance to the underlying scientific debate. Closing that gap is, in our view, the central editorial project of any serious publication in this field.

How this publication reads

This is the stance Healthcare Discovery AI takes, and it shapes everything that follows from it. We treat published claims as turns in an argument, not as facts. We notice the incentive structure behind each claim. We try to find the missing pieces: the unpublished trials, the unfavorable subgroups, the disclosures the press release omitted, the prospective deployment data that would actually test the retrospective validation claim. When we cannot find them, we say so. Absence of evidence is itself a finding, and treating it that way is one of the few moves that protects readers from getting burned by a field that has not yet learned to police itself.

What this looks like in practice, across the pieces we publish, is a small number of disciplines we hold ourselves to. The first is that we read primary sources, not press releases, and we tell readers when we cannot get to a primary source. The second is that we report effect sizes in absolute terms whenever the data allows, because relative risk reductions are designed to sound larger than they are. The third is that we name the funding structure of every clinical claim we discuss, not as a gotcha but as context. The fourth is that we distinguish, every time, between retrospective performance on training data and prospective performance in deployment, because in healthcare AI those are the two numbers that diverge most consistently. The fifth is that we are willing to write “we do not yet know,” and to say so as often as the evidence requires, because the alternative, manufacturing certainty where none exists, is the move that turns a publication into a marketing channel for whichever companies push hardest.
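The second of those disciplines, reporting absolute rather than relative effect sizes, is the easiest to show with numbers. The sketch below uses hypothetical event rates, not figures from any study discussed in this piece.

```python
# Why absolute effect sizes matter: a worked example with hypothetical event
# rates (not drawn from any trial discussed above).
control_event_rate = 0.02   # assume 2% of untreated patients have the outcome
treated_event_rate = 0.01   # assume 1% of treated patients have the outcome

relative_risk_reduction = 1 - treated_event_rate / control_event_rate
absolute_risk_reduction = control_event_rate - treated_event_rate
number_needed_to_treat = 1 / absolute_risk_reduction

print(f"relative risk reduction: {relative_risk_reduction:.0%}")  # 50%
print(f"absolute risk reduction: {absolute_risk_reduction:.1%}")  # 1.0%
print(f"number needed to treat:  {number_needed_to_treat:.0f}")   # 100
```

“Cuts risk in half” and “treat one hundred patients to prevent one event” describe the same trial; only one of them sounds like a headline.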

We are aware that this stance produces fewer headlines. Articles that say “the evidence is mixed and the prospective data is missing” do not travel as well as articles that say “groundbreaking AI catches sepsis better than doctors.” We accept that tradeoff. The readers we want, the patients trying to make real decisions, the clinicians evaluating tools that will run in their hospitals, the investors writing checks against published claims, the builders who need to know whether the field they are joining is solid ground or sand, are better served by an honest map of the debate than by a confident retelling of one side of it.

We are also aware that the stance has limits. We are journalists, not regulators. We cannot run our own clinical trials. We cannot subpoena unpublished data. We work from the published record, the press releases, the company disclosures, the analyst calls, the public datasets, the FDA filings, the patents, the lawsuits, the conference presentations, and the rare and valuable conversations with researchers who are willing to talk, on the record or on background. We assemble what we can find, structure the disagreement we observe, and report what we see. When the evidence changes, we change with it. When we get something wrong, we say so in public, in a correction the reader can find. This is, in our view, the only durable basis for a publication in this field.

The Wong paper, in this frame, is the kind of work this publication wants to bring forward, amplify, and translate. It is also a model for how readers might think about all healthcare AI claims they encounter. A model is deployed. Performance is claimed. A press release circulates. An independent group runs the check. A new paper appears. The conventional wisdom shifts. The deployed product is overhauled. This is what the system looks like when it works. Most of the time, the check does not happen, or it happens slowly, and the gap between the claimed performance and the deployed reality persists for years, and the cost of the gap is borne by people who never read the original paper.

The work ahead

The Epic Sepsis Model was, after Wong, gradually overhauled. By the autumn of 2022, STAT News reporting documented that Epic was recommending hospitals retrain the model on their own patient data before clinical deployment, a substantial departure from the earlier one-size-fits-all approach. A later analysis in JAMA Internal Medicine, spanning more than 800,000 patient encounters across nine hospitals, confirmed that the model’s performance varied significantly by site. The literature, eventually, did its work. The model that had been a confident product in 2018 was, by 2023, a tool with stated limitations and a calibration protocol attached.

But the years between the deployment and the correction were years in which the model fired alerts on patients who never developed sepsis, and missed alerts on patients who did. The interval between a claim entering the literature and the literature catching up with the claim is the interval in which the cost of the system’s slowness gets paid. Closing that interval, even by a small amount, is what serious healthcare journalism is for.

We do not promise our readers certainty. We promise the discipline of looking at each claim as what it actually is: a move in a longer argument, made by interested parties, with evidence held back, against counterarguments not yet visible, on a timeline whose end we cannot see. The reader who absorbs that stance is harder to mislead. The reader who absorbs that stance is also, in our experience, a more interesting reader, more curious about the science, more patient with uncertainty, and harder to sell on the next confident claim that has not yet survived contact with the world.

This is the readership Healthcare Discovery AI exists to serve. The pieces that follow will assume it.


Sources and further reading

Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine. 2021;181(8):1065-1070. doi:10.1001/jamainternmed.2021.2626

Habib AR, Lin AL, Grant RW. The Epic Sepsis Model falls short: the importance of external validation. JAMA Internal Medicine. 2021;181(8):1040-1041.

Ross JS, Tse T, Zarin DA, et al. Publication of NIH funded trials registered in ClinicalTrials.gov: cross sectional analysis. BMJ. 2012;344:d7292.

Lundh A, Lexchin J, Mintzes B, et al. Industry sponsorship and research outcome. Cochrane Database of Systematic Reviews. 2017;2:MR000033.

Centers for Disease Control and Prevention. Sepsis: data and reports. Current public estimates of adult sepsis hospitalizations and deaths.

US Food and Drug Administration. Artificial intelligence and machine learning enabled medical devices, list. Updated 2025.

Ross C, Herman B. STAT News reporting on the Epic Sepsis Model, 2021 through 2022, including coverage of the model’s algorithmic inputs and subsequent overhaul.
