
The Reproducibility Crisis Healthcare AI Refuses to Talk About

A growing body of evidence suggests that a meaningful share of healthcare AI’s published claims would not survive independent reproduction. The field has not yet decided to confront this.



On New Year’s Day of 2020, Nature published a paper by Scott Mayer McKinney and twenty nine coauthors at Google Health, the United Kingdom’s National Health Service, and several academic medical centers. The paper described an artificial intelligence system that, the authors reported, could detect breast cancer on screening mammograms with accuracy that exceeded the performance of board-certified radiologists across two independent datasets, one drawn from the United States and one from the United Kingdom. The reported gains were substantial: on the United States dataset, the system reduced false positives by roughly five and a half percentage points and false negatives by roughly nine, with smaller absolute reductions of roughly one and three points on the United Kingdom dataset. The team also reported that, when used in a simulated double reading workflow, the system could replace the second human reader entirely without loss of accuracy, with the operational implication that the National Health Service could free up the time of one of the two radiologists who read every screening mammogram in the United Kingdom’s program.

The paper landed at a moment of unusual receptivity. Healthcare AI was already the subject of substantial press coverage. Breast cancer screening was an established, politically sensitive, and operationally demanding part of public health infrastructure across multiple countries. Google was Google. The paper made its way into BBC News, the Canadian Broadcasting Corporation, CNBC, and most of the major outlets that covered medical research at all. It also entered the academic discourse on healthcare AI as a landmark result, cited repeatedly, used as evidence that AI was ready to assume real diagnostic responsibility in radiology, and referenced in pitch decks of healthcare AI startups across the field.

Nine months later, on October 14 of 2020, Nature published a second paper. The new paper was titled “Transparency and Reproducibility in Artificial Intelligence.” Its first author was Benjamin Haibe-Kains of the Princess Margaret Cancer Centre and the University of Toronto. Its coauthor list included Ahmed Hosny, Hugo Aerts, Casey Greene, Anshul Kundaje, Joelle Pineau, Robert Tibshirani, Trevor Hastie, and twenty additional scientists, including, near the end of the list, John P. A. Ioannidis. The paper’s argument was direct.

The McKinney work, the authors said, was beautiful in theory. In practice, it could not be reproduced. The code was not shared. The training data was not shared. The trained models were not shared. The methodological description, while substantial, was insufficient to allow another research group to rebuild the system and verify what its claimed performance would actually be on data it had not seen.

The dispute that followed, conducted in the pages of Nature over the next several months, was unusually public. McKinney and colleagues replied in a companion paper. Google Health published an Addendum with expanded supplementary methods. Both sides made reasonable points. McKinney’s group pointed out, accurately, that the underlying mammographic data could not be shared for patient privacy reasons and that the proprietary nature of some of the system’s components reflected legitimate commercial considerations. Haibe-Kains and his coauthors acknowledged these constraints and proposed specific workarounds: depositing trained model weights in controlled access repositories, sharing code under appropriate licenses, publishing detailed pseudocode where exact code could not be released, and providing access to derivative datasets that preserved the methodological essentials without exposing patient identities. The exchange did not resolve cleanly. The original paper remained, in the formal sense, unreproduced. Subsequent work by other groups in breast cancer AI proceeded, with varying degrees of methodological transparency, and the field moved on.

This piece is about what that exchange revealed and what it did not. The McKinney case is the most visible single episode of a much larger pattern in healthcare AI, one that has been documented across thousands of papers and is, by the best available evidence, getting worse rather than better as the field scales. The pattern is the reproducibility crisis, and it is the structural reason that a substantial share of published healthcare AI claims should be read with substantially more skepticism than they currently are. The reader who has internalized the prospective and retrospective distinction from this publication’s earlier work is now ready for the next layer of the problem. Even when a study appears to have been done correctly, the published version of the work may not survive contact with an independent group attempting to repeat it. Often, the field will never know whether it would, because the attempt was never made.

The wider crisis

The reproducibility crisis is not specific to healthcare AI. It has, over the last fifteen years, been documented in the empirical literature across nearly every quantitative field that has been carefully examined. The Open Science Collaboration’s 2015 reproducibility project in psychology attempted to replicate one hundred published studies and successfully reproduced the main finding in about thirty nine percent of them. The Reproducibility Project: Cancer Biology, a long term collaboration between the Center for Open Science and several research groups, attempted high profile reproductions in experimental oncology and found, in the formal 2021 reporting in eLife, that less than half of the original effects were reproduced and that effect sizes in the replicated studies were, on average, about eighty five percent smaller than the original reports. Adjacent work in economics, ecology, materials science, and computational biology has produced similar findings, with replication rates that vary by field but consistently sit far below the levels that the conventional reading of published literature would assume.

The structural causes are well understood. The publication system rewards novel positive findings over careful negative replications. Statistical significance thresholds, combined with the flexibility researchers have in defining variables, analyses, and subgroups, create what the statistician Andrew Gelman has called the garden of forking paths, in which the same dataset can yield many different positive results depending on undisclosed analytic decisions. Career incentives push toward producing papers, not toward verifying that the papers in print continue to hold up. The published literature, in any field with a serious reproducibility problem, is a curated subset of the work done, biased systematically toward the version of each project that looked positive at the moment of writing.
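The forking paths dynamic is easy to demonstrate in simulation. The sketch below is a minimal illustration with entirely synthetic data and parameters of our own choosing, not a reconstruction of any published analysis: it shows how an analyst free to test a dozen post hoc subgroups will find a “significant” effect in a large fraction of studies even when no effect exists anywhere.

```python
# A minimal simulation of the garden of forking paths, showing how
# analytic flexibility inflates false positive rates on pure-noise data.
# All parameters here are illustrative, not drawn from any cited paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 1000        # simulated studies
n_patients = 200        # patients per study
n_subgroups = 12        # candidate post hoc subgroup analyses per study

false_positives = 0
for _ in range(n_studies):
    # Outcome and a treatment indicator with no true relationship.
    outcome = rng.normal(size=n_patients)
    treated = rng.integers(0, 2, size=n_patients).astype(bool)
    # Forking paths: the analyst may test the treatment effect in any of
    # several subgroups and report whichever comparison reaches p < 0.05.
    subgroup_ids = rng.integers(0, n_subgroups, size=n_patients)
    p_values = []
    for g in range(n_subgroups):
        mask = subgroup_ids == g
        a, b = outcome[mask & treated], outcome[mask & ~treated]
        if len(a) > 1 and len(b) > 1:
            p_values.append(stats.ttest_ind(a, b).pvalue)
    if p_values and min(p_values) < 0.05:
        false_positives += 1

# With one preregistered test the rate would sit near 5%; with twelve
# forks it climbs several-fold despite there being no real effect.
print(f"Studies reporting a 'significant' effect: {false_positives / n_studies:.0%}")
```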

In machine learning specifically, the crisis has its own distinctive shape. A 2023 paper in Patterns by Sayash Kapoor and Arvind Narayanan of Princeton University, titled “Leakage and the Reproducibility Crisis in Machine Learning Based Science,” conducted a systematic review of reproducibility issues in ML-based scientific research across seventeen fields. The authors found that data leakage, the contamination of test data with information from training data or other downstream sources that artificially inflates apparent performance, had been identified in published reviews as affecting at least two hundred ninety four papers across these fields. Their taxonomy of leakage identified eight distinct mechanisms, ranging from straightforward train-test contamination to subtle forms in which preprocessing pipelines, target encoding, or temporal leakage produced overoptimistic results that did not survive correction. In one of the paper’s own case studies, on the use of ML for predicting civil war onset, the authors showed that the apparent superiority of machine learning over older statistical methods, which had been a central claim of several published papers in political science, disappeared entirely when the leakage in those papers was corrected.

The authors’ framing of the finding was measured and direct. The reproducibility crisis in ML-based science, they argued, is real, has been independently rediscovered in field after field, and has not yet provoked the kind of structural response that the depth of the problem requires. Documentation alone, in the form of checklists, linters, and best-practice guides, has not been sufficient. The deeper response will require structural changes to how papers are reviewed, what data and code are required for publication, and how the field treats claims that have not been independently reproduced.

The healthcare AI version

Healthcare AI sits at the intersection of the broader reproducibility crisis and a set of constraints that make it harder to address than the crisis in fields where data and code can be freely shared. The structural problems are well documented in the methodological literature on the field.

The first problem is code availability. Most healthcare AI papers do not publish the source code that produced their results. Some publish partial code. Some publish pseudocode. Some publish nothing at all. The reasons offered, when they are offered, include commercial sensitivity, the practical complexity of releasing production code in publishable form, and the unstated but real concern that the published code, if examined carefully, might not produce the published results. The consequence is that a significant fraction of healthcare AI claims cannot be independently reproduced even in principle, because the technical artifact that produced the claim is not available to examine.

The second problem is data availability. Healthcare data is, in many cases, legally and ethically not shareable. The Health Insurance Portability and Accountability Act and analogous regulations in other jurisdictions create real and necessary constraints on patient data movement. The constraint is genuine. It also has the practical effect that the training datasets used to produce healthcare AI claims are often not available for independent verification, and the field has not yet converged on a standard set of workarounds that would allow reproduction without compromising patient privacy. Some groups have moved toward synthetic data, derivative datasets, federated learning protocols, or controlled access through institutional review board approved channels. None of these have become the norm. Most healthcare AI papers still report results on data that no other group can examine.

The third problem is the absence of preregistered analysis plans. Outside of formal prospective clinical trials, which by their nature involve registered protocols, most healthcare AI papers do not preregister their analysis approaches. The authors describe what they did, but not what they had planned to do before they saw the data, and the gap between the two is often impossible to assess from the published paper. This is the garden of forking paths in its healthcare AI form. A team that trained twenty seven model architectures and selected the best performing one for the paper has produced a different kind of evidence than a team that preregistered the architecture and reported its single result. The published paper, in most cases, does not distinguish between these two.

The fourth problem is hyperparameter tuning. The performance of a healthcare AI model depends, sometimes substantially, on choices that have no methodological justification beyond empirical optimization: learning rates, batch sizes, regularization parameters, data augmentation strategies, training schedule details, and ensemble compositions. The published paper typically reports the chosen values without disclosing the search process that produced them. A model that achieves its reported accuracy only at a specific combination of hyperparameters discovered through extensive search of a validation set is operating, in effect, on a multiple comparison problem that the paper does not acknowledge. The reader who has internalized the prior pillar on retrospective validation will recognize this as an extension of the same family of issues. Retrospective validation looks at one of many possible test partitions, with results that may not generalize. Hyperparameter optimization explores one of many possible model configurations, with results that may not generalize either.

The fifth problem is the absence of independent reproduction as a standard part of how the field operates. In experimental physics, independent replication of major results is a routine expectation. In healthcare AI, an independent group attempting to reproduce a published result is an unusual event, and when it happens, the resulting paper often appears in a different venue than the original, with substantially less attention, and is treated as a contribution to the discussion rather than a verdict on the original work. The Haibe-Kains response to McKinney was newsworthy precisely because exchanges of that kind are rare. In a healthier ecosystem, every major healthcare AI claim would face routine independent reproduction attempts within a year of publication, and the results of those attempts would be a primary input into how the original claim was weighted in subsequent work. This is not, currently, how the field operates.

What independent reproduction shows when it happens

The cases where independent reproduction has been attempted in healthcare AI tell a consistent story. The Epic Sepsis Model, which this publication has discussed in earlier pieces, is one such case. The internal validation reported by the model’s developers produced an area under the receiver operating characteristic curve (AUC) between 0.76 and 0.83, in the range of clinically useful performance. The external validation by the University of Michigan team, published in JAMA Internal Medicine in 2021, found an AUC of 0.63, well below the original claim. This was, in effect, a reproduction failure conducted in deployment rather than in the laboratory. The methodological lesson is the same.
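For readers who want the mechanics concrete: external validation amounts to scoring a frozen model on a cohort it never saw and recomputing the AUC. The sketch below is entirely synthetic, with separation parameters chosen only so that the printed values loosely echo the internal and external figures in the Epic case; it is not the Epic model or its data.

```python
# A minimal sketch of the external validation step: a frozen model's
# risk scores are evaluated on a new cohort. All data is synthetic;
# the `signal` parameter is a stand-in for how well scores separate
# positive from negative patients at a given site.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate_cohort(n, signal):
    """Synthetic cohort: labels plus model scores whose separation
    between classes is controlled by `signal`."""
    y = rng.integers(0, 2, size=n)
    scores = rng.normal(loc=y * signal, scale=1.0)
    return y, scores

# Internal cohort: strong separation at the development site.
y_int, s_int = simulate_cohort(5000, signal=1.1)
# External cohort: the same frozen scoring, weaker separation elsewhere.
y_ext, s_ext = simulate_cohort(5000, signal=0.45)

print(f"internal AUC: {roc_auc_score(y_int, s_int):.2f}")  # roughly 0.78
print(f"external AUC: {roc_auc_score(y_ext, s_ext):.2f}")  # roughly 0.62
```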


The Zech et al. paper on pneumonia detection, also discussed in this publication’s previous work, is a more constructive case. The authors attempted to reproduce the standard claim that convolutional neural networks could be trained to detect pneumonia from chest radiographs with strong cross-institutional accuracy. The reproduction succeeded internally and failed externally, with the failure traceable to site-specific signals that the original literature had not adequately controlled for. The paper is a reproduction in the strong sense: an independent group, using independent data, with a transparent methodology, examined a claim and found that the claim’s underlying mechanism was different from what the original literature had suggested.

Outside healthcare AI, the Kapoor and Narayanan civil war prediction case is the cleanest demonstration of what reproduction can reveal. The original papers in political science had reported that machine learning methods substantially outperformed older statistical approaches at predicting civil war onset. The Princeton team examined the published code, identified leakage, corrected it, and found that the ML methods did not, in fact, outperform the older approaches. The published claim was an artifact of methodological flaws, not of substantive scientific finding. The same kind of correction has been performed across many of the two hundred ninety four papers their review identified, in some cases with similar reversals.

The reader’s question, when encountering any healthcare AI claim, follows from this evidence. Has the claim been independently reproduced? In most cases, the answer is no. The fact that the answer is no does not mean the claim is wrong. It does mean that the claim should be read as preliminary, in the same way that a single positive study in clinical medicine is read as preliminary in any sophisticated medical literature. The strength of evidence comes from accumulation across independent groups, not from the strength of any single result. Healthcare AI, in 2026, has not yet built the accumulation. Most of its published claims sit in the position of single positive studies in fields where multiple replications would be the standard before any operational deployment was contemplated.

What good reproducibility looks like

The field knows what better practice looks like. The practices are not exotic. They are simply uncommon.

A reproducible healthcare AI paper publishes its source code, ideally under an open license that allows full inspection and modification. When the production code cannot be fully released for legitimate reasons, the paper publishes detailed pseudocode and provides controlled access to the actual implementation through a credentialed mechanism that allows independent investigators to verify the work.

A reproducible healthcare AI paper provides access to the training and evaluation data, either directly when patient privacy and commercial constraints permit, or through controlled access protocols when they do not. Synthetic data that preserves the statistical and structural properties of the real data without exposing patient information has become a viable substitute in some categories of work. Federated reproduction protocols, in which an independent group runs the original code against a held out dataset they control, are an emerging alternative.

A reproducible healthcare AI paper specifies its analysis plan in advance, through a preregistration on a platform such as Open Science Framework or ClinicalTrials.gov, before the analysis is conducted. The paper reports the preregistered analysis primarily, with any exploratory analyses clearly distinguished from the preregistered ones.

A reproducible healthcare AI paper discloses the full hyperparameter search that produced the chosen model. The disclosure includes the search space, the number of configurations explored, and the validation set or sets used for selection. The paper acknowledges, where appropriate, the multiple comparison implications of the search.
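What such a disclosure might look like in practice is simple to show. The record below is a hypothetical example of our own construction, not a published standard; the field names and values are illustrative of the content a disclosure should carry.

```python
# A minimal sketch of a machine-readable hyperparameter search disclosure
# an author might publish alongside a paper. The schema is illustrative,
# not an established reporting format.
import json

search_disclosure = {
    "search_space": {
        "learning_rate": {"type": "log_uniform", "low": 1e-5, "high": 1e-2},
        "batch_size": {"type": "choice", "values": [16, 32, 64]},
        "weight_decay": {"type": "log_uniform", "low": 1e-6, "high": 1e-3},
        "augmentation": {"type": "choice", "values": ["none", "flip", "flip+rotate"]},
    },
    "search_strategy": "random search",
    "configurations_evaluated": 120,
    "selection_metric": "validation AUC",
    "validation_sets_used": ["site_A_val"],      # used for selection only
    "test_sets_used_once": ["site_B_heldout"],   # never touched during search
    "seed_policy": "3 seeds per configuration, mean reported",
}

print(json.dumps(search_disclosure, indent=2))
```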

A reproducible healthcare AI paper invites and acknowledges independent reproduction. Some groups have begun to publish their results with explicit reproduction protocols, model cards, and the documentation needed for other researchers to repeat their work. The CONSORT-AI and SPIRIT-AI guidelines, discussed in this publication’s earlier pillar on prospective validation, include reproducibility-relevant items for clinical trial reporting. The DECIDE-AI guidelines address the deployment evaluation phase. The framework Kapoor and Narayanan propose, model info sheets, would extend similar discipline to non-trial machine learning work. None of these have become the field’s default. All of them are available to authors who choose to use them.

The reader’s method

A short working method, sufficient for evaluating most healthcare AI claims, follows from the foregoing.

When you encounter a healthcare AI claim, look for the code. If the paper provides a public repository with the full implementation, the work is in the small fraction of healthcare AI publications that meet a basic reproducibility standard. If the paper provides partial code, pseudocode, or no code, the work sits in the majority of the field, where reproduction is, in practice, not possible.

Look for the data. The same logic applies. If the data is publicly available or accessible through a controlled mechanism, the claim is in a stronger position than if the data is described but not shared. The reasons for non-sharing may be entirely legitimate. The implication for the reader is the same. A claim that cannot be reproduced is, by definition, more preliminary than one that can be.

Look for the preregistration. If the paper references a preregistered analysis plan, the multiple comparison and garden-of-forking-paths concerns are substantially reduced. If the paper does not reference one, the reported results are, in effect, a single draw from an unknown distribution of possible analyses.

Look for the hyperparameter disclosure. A serious paper discusses the choices it made and the choices it considered. A weaker paper reports only the choices that were ultimately used.

Look for independent reproduction. If the claim has been independently reproduced by a non-affiliated group, the strength of the evidence is substantially higher than if it has not. If the original authors are the only group that has reported the result, the claim is preliminary regardless of the journal in which it appeared. This is the most important single signal in the field.

Look, finally, for what the authors themselves say about reproducibility. The strongest published work acknowledges its reproduction limitations explicitly and proposes mechanisms for verification. The weaker work does not mention the topic. The reader who learns to notice this distinction is operating with one of the most reliable filters in the field.
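The method condenses naturally into a checklist. The sketch below encodes it as one; the fields follow the six questions above, but the weighting and threshold are our own illustrative judgment calls, not a validated scoring instrument from the literature.

```python
# A minimal encoding of the reading method above as a checklist.
# Weights and the threshold are illustrative, not a published standard.
from dataclasses import dataclass

@dataclass
class ReproducibilitySignals:
    full_code_public: bool
    data_accessible: bool           # public or controlled-access mechanism
    preregistered: bool
    search_disclosed: bool          # hyperparameter search space and budget
    independently_reproduced: bool  # by a non-affiliated group
    limitations_acknowledged: bool

def weigh(claim: ReproducibilitySignals) -> str:
    # Independent reproduction dominates, per the discussion above.
    if claim.independently_reproduced:
        return "corroborated"
    met = sum([claim.full_code_public, claim.data_accessible,
               claim.preregistered, claim.search_disclosed,
               claim.limitations_acknowledged])
    return "reproducible in principle" if met >= 4 else "preliminary"

# Example: code and search disclosed, but no data, preregistration,
# or independent reproduction.
paper = ReproducibilitySignals(True, False, False, True, False, True)
print(weigh(paper))  # -> "preliminary"
```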

Back to the McKinney exchange

The exchange in Nature between McKinney and Haibe-Kains, in retrospect, was an unusually constructive moment in healthcare AI’s reproducibility conversation. The original authors did not, after the critique, refuse to engage. They published a reply. They published an Addendum with expanded methodological detail. They acknowledged the legitimate points the critique had made. The field, in the years since, has moved toward broader code and data sharing in breast cancer AI specifically, and several subsequent groups have built more reproducible work in that area. The conversation was not closed by either side declaring victory. It was advanced by both sides treating the question of reproducibility as serious.

The lesson is not that the McKinney work was wrong. The lesson is that, six years after the original paper, the field still cannot say whether it was right, in the strong sense that an independent group has rebuilt the system from the available documentation and verified the published performance on independent data. That is the meaning of the reproducibility crisis in healthcare AI. It is not that the claims are necessarily false. It is that the field has not yet built the verification infrastructure to know which ones are which. The reader who has internalized this point is in a position to read the literature with substantially more accurate calibration than the reader who has not.

The verification intelligence this publication exists to build depends on this kind of reading. The work of separating the healthcare AI claims that will hold up over time from the ones that will not requires asking, of every claim, what would it take to reproduce this, and has anyone done it. The field has not made the work easy. The work is, nonetheless, possible, and the reader who insists on doing it will, over time, develop a substantially better map of healthcare AI than the field’s own marketing is currently incentivized to produce.


Sources and further reading

McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94.

Haibe-Kains B, Adam GA, Hosny A, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586(7829):E14-E16.

McKinney SM, Karthikesalingam A, Tse D, et al. Reply to: Transparency and reproducibility in artificial intelligence. Nature. 2020;586(7829):E17-E18.

McKinney SM, Sieniek M, Godbole V, et al. Addendum: International evaluation of an AI system for breast cancer screening. Nature. 2020.

Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine learning based science. Patterns. 2023;4(9):100804.

Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.

Errington TM, Mathur M, Soderberg CK, et al. Investigating the replicability of preclinical cancer biology. eLife. 2021;10:e71601.

Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8):e124.

Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine. 2021;181(8):1065-1070.

Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLOS Medicine. 2018;15(11):e1002683.

Pineau J, Vincent-Lamarre P, Sinha K, et al. Improving reproducibility in machine learning research. Journal of Machine Learning Research. 2021;22.

Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26(9):1364-1374.

Gelman A, Loken E. The garden of forking paths: why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time. Working paper, Columbia University, 2013.
