Surrogate Endpoints and the Long Wait for Truth
A measurable proxy for a clinical outcome is a useful tool, until the field forgets that the proxy is not the outcome. Healthcare AI and longevity science have not yet learned this lesson.
On June 7, 2021, the United States Food and Drug Administration granted accelerated approval to aducanumab, the first new treatment for Alzheimer’s disease authorized in nearly two decades. The drug, marketed under the brand name Aduhelm and developed by Biogen in partnership with Eisai, was approved on the basis of a single fact about its biology.
In two Phase 3 clinical trials, called EMERGE and ENGAGE, aducanumab had been shown to reduce amyloid beta plaques in the brain, measured by positron emission tomography imaging, in a dose-dependent and time-dependent fashion that placebo could not match. The clinical question, whether removing those plaques actually slowed cognitive decline in patients, had produced conflicting results across the two trials. Both had been stopped early, in March of 2019, after an interim analysis predicted futility. When a larger dataset was later analyzed, one trial remained negative. The other showed a small benefit in its high-dose arm only, a result that emerged only after a protocol amendment partway through the trials. In November of 2020, the FDA’s independent Peripheral and Central Nervous System Drugs Advisory Committee was asked to evaluate the evidence. Of the eleven voting members, ten voted that the data did not show that aducanumab was effective. One was uncertain. None voted in favor.
Seven months later, the FDA approved the drug anyway. The agency invoked its accelerated approval pathway, which since 1992 has permitted authorization based on a drug’s effect on a surrogate endpoint that is “reasonably likely to predict” a clinical benefit, on the condition that a post-approval trial verify the actual benefit. The surrogate, in aducanumab’s case, was amyloid plaque reduction. The clinical benefit it was meant to predict was a slowing of cognitive decline. The FDA acknowledged in its own statement that uncertainty about the clinical benefit remained. Three members of the advisory committee resigned in protest. The price of the drug, on launch, was fifty-six thousand dollars per year. Medicare announced restrictive coverage limiting reimbursement to patients enrolled in clinical trials. A Congressional investigation, published in January of 2023, was sharply critical of both Biogen and the FDA over the approval process.
Less than three years after the approval, on January 31 of 2024, Biogen announced that it was discontinuing aducanumab. The company terminated its post-approval confirmatory trial, called ENVISION, which had been expected to report final results in 2030. The rights to the drug reverted to its original developer. The official reason given was a strategic reallocation of resources to other Alzheimer’s treatments, including lecanemab, the related anti-amyloid antibody this publication discussed in its prior pillar on relative risk.
The unofficial reason was that the drug had been a commercial failure and that the post approval evidence required to verify clinical benefit was unlikely to materialize in a form that would justify continued investment. Some patients had received the drug. Some had experienced amyloid related imaging abnormalities, the technical term for brain swelling and microhemorrhages, at rates well above what the trials had initially reported. The clinical benefit that the surrogate had been “reasonably likely to predict” remained, in the strict sense, unverified.
This piece is about the gap between a surrogate and what it is meant to stand for. It is one of the oldest and most consistently misunderstood patterns in clinical medicine, the kind of error that the field has documented, taught, written textbooks about, and then made again under different circumstances on a roughly twenty year cycle. The pattern is now playing out in healthcare AI and longevity science with renewed force, and the reader who has not learned to recognize it will be reading the next decade of claims at a substantial calibration disadvantage to the reader who has.
What a surrogate is, and what it is not
A surrogate endpoint is a measurable biological or clinical parameter that is meant to substitute for a clinical outcome that is harder to measure, slower to develop, or more expensive to study. The surrogate stands in for the outcome the doctor and the patient actually care about. Blood pressure stands in for cardiovascular events. Cholesterol levels stand in for heart attacks. Tumor shrinkage stands in for cancer survival. Bone density stands in for fracture risk. HbA1c stands in for diabetes complications. Amyloid plaque burden stands in, in some hands, for cognitive decline in Alzheimer’s disease. Each of these surrogates has a story that connects it, with varying degrees of evidence, to the clinical outcome it is meant to predict.
The reason surrogates exist is practical. The clinical outcomes that matter in chronic disease (mortality, disability, hospitalization, quality of life) can take years or decades to manifest. A trial designed to measure them directly is large, slow, and expensive. A trial designed to measure a surrogate that responds within months or a year is faster, smaller, and cheaper. For diseases where existing treatments are inadequate and patients are waiting, the case for using surrogates as evidence is genuine. For diseases where the science is well established and the link between the surrogate and the clinical outcome is robust, surrogates can be appropriate substitutes. The argument is not against surrogates as such. The argument is for clarity about what a particular surrogate has and has not been shown to predict, and for honesty about the gap between intermediate biological effect and meaningful clinical benefit.
The structural problem is that the use of a surrogate requires two distinct empirical claims. The first is that the intervention affects the surrogate. The second is that the surrogate predicts the clinical outcome. The first claim is usually testable in months. The second claim is usually testable only over years, and in many cases the testing has not been done with the rigor that the marketing of the surrogate implies. A drug that lowers a number on a lab test has answered the first question. Whether the lowered number translates into people living longer, healthier lives, or experiencing fewer of the events the lab test was supposed to predict, is the second question, and it is the question that surrogate based approvals, by design, defer.
When the second question is eventually answered, the answers have, with disturbing frequency, not been the ones the field expected. Much of the history of clinical medicine is a history of surrogates that worked beautifully on their measured endpoint and failed, sometimes catastrophically, on the clinical outcome.
CAST: the textbook teaching case
The clearest such case in the modern era is the Cardiac Arrhythmia Suppression Trial, which began enrolling patients in June of 1987. The clinical question was straightforward. Patients who had survived a myocardial infarction frequently developed irregular heartbeats, called ventricular premature depolarizations, in the period that followed. Statistical analyses of survivors had shown that patients with more frequent ventricular premature depolarizations died more often, particularly from sudden cardiac death. The intuition was direct. If the irregular beats were a marker of risk, then suppressing the irregular beats with antiarrhythmic drugs should reduce the risk. Three drugs, encainide, flecainide, and moricizine, had been shown to suppress the irregular beats effectively. The trial was designed to test whether suppressing them would reduce death.
By the time the encainide and flecainide arms of the trial were stopped in April of 1989, two years into enrollment, the answer was clear. The drugs had successfully suppressed the irregular beats. They had also more than doubled total mortality. In the published analysis of 1,498 randomized patients, 7.7 percent of patients receiving encainide or flecainide had died during the average ten-month follow-up period, compared with 3.0 percent of patients receiving placebo, a relative risk of 2.5. The arrhythmic deaths the trial had been designed to prevent had occurred in 5.9 percent of the active drug group compared with 2.2 percent of the placebo group, a relative risk of 3.6. The surrogate, suppression of ventricular premature depolarizations, had been achieved. The clinical outcome, survival, had moved in the opposite direction. The third drug, moricizine, was tested in a follow-up trial, CAST II, that was stopped early in August of 1991 for the same reason.
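The arithmetic behind these comparisons is simple enough to check by hand. The sketch below uses the total-mortality percentages stated above and computes a crude ratio of the two proportions, along with the absolute difference they imply. The trial’s reported relative risk of 2.5 came from its own statistical analysis, so the crude figure differs slightly. The function name `risk_summary` is hypothetical, chosen for illustration.

```python
def risk_summary(risk_treated: float, risk_control: float) -> dict:
    """Compare event proportions (0 to 1) between two trial arms."""
    difference = risk_treated - risk_control  # absolute excess risk
    return {
        "relative_risk": risk_treated / risk_control,  # crude ratio of proportions
        "excess_deaths_per_1000": difference * 1000,   # extra deaths per 1,000 treated
        "number_needed_to_harm": 1 / difference,       # patients treated per extra death
    }

# Total mortality over roughly ten months of follow-up, as stated in the text
cast = risk_summary(risk_treated=0.077, risk_control=0.030)
for name, value in cast.items():
    print(f"{name}: {value:.2f}")
```

Read this way, the stated percentages correspond to a crude relative risk of about 2.57, roughly 47 extra deaths per 1,000 patients treated over the follow-up period, or about one extra death for every 21 patients treated.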
The CAST findings were a textbook reversal. They had been preceded by years in which encainide, flecainide, and similar antiarrhythmic drugs had been prescribed widely on the assumption that suppressing the markers of arrhythmic risk would prevent the deaths those markers predicted. The number of excess deaths caused by that prescribing, in the years when the surrogate was treated as if it were the clinical outcome it was meant to predict, has been estimated at tens of thousands in the United States alone. The estimates vary. The order of magnitude does not. A class of drugs that had been widely adopted on the strength of a plausible mechanism and a surrogate endpoint was found, when the actual clinical outcome was measured, to be killing the patients it was supposed to save.
The lesson, which the field absorbed and is still teaching in medical schools, is that a surrogate is not a clinical outcome. The surrogate is a hypothesis about a clinical outcome. The hypothesis can be wrong, sometimes catastrophically. The cost of being wrong is paid by patients who took the drug on the assumption that the surrogate stood for what it was claimed to stand for. The discipline of waiting for the clinical outcome, of running the trial that takes years rather than months, is what protects patients from the version of the hypothesis that does not survive contact with the world.
The HDL story: a structural pattern
The CAST trial could be read as a single dramatic case. A more useful case for understanding the structural shape of the problem is the long, expensive, and unresolved story of HDL cholesterol as a surrogate endpoint for cardiovascular disease.
For decades, observational epidemiology had established that people with higher levels of high density lipoprotein, the so-called good cholesterol, had lower rates of cardiovascular events. The association was robust, replicated across populations, and biologically plausible given HDL’s role in reverse cholesterol transport. The intuition followed the same path as CAST. If higher HDL was associated with fewer events, then drugs that raised HDL should reduce events. Several pharmaceutical companies built large drug development programs around this premise. The drugs, called cholesteryl ester transfer protein, or CETP, inhibitors, were designed to raise HDL substantially.
Torcetrapib, the first of the class, was developed by Pfizer. Its Phase 3 trial, called ILLUMINATE, enrolled fifteen thousand patients at high cardiovascular risk. Torcetrapib raised HDL by approximately seventy percent compared with placebo, a striking effect on the surrogate. In December of 2006, Pfizer terminated the trial because patients on torcetrapib were dying more often. The drug was abandoned. Roche’s dalcetrapib, the second of the class, raised HDL by approximately thirty percent in its dal-OUTCOMES trial. The trial was stopped for futility in 2012. The drug did not reduce cardiovascular events. Lilly’s evacetrapib, the third, raised HDL substantially in the ACCELERATE trial. The trial was stopped for futility in 2015. Merck’s anacetrapib, the fourth, was tested in the REVEAL trial published in 2017. It produced a modest reduction in cardiovascular events, but the effect was attributed by many subsequent analyses to its concurrent LDL lowering rather than to its HDL raising. Merck did not pursue commercial development.
The total cost of the four programs, across the four companies, has been estimated at well over five billion dollars. The clinical lesson, in retrospect, was that HDL was probably a marker of cardiovascular risk rather than a cause of it. The people with higher HDL were healthier. The HDL was not, primarily, what was making them healthier. Drugs that raised HDL specifically, without affecting the underlying causal factors, did not reproduce the protection that natural variation in HDL had appeared to predict in epidemiology. The surrogate had reflected an association without reflecting a mechanism.
This is the structural pattern surrogates produce when the field forgets to test the second of the two empirical claims. A statistical association in observational data is treated as a causal mechanism, drugs are designed around the mechanism, the drugs successfully affect the surrogate, and the clinical outcome moves elsewhere or not at all. The HDL story took roughly twenty years to play out. The drugs in question are no longer on the market. The hypothesis that motivated them is no longer a serious target for new development. The cost of the long wait was paid in capital and in opportunity, and to a smaller degree in the patients who took the drugs in the trials that did not work out.
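The marker-versus-cause pattern in the HDL story can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not a model of real lipid biology: a hypothetical unmeasured “health” factor both raises HDL and lowers event risk, while HDL itself has no causal effect on events. The names `event_risk`, `simulate`, and `event_rate`, and every parameter value, are illustrative.

```python
import math
import random

def event_risk(health: float, hdl: float) -> float:
    """Toy causal model: risk depends on underlying health only; hdl is ignored."""
    return 1 / (1 + math.exp(2 + health))

def simulate(n: int = 10_000, seed: int = 1, hdl_boost: float = 0.0):
    rng = random.Random(seed)  # same seed -> same simulated people in every run
    people = []
    for _ in range(n):
        health = rng.gauss(0, 1)                 # hypothetical unmeasured factor
        hdl = 50 + 8 * health + rng.gauss(0, 5)  # HDL tracks health: a marker
        hdl += hdl_boost                         # a hypothetical drug that raises HDL only
        had_event = rng.random() < event_risk(health, hdl)
        people.append((hdl, had_event))
    return people

def event_rate(group) -> float:
    return sum(had_event for _, had_event in group) / len(group)

# Observational view: sort people by HDL and compare the two halves.
baseline = simulate()
by_hdl = sorted(baseline)
low, high = by_hdl[: len(by_hdl) // 2], by_hdl[len(by_hdl) // 2 :]
print(f"event rate, low-HDL half:  {event_rate(low):.3f}")
print(f"event rate, high-HDL half: {event_rate(high):.3f}")

# Interventional view: raise everyone's HDL by 20 units.
treated = simulate(hdl_boost=20.0)
print(f"event rate, no drug:  {event_rate(baseline):.3f}")
print(f"event rate, HDL drug: {event_rate(treated):.3f}")
```

Sorting by HDL reproduces the observational association, with fewer events in the high-HDL half. Raising every HDL value leaves the event rate exactly unchanged, because both runs draw the same simulated people and their risk never depended on HDL in the first place: the surrogate moved, and the outcome did not.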
Oncology and the accelerated approval pathway
The contemporary version of the surrogate question, the one that affects the most patients and the most healthcare spending in 2026, is the use of progression free survival as a surrogate for overall survival in cancer drug approvals.
Progression free survival is the time from randomization until the disease either gets measurably worse or the patient dies. Overall survival is the time from randomization until the patient dies. The first is, by construction, easier to demonstrate in a short trial. Tumors can be measured serially. Progression can be defined by predetermined criteria. A drug that delays progression by three months produces a measurable difference within the trial period. A drug that improves survival by three months requires waiting until enough patients have died to detect the difference, which can take years longer.
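The two definitions can be made concrete with a small sketch. It assumes, for simplicity, that every patient’s progression and death times are fully observed, with no censoring, which real trials cannot assume. The cohorts are invented for illustration, and the `Patient` type and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Tuple

@dataclass
class Patient:
    progression_month: Optional[float]  # None if the disease never progressed
    death_month: float                  # assumption: every death is observed

def progression_free_survival(p: Patient) -> float:
    """Time to progression or death, whichever comes first."""
    if p.progression_month is None:
        return p.death_month
    return min(p.progression_month, p.death_month)

def overall_survival(p: Patient) -> float:
    """Time to death from any cause."""
    return p.death_month

def medians(cohort: List[Patient]) -> Tuple[float, float]:
    return (median(progression_free_survival(p) for p in cohort),
            median(overall_survival(p) for p in cohort))

# Invented cohorts: the drug delays progression by three months,
# but every patient dies at the same time as in the control arm.
control = [Patient(4, 14), Patient(6, 18), Patient(8, 20)]
treated = [Patient(7, 14), Patient(9, 18), Patient(11, 20)]

print("control median PFS, OS:", medians(control))  # (6, 18)
print("treated median PFS, OS:", medians(treated))  # (9, 18)
```

In this invented example, median progression free survival improves by three months while median overall survival does not move at all, which is exactly the gap a surrogate-based approval leaves open.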
The FDA’s accelerated approval pathway has, since 1992, allowed cancer drugs to be approved on the basis of progression free survival, or in some cases on the basis of tumor response rate, with the requirement that overall survival be verified in a post approval trial. The pathway has authorized many drugs that subsequent trials showed produced no overall survival benefit, in which case the accelerated approval is supposed to be withdrawn. Withdrawal has happened in some cases, with delay. In other cases, the drugs have remained on the market despite negative confirmatory trials, sometimes for years.
A 2019 analysis in JAMA Internal Medicine by Bishal Gyawali and colleagues, examining cancer drugs that had received accelerated approval, found that only a minority of the confirmatory trials reported an improvement in overall survival; many measured another surrogate instead. A separate body of work has documented that the correlation between progression free survival and overall survival, across cancer types and across drug classes, is often weaker than regulatory practice assumes. The surrogate is sometimes a good predictor of the clinical outcome. It is sometimes not. In practice, the accelerated approval pathway often treats it as if it were a good predictor, with the verification deferred to post-approval trials that, in many cases, never produce conclusive data within a timeframe useful to the patients who took the drug during the years of uncertainty.
This is not, again, an argument against accelerated approval. It is an argument for the reader to know which approvals were based on surrogates, which surrogates have been independently validated against clinical outcomes, and which have not. The reader who knows that a cancer drug was approved on tumor shrinkage and never demonstrated an overall survival benefit is reading a different evidence base than the reader who knows only that the drug is FDA approved. The two readings produce different inferences about what the drug can do for a particular patient.
Healthcare AI and longevity science
The surrogate problem in healthcare AI and longevity science is, in this publication’s view, the central methodological challenge of the next decade in those fields. Almost every measurement these fields produce is a surrogate. The wearable devices that report heart rate variability, sleep stages, recovery scores, and stress metrics produce surrogates for clinical outcomes that the devices themselves cannot directly measure. The continuous glucose monitors that report time in range and glucose variability produce surrogates for diabetes complications that develop over years. The biological age tests that report methylation age, telomere length, or composite epigenetic scores produce surrogates for actual longevity. The longevity supplement industry, which generates billions of dollars in revenue annually, runs almost entirely on biomarker surrogates that have not been validated against the lifespan and healthspan outcomes the marketing implies.
The structural shape of the surrogate problem is recognizable across these categories. A measurable parameter is correlated, in observational data, with a clinical outcome of interest. The correlation is taken to imply that intervening on the parameter will improve the outcome. Products and interventions are designed to move the parameter. The parameter moves. The clinical outcome, in many cases, is not measured at all, or is measured only in small, short term studies that lack the statistical power and follow up duration to detect what the surrogate is meant to predict. The marketing claims the surrogate effect as evidence of the clinical benefit. The reader receives a number and does not, in most cases, ask whether the number is a measurement or a hypothesis.
The careful reader of any healthcare AI or longevity claim should now have a question prepared. What is the surrogate, and what clinical outcome is it claimed to predict? The two are different things. The first is, in most cases, an empirical claim about a measurement. The second is, in most cases, an unverified hypothesis about the relationship between the measurement and the outcome the patient cares about.
Why surrogates persist
The persistence of surrogate based claims, despite the long history of surrogates that did not predict what they were claimed to predict, has structural causes worth understanding.
The first is that surrogates produce papers, drugs, and products on timescales that match the career incentives of researchers, the development timelines of companies, and the budgeting cycles of investors and payers. The clinical outcomes the surrogates substitute for take longer. A field that waited for the clinical outcomes would produce fewer publications, fewer approvals, and fewer products. The structural pressure is for surrogates, even when the surrogates have not been independently validated as predictors of the outcomes that matter.
The second is that the regulatory system, in its current form, accommodates surrogates explicitly. The FDA’s accelerated approval pathway, the European Medicines Agency’s analogous conditional approval, and the FDA’s 510(k) device pathway all permit authorization on the basis of evidence that does not directly demonstrate clinical benefit. The pathways were designed in good faith to balance the urgency of patient need against the cost of long term verification. The pathways have, in practice, produced a regulatory landscape in which a substantial fraction of approved products rest on surrogates that may or may not eventually be validated.
The third is the amplification chain we have written about in this publication’s anchor essay. Surrogate based claims travel through press releases, news articles, and marketing materials with the surrogate often treated as if it were the clinical outcome. The press release describes a drug that “improves cognition” when the trial measured plaque burden. The news article describes a wearable device that “improves longevity” when the device measures sleep stages. The marketing copy describes a supplement that “slows aging” when the underlying study measured methylation age. The surrogate becomes the clinical outcome in the public version of the claim. The reader at the end of the chain has no easy way to walk back to the original measurement and to ask what the surrogate actually demonstrated.
The fourth, and most consequential for healthcare AI, is that the surrogate question is harder to ask than it appears. Many surrogates have partial validation. Many have validation in some populations and not others. Many have changed in their relationship to the clinical outcome as the underlying disease, the available treatments, or the standard of care has evolved. A surrogate that predicted an outcome reliably in 1995 may not predict it reliably in 2026, because the population, the treatment landscape, and the outcomes themselves have shifted. The question requires evidence, not just intuition.
The reader’s method
A short working method, sufficient for most healthcare AI and longevity claims, follows from the foregoing.
When a claim describes an intervention as improving a measurable parameter, ask whether that parameter is a clinical outcome or a surrogate. The clinical outcome is what the patient experiences: living longer, feeling better, avoiding hospitalization, avoiding disability. The surrogate is a measurement that is claimed to predict the clinical outcome. The two are not the same. They should not be conflated, even when the marketing copy does so.
When the parameter is a surrogate, ask what evidence connects it to the clinical outcome. The evidence may be strong, with decades of validation, multiple randomized trials, and an established mechanism. The evidence may be weak, with observational associations only, no randomized trial verification, and a plausible but unproven mechanism. The strength of the surrogate’s connection to the outcome determines the inferential weight of the claim.
When the evidence is weak, treat the claim as preliminary, regardless of how confidently the marketing presents it. A surrogate without independent validation is a hypothesis about a clinical benefit, not a demonstration of one. The hypothesis may be correct. It may also, on the historical record, not survive contact with a properly designed clinical outcome trial.
When the claim is about a healthcare AI product, look for the clinical outcome the product is claimed to improve, and check whether the published evidence supports that claim or supports only a related surrogate. An AI tool that detects more cases of a condition on imaging has demonstrated something about the surrogate of detection. Whether the additional detection translates into better outcomes for the patients detected is a separate question that the detection performance, on its own, does not answer.
When the claim is about a longevity intervention, recognize that almost the entire field operates on surrogates, that most of those surrogates have not been validated against actual lifespan or healthspan outcomes, and that the relationship between the surrogates and the outcomes is, in most cases, an open scientific question rather than a settled one.
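The method above can be condensed into a small decision sketch. It illustrates the reasoning rather than serving as a validated instrument: the three validation levels, the `Claim` type, and the `read_claim` function are hypothetical simplifications, and real evidence rarely sorts this cleanly.

```python
from dataclasses import dataclass
from enum import Enum

class Validation(Enum):
    NONE = "observational association only"
    PARTIAL = "validated in some populations or settings, not others"
    STRONG = "validated against the outcome in randomized trials"

@dataclass
class Claim:
    measured: str                             # what the study actually measured
    is_clinical_outcome: bool                 # survival, function, hospitalization, disability
    validation: Validation = Validation.NONE  # only meaningful for surrogates

def read_claim(claim: Claim) -> str:
    """Return a rough reading of how much inferential weight a claim carries."""
    if claim.is_clinical_outcome:
        return "clinical outcome: ask whether the effect is large enough to matter"
    if claim.validation is Validation.STRONG:
        return "validated surrogate: check that the population and setting match"
    if claim.validation is Validation.PARTIAL:
        return "partially validated surrogate: check where the validation holds"
    return "unvalidated surrogate: treat as a preliminary hypothesis"

examples = [
    Claim("all-cause mortality", is_clinical_outcome=True),
    Claim("methylation age", is_clinical_outcome=False),
    Claim("hypothetical biomarker validated in one population only",
          is_clinical_outcome=False, validation=Validation.PARTIAL),
]
for claim in examples:
    print(f"{claim.measured}: {read_claim(claim)}")
```

The order of the checks mirrors the text: first separate the clinical outcome from the surrogate, then ask how well the surrogate has been validated, and default to treating an unvalidated surrogate as a hypothesis.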
Back to aducanumab
The aducanumab story, in retrospect, was a textbook surrogate failure. The drug had been approved on the strength of only one of the two empirical claims required for a surrogate-based approval to be sound. The drug affected the surrogate. Amyloid plaque burden, measured on PET imaging, did go down. The second empirical claim, that the surrogate predicted the clinical outcome of cognitive decline, was, at the moment of approval, a hypothesis. The hypothesis was reasonable. It was based on decades of mechanistic work tying amyloid to Alzheimer’s pathology. Against it stood a long history of drugs that had reduced amyloid without affecting cognition. The clinical question was never settled: the confirmatory trial meant to answer it was terminated before it could report, and the drug was discontinued in the meantime.
The reader who, in 2021, had asked whether the approval was based on a surrogate, and whether the surrogate had been independently validated as a predictor of the cognitive outcome the drug was being marketed to address, would have been operating with substantially better calibration than the reader who treated the approval as evidence that the drug worked. The same question, applied to every healthcare AI and longevity claim that crosses the reader’s path, will produce the same calibration advantage.
The lecanemab story, which this publication discussed in its prior pillar on relative risk, is unfolding under different conditions. Lecanemab first received accelerated approval in January of 2023 on the basis of amyloid reduction, but its conversion to traditional approval in July of 2023 rested on a measured effect on the clinical outcome of cognitive decline, not on the surrogate of amyloid burden alone. The size of the effect, and whether it justifies the cost and risk profile, is the subject of ongoing debate. The frame of the debate, however, is different from the frame of the aducanumab debate. The clinical outcome was measured. The question is whether the measured effect is clinically meaningful, not whether the underlying biology can support the claim. These are different questions, and the distinction between them is the distinction between a surrogate-based approval and a clinical-outcome-based one.
The discipline of asking which kind of approval, which kind of evidence, and which kind of empirical claim is on the table is one of the most reliable tools the careful reader of healthcare AI and longevity science has. The next pillar in this series will examine what happens to that discipline when the financial incentives behind a claim push against the reader’s interest in answering the question honestly.
Sources and further reading
CAST Investigators. Preliminary report: Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. New England Journal of Medicine. 1989;321(6):406 to 412.
Echt DS, Liebson PR, Mitchell LB, et al. Mortality and morbidity in patients receiving encainide, flecainide, or placebo: The Cardiac Arrhythmia Suppression Trial. New England Journal of Medicine. 1991;324(12):781 to 788.
Moore TJ. Deadly Medicine: Why Tens of Thousands of Heart Patients Died in America’s Worst Drug Disaster. Simon & Schuster, 1995.
Barter PJ, Caulfield M, Eriksson M, et al. Effects of torcetrapib in patients at high risk for coronary events (ILLUMINATE trial). New England Journal of Medicine. 2007;357(21):2109 to 2122.
Schwartz GG, Olsson AG, Abt M, et al. Effects of dalcetrapib in patients with a recent acute coronary syndrome (dal-OUTCOMES trial). New England Journal of Medicine. 2012;367(22):2089 to 2099.
Lincoff AM, Nicholls SJ, Riesmeyer JS, et al. Evacetrapib and cardiovascular outcomes in high-risk vascular disease (ACCELERATE trial). New England Journal of Medicine. 2017;376(20):1933 to 1942.
HPS3/TIMI55-REVEAL Collaborative Group. Effects of anacetrapib in patients with atherosclerotic vascular disease. New England Journal of Medicine. 2017;377(13):1217 to 1227.
US Food and Drug Administration. Aducanumab (Aduhelm) approval letter and accelerated approval documentation, June 7, 2021. See also the corresponding advisory committee meeting documentation, November 2020.
Belluck P, Robbins R. Coverage of the aducanumab approval and post approval period, The New York Times, 2021 through 2024.
US House Committee on Oversight and Reform and Committee on Energy and Commerce. Joint staff report on the aducanumab approval process, January 2023.
Gyawali B, Hey SP, Kesselheim AS. Assessment of the clinical benefit of cancer drugs receiving accelerated approval. JAMA Internal Medicine. 2019;179(7):906 to 913.
Prasad V, Kim C, Burotto M, Vandross A. The strength of association between surrogate end points and survival in oncology: a systematic review of trial-level meta-analyses. JAMA Internal Medicine. 2015;175(8):1389 to 1398.
Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Annals of Internal Medicine. 1996;125(7):605 to 613.
van Dyck CH, Swanson CJ, Aisen P, et al. Lecanemab in early Alzheimer’s disease. New England Journal of Medicine. 2023;388(1):9 to 21.
US Food and Drug Administration. Accelerated approval program. Guidance document and historical authorization data, accessed 2026.
For the broader methodological treatment of surrogate endpoints, see Ellenberg SS, Hamilton JM. Surrogate endpoints in clinical trials: cancer. Statistics in Medicine. 1989;8(4):405 to 413; and Fleming TR. Surrogate endpoints and FDA’s accelerated approval process. Health Affairs. 2005;24(1):67 to 78.
