
After the Bitter Lesson

How healthcare AI actually gets built in the foundation model era


In March of 2023, researchers from Microsoft and OpenAI posted a paper to arXiv with a result that should have shaken the foundations of every healthcare AI startup in the world. They had run GPT-4, a general purpose language model trained on the open internet with no specialized medical fine tuning, against questions from the United States Medical Licensing Examination. The model exceeded the passing threshold by more than twenty points. It also outperformed Med-PaLM, Google’s purpose built medical AI system that had been carefully fine tuned on medical literature and clinical question and answer datasets. The general model, with no medical specialization at all, beat the specialist.

Google’s response, Med-PaLM 2, caught back up a short while later, in work published in Nature Medicine the following year. But the way it caught up matters. Med-PaLM 2 was not built by adding more medical knowledge to the previous Med-PaLM. It was built on a newer general purpose base model, PaLM 2, with the same kind of medical fine tuning the original Med-PaLM had used. The gains came from the base. The specialist caught up not by becoming more specialized but by being built on top of a better generalist.

Five months after the GPT-4 paper appeared, in August of 2023, Babylon Health filed for Chapter 7 bankruptcy in the United States. Babylon had been founded in 2013 in London with a vision of replacing the general practitioner appointment with an AI symptom checker. At its peak, after a 2021 SPAC merger, the company was valued at 4.2 billion dollars. It served patients in the United Kingdom under contracts with the National Health Service, in the United States under commercial insurance deals, in Rwanda under a national health partnership, and across more than a dozen other countries. By 2022, revenue had reached 1.1 billion dollars on a net loss of 221 million dollars. By the autumn of 2023, the company had collapsed. Its US operations entered Chapter 7 liquidation. Its UK operations were sold for five hundred thousand pounds.

There were many causes of Babylon’s collapse. Its business model lost money on most patients. Its clinical validation studies were repeatedly criticized in The Lancet and by UK regulators. Its public claims about its AI’s diagnostic accuracy were challenged by clinicians at the National Health Service. Its SPAC was timed badly for the public markets. But underneath those proximate causes sits a structural fact that the field has been slow to absorb. Babylon’s diagnostic and triage system was built on the painstaking encoding of medical knowledge into a custom AI architecture, with rule based reasoning layered with statistical methods, fine tuned by clinical experts over a decade. By the time the company collapsed, much of what its specialized system had claimed to do was already being done, more flexibly and at lower cost, by general purpose foundation models that had not been built for medicine at all.

This is the bitter lesson. It has been visible in artificial intelligence for seventy years. It has now played out, in text and reasoning, in healthcare AI. The case studies are no longer forecasts. They are history. The interesting question is no longer whether the bitter lesson applies to medicine. It is what gets built on the other side of it.

Sutton’s essay

The phrase comes from a short essay by Richard Sutton, a Turing Award winning AI researcher at the University of Alberta and DeepMind, posted to his personal website in March of 2019. The essay runs about a thousand words. It is one of the most influential documents in modern AI. Its core claim is direct.

The biggest lesson from seventy years of artificial intelligence research, Sutton wrote, is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason is Moore’s law, or more precisely the continued exponential decline in the cost of computation. Most AI research is conducted as if the computational resources available to a system were fixed, which makes encoding human knowledge into the system seem like the obvious path to better performance. But over even moderately long time horizons, vastly more computation becomes available, and the systems that scale with that computation outperform the systems that do not.

The pattern, Sutton argued, has repeated itself across every major subfield of AI. In computer chess, researchers spent decades trying to encode human chess understanding into evaluation functions. The system that beat Garry Kasparov in 1997, Deep Blue, did so primarily through massive deep search, an approach the chess AI community had treated with dismay. In computer Go, where the search space is too large for brute force to work, researchers spent another twenty years encoding expert pattern recognition into specialized algorithms. AlphaGo, the system that beat Lee Sedol in 2016, did so primarily through self play and learning, with most of the expert encoded pattern recognition abandoned. In speech recognition, the long dominance of hand engineered phoneme models and statistical pipelines was eventually broken by neural networks trained end to end on massive datasets. In computer vision, decades of work on hand crafted features were swept away by deep learning after the 2012 ImageNet result. In natural language processing, specialized models for translation, summarization, and question answering were superseded by general large language models trained on internet scale text.

In every case, the same dynamic. Researchers, drawing on their own understanding of the domain, build systems that embed human knowledge into the architecture. The systems improve incrementally for some years. Then, over a slightly longer time horizon, general methods with more computation arrive and outperform the specialized systems decisively. The researchers who had invested careers in the specialized approach resist for a while, and then capitulate, and then the field reorganizes around the new general method. The lesson is bitter because human knowledge approaches feel satisfying. They reflect our understanding of the domain. They make sense to the people who built them. They also lose, repeatedly, to less satisfying approaches that simply scale.

The bitter lesson is not a theorem. It is an empirical observation drawn from the history of a single field. It admits exceptions. It is contested in its details. But the broad pattern is real enough that, by the late 2010s, most serious AI researchers had absorbed it as a working assumption. The frontier labs reorganized around scaling. The investments followed. The wave of large language models that emerged after 2020 was, in a meaningful sense, the bitter lesson’s predicted next chapter.

What had not yet happened, when Sutton wrote his essay in 2019, was the application of this dynamic to healthcare AI. The field was still spending most of its venture capital on the specialized approach. It would take three more years for the pattern to become visible in medicine, and the visibility, when it came, came suddenly.

The healthcare AI version of the story

For most of the last decade, the dominant model of healthcare AI startup development went roughly as follows. A team of clinical experts, machine learning engineers, and domain specialists assembled around a specific medical problem. Sepsis prediction. Diabetic retinopathy detection. Skin cancer triage. Radiology workflow optimization. Drug discovery for a particular therapeutic area. The team gathered a dataset, often through partnerships with health systems or pharmaceutical companies, and built a specialized model. The model architecture was chosen to reflect the structure of the problem. Convolutional neural networks for images. Recurrent networks for time series. Graph neural networks for molecular structures. Expert systems for clinical decision support. The model was trained, validated on retrospective data, run through whatever regulatory pathway applied, and then sold to hospitals, insurers, or directly to consumers as a clinically validated tool.

This is the approach the bitter lesson predicted would lose. The predicted losing has now happened, visibly, with two case studies the field will be teaching as cautionary tales for years.

Consider the case of IBM Watson for Oncology, now the canonical example of the specialized approach failing at scale. IBM, riding the credibility it had built when Watson won the television game show Jeopardy! in 2011, announced that the same technology would be applied to medicine. Watson for Oncology was built in partnership with Memorial Sloan Kettering Cancer Center. The training process, as it later emerged through internal documents obtained by STAT News and published in 2018, was a paradigm case of building human knowledge into a system. MSKCC oncologists worked with IBM engineers to encode their clinical preferences. Synthetic cases, compiled by the doctors rather than drawn from real patient outcomes, were used to teach the system how to recommend treatments. The result was a system that reflected the treatment preferences of a small number of senior oncologists at a single elite cancer center, dressed up with natural language processing and presented as artificial intelligence.

The product failed, repeatedly and visibly. MD Anderson Cancer Center, which had built a separate Watson based oncology tool through a partnership starting in 2012, cancelled the project in 2016 after spending sixty two million dollars. The 2018 STAT investigation revealed internal IBM documents in which Watson Health executives acknowledged that the system was making “unsafe and incorrect” treatment recommendations, including a recommendation to give combination chemotherapy with a drug carrying a black box warning against use in patients with severe bleeding to a sixty five year old lung cancer patient with documented severe bleeding. Hospitals adopted the system, found it unhelpful, and dropped it. IBM eventually divested Watson Health in 2022 to a private equity firm, after years of writedowns and the apparent loss of most of the multi billion dollar investment the company had made in the program.

The post mortems on Watson for Oncology have, mostly, focused on the proximate causes. Bad training data. Synthetic cases instead of real outcomes. Overreliance on a single institution’s preferences. The hype outpacing the delivery. These are all true. They are also surface explanations of a deeper failure. The bitter lesson view is that no amount of better training data, drawn from a wider set of institutions, with cleaner labels, would have saved the approach. The approach was wrong. It was wrong in the same way every other attempt to encode expert knowledge into a specialized AI system has been wrong, going back to the chess programs of the 1970s. The frontier in artificial intelligence was moving toward general models trained at scale, and a system architected around expert encoded clinical reasoning was always going to be eaten by a general model that learned medicine as a small fraction of what it learned about everything else.

The Babylon Health story is the next chapter of the same lesson. Babylon was, in some ways, a more sophisticated attempt than Watson for Oncology. The company built a probabilistic graphical model that encoded relationships between symptoms, diseases, and patient characteristics, fine tuned by clinical experts, and deployed it through a consumer facing chatbot. The clinical machinery underneath was substantial. The team was reputable. The venture capital, in total, exceeded six hundred million dollars. The 2018 claims that the system could match a general practitioner on a UK licensing examination generated international coverage and a wave of regulatory and clinical pushback. By 2021, The Lancet had published a critical analysis of a Babylon sponsored study, finding that the evidence did not support the company’s diagnostic claims. By 2023, the company was bankrupt.

Babylon’s bankruptcy happened in the same calendar year that GPT-4 was released and that the first papers showing general LLMs passing medical examinations appeared. The timing is not coincidental. The world in which it made sense to spend hundreds of millions of dollars building a specialized probabilistic clinical reasoning engine was a world in which no general model existed that could approximate clinical reasoning at scale. That world ended somewhere between 2022 and 2024. Babylon’s product, even if its other problems had been fixed, was now in competition with general systems that did much of what it claimed to do, more flexibly, for a small fraction of the marginal cost per query.

The steepness of the curve

The trajectory of medical AI performance since March of 2023 makes the bitter lesson’s force in healthcare hard to miss, and the curve has steepened rather than slowed in the three years since.

The standard story, told on the MedQA benchmark drawn from United States Medical Licensing Examination questions, starts with Med-PaLM, Google’s specialized medical AI from late 2022, which reached approximately sixty seven percent accuracy after years of focused work and represented the state of the art for purpose built medical AI. GPT-4, released in March of 2023 with no medical training at all, reached approximately eighty seven percent on USMLE style questions, surpassing the passing threshold by more than twenty points. GPT-4o, released in May of 2024, climbed to approximately ninety percent. GPT-5, released in August of 2025, reached approximately ninety six percent. By early 2026, the standard four option MedQA benchmark was widely considered saturated, with the frontier cluster of models, including OpenAI’s o1 and GPT-5.1, Google’s Gemini 3.1 Pro, and Anthropic’s Claude Opus 4 family, performing at ninety six to ninety seven percent.

That was the easy part. The benchmarks that have replaced MedQA as the meaningful frontier are substantially harder, and the rate of progress on them is, if anything, faster than the rate of progress on MedQA was. GPQA Diamond, a benchmark of graduate level questions in physics, chemistry, and biology designed to be Google proof and to require genuine PhD level scientific reasoning, was introduced in 2023 and considered extremely difficult. By the spring of 2026, the frontier cluster, including OpenAI’s GPT-5.4 Pro at 94.4 percent, Google’s Gemini 3.1 Pro at 94.3 percent, Anthropic’s Claude Opus 4.7 at 94.2 percent, and OpenAI’s GPT-5.2 Pro at 93.2 percent, was clustered tightly enough that this benchmark too was approaching saturation. FrontierMath, a benchmark of expert level mathematics problems on which most professional mathematicians cannot perform without substantial effort, saw GPT-5.5 reach 35.4 percent on Tier 4 problems in April of 2026, up from GPT-5.4’s 22.9 percent six weeks earlier. Humanity’s Last Exam, a benchmark designed to test reasoning at the absolute frontier of expert human knowledge, saw Claude Opus 4.7 reach 46.9 percent without tools and Grok 4 reach 50.7 percent with tools in the spring of 2026.


The Artificial Analysis Intelligence Index, which aggregates performance across multiple frontier benchmarks into a single composite score, captures the shape of the curve cleanly. GPT-5.5, released April 23, 2026 as OpenAI’s first fully retrained base model since GPT-4.5, scored 60 on the index, the highest score ever recorded. Claude Opus 4.7, released one week earlier on April 16, 2026, scored 57. Gemini 3.1 Pro, released in February of 2026, also scored 57. In an internal OpenAI evaluation, GPT-5.5 won or tied against industry professionals at well specified knowledge work tasks across forty four occupations at a rate of 84.9 percent. The release cadence has compressed to weeks. GPT-5.4 shipped on March 5, 2026, GPT-5.5 on April 23, 2026, a span of six weeks, with intermediate Anthropic, Google, and Meta releases in between. The frontier in 2026 is, in OpenAI president Greg Brockman’s framing of the GPT-5.5 release, “a new class of intelligence for real work.”

The shape of the trajectory matters more than any individual benchmark number. In late 2022, the field’s best specialized medical AI was achieving sixty seven percent on a benchmark, and the rate of improvement had been slow for years. In three and a half years, the standard medical benchmark has gone from sixty seven percent to ninety six. New benchmarks designed to be substantially harder have, in their first eighteen to twenty four months, gone from “extremely difficult” to “approaching saturation.” The gains have come, in every case, almost entirely from improvements in the underlying general models, not from more aggressive medical or domain fine tuning. Med-PaLM 2 caught up to GPT-4 by upgrading its base model, not by adding more medical training. The specialized clinical AI systems that exist in 2026 are, with rare exceptions, downstream of foundation model improvements that no specialized approach could have matched on its own.

Standard text question answering is, of course, not the same as deployment. The benchmarks that matter most for clinical use, including multimodal benchmarks that combine clinical narratives with images, are not yet saturated. In domains like mammography, where specialized convolutional neural networks have been refined for years, the latest general purpose models still lag substantially. GPT-5 in 2025 reached approximately fifty to sixty four percent on certain mammography benchmarks where specialized systems exceed eighty percent. Neuroradiology benchmarks show similar gaps. The bitter lesson is playing out at full speed in text, clinical reasoning, and increasingly in unified multimodal architectures. It is playing out more slowly, with more remaining work, in fine grained visual perception. This is the kind of nuance the bitter lesson predicts. Specialized models lose, on average, over moderate horizons, in domains where general methods can plausibly arrive. They lose first and fastest in domains where the data and task structure pattern match the general training. They lose later, or sometimes not at all, in domains where the specialized data is unusual, the perception task is fine grained, or the deployment context is hard for a general model to access.

There is one important caveat that the 2026 evidence is making increasingly visible. The frontier models, despite their accelerating capability gains, retain meaningful hallucination problems. On AA-Omniscience, an industry benchmark testing accurate factual recall across many domains, GPT-5.5 achieves the highest accuracy ever recorded, 57 percent, but carries an 86 percent hallucination rate, meaning that when the model does not know the answer, it is overwhelmingly likely to answer confidently and incorrectly rather than to express uncertainty. Claude Opus 4.7 carries a 36 percent hallucination rate on the same benchmark. Gemini 3.1 Pro sits at 50 percent. For high stakes domains like medicine, finance, and law, this gap between raw capability and reliability is the operationally important constraint, not the benchmark score. The bitter lesson predicts that general models will continue to gain on specialized models in capability. It does not predict, and the 2026 evidence does not yet support, that this capability translates directly into clinical safety. The verification work this publication exists to do becomes more important, not less, as the gap between confident capability and reliable safety widens.
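
To make the distinction concrete, here is a minimal sketch of how accuracy and hallucination rate can diverge. The counts are hypothetical, and the scoring rule, incorrect answers as a share of everything the model did not get right, is an assumption about how benchmarks of this kind are typically reported rather than a description of AA-Omniscience’s exact methodology.

```python
# A minimal sketch of the accuracy versus hallucination-rate distinction.
# The counts below are hypothetical, and the definition of hallucination rate
# (incorrect answers as a share of all non-correct responses) is an assumption
# about how benchmarks of this kind are typically scored.

def score(correct: int, incorrect: int, abstained: int) -> dict:
    total = correct + incorrect + abstained
    not_correct = incorrect + abstained
    return {
        "accuracy": correct / total,
        # Of the questions the model did not get right, how often did it
        # answer confidently and wrongly instead of declining to answer?
        "hallucination_rate": incorrect / not_correct if not_correct else 0.0,
    }

# Two hypothetical models with similar accuracy but very different reliability.
confident_model = score(correct=570, incorrect=370, abstained=60)
cautious_model = score(correct=540, incorrect=160, abstained=300)

print(confident_model)  # accuracy ~0.57, hallucination rate ~0.86
print(cautious_model)   # accuracy ~0.54, hallucination rate ~0.35
```

The point of the arithmetic is that a model can lead the field on accuracy while being the least trustworthy when it is uncertain, which is exactly the profile that matters in a clinical setting.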

For the healthcare AI market in 2026, this means something fairly precise. The text and reasoning capabilities of general models have, in three and a half years, gone from below the passing threshold of the United States Medical Licensing Examination to a level where they are competing with experts on graduate level science questions in physics, chemistry, and biology. Specialized text based clinical AI systems built before this acceleration are in a difficult position. They are competing with general models that have surpassed their performance on the canonical benchmark while requiring no medical specific architecture. The multimodal and visual side of the field is moving more slowly but in the same direction. By the end of the decade, the bitter lesson will likely have played out in much of healthcare AI’s visual and multimodal layer the same way it has played out in text. The companies that survive will be those that anticipated this and built accordingly, with the understanding that the verification, deployment, and safety work is now the operationally hard part of the problem.

This is the second half of the bitter lesson story, the half that is good news. The gains in general capability that the field has seen since March of 2023 are real, and they are reaching patients now in tools that would have taken decades to build with the specialized approach. The diagnostic support, clinical question answering, summarization, and decision support tools that hospitals are deploying in 2026, the patient facing chatbots that pharmaceutical companies and insurance companies are running, the documentation and workflow tools that are starting to relieve clinician burnout, are all downstream of foundation model improvements that have happened on a timeline no specialized approach could have matched. The bitter lesson is bad news for a specific kind of bet. It is good news for the overall mission, and the reader who internalizes the trajectory is in a position to evaluate the field’s next decade with substantially better calibration than the reader who has not.

What this means for the healthcare AI market

A healthcare AI startup pitching investors in 2026 is, in most cases, pitching one of three things. The first kind of pitch is built around a specialized model the team has built for a specific clinical problem, with FDA clearance, retrospective validation data, and a defensible go to market in a specific hospital network. The second kind is built around a foundation model wrapper, where the team has fine tuned a general purpose LLM or vision model for a specific clinical use case and built workflow and integration around it. The third kind is built around proprietary clinical data the company has gathered or licensed, regardless of what model is used to extract value from it.

The bitter lesson view of these three pitches is approximately as follows. The first kind, the specialized model pitch, is in the position the chess AI specialists were in by 1995 and the Go specialists were in by 2014. The next general model upgrade, which at the frontier’s six week release cadence in 2026 is rarely more than a quarter away, will likely match or exceed their performance at a fraction of the cost. The specialized clinical AI startup that has spent forty million dollars and four years building a sepsis prediction model is in competition with the general foundation model team that, on a Tuesday afternoon, fine tunes a base model on a sepsis dataset and matches their performance. The competitive dynamics are not symmetric. The specialized team has a sunk cost in their architecture. The general team has none. As compute costs continue to fall and base models continue to improve, the specialized model’s relative position can only deteriorate. The categories where this prediction does not hold are the ones the previous section identified: fine grained visual perception in highly specialized clinical imaging, regulatory pathways that confer durable structural advantage, and deployment moats that are independent of the model.
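
As a rough illustration of how little model-side work that now involves, here is a minimal sketch of fine tuning a general purpose base model as a classifier over clinical notes, using the Hugging Face Transformers library. The dataset files, the column names, and the choice of base model are assumptions made for the example; nothing in it is specific to sepsis or validated for clinical use.

```python
# A minimal sketch, not a clinically validated pipeline: fine tuning a
# general purpose base model as a binary classifier over clinical notes.
# The CSV files and their "text" and "label" columns are hypothetical.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "bert-base-uncased"  # stand-in for whatever general base model is current

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

# Hypothetical clinical notes with a binary outcome label (0 = no event, 1 = event).
data = load_dataset(
    "csv",
    data_files={"train": "notes_train.csv", "validation": "notes_val.csv"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clinical-finetune",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # lets the Trainer pad batches dynamically
)

trainer.train()
print(trainer.evaluate())
```

The sketch deliberately leaves out everything that is actually hard: the labeled data, the prospective validation, the regulatory pathway, and the deployment into a clinical workflow. That is the asymmetry the paragraph above describes.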

The second kind of pitch, the foundation model wrapper, is more durable in some respects and less in others. The wrapper does not have the architectural sunk cost. It does have the existential risk that the base model provider, whether OpenAI, Anthropic, Google, or another player, will move up the stack and capture the use case directly. A startup that fine tunes a base model for a clinical workflow is, in some sense, a temporary tenant of that workflow. The longer term value lies in the workflow, integration, regulatory, and data work, not in the model itself. The open question is whether the wrapper team can build a durable layer of clinical, regulatory, and operational expertise around the model fast enough to remain valuable as the underlying model gets cheaper and more capable.

The third kind of pitch, the proprietary data play, is the most durable on bitter lesson logic. The general models, however good, need data to fine tune on for specific clinical use cases, and the safety problem at the frontier means that general capability alone is not yet sufficient for high stakes deployment. The companies that hold rare, high quality, well labeled clinical datasets, ideally with longitudinal outcomes attached, are in a position to license or sell into a market that is, by structural necessity, hungry for data. The defensibility of this kind of business is also more straightforward. Data is harder to replicate than architecture. A health system with a longitudinal dataset of a hundred thousand patients followed for ten years has something that cannot be reproduced by any amount of compute spend, and the value of that asset is increasing rather than decreasing as foundation models get better at extracting signal from it.

There are exceptions. There are healthcare AI niches where specialized models continue to make sense, often because the data is unusual enough that general models cannot easily acquire it, or because the regulatory pathway is specific enough that the specialized model has a structural defensibility advantage. The IDx-DR autonomous diabetic retinopathy system is an example. The system is specialized, the model is purpose built, the FDA pathway is De Novo, and the data is fundus images of a particular type from a particular camera. A general model can, in principle, do this task. The infrastructure to deploy it, validate it, and integrate it into primary care is the harder problem, and the specialized incumbent has a meaningful head start. The bitter lesson does not say that specialized systems always lose. It says they lose, on average, over moderate time horizons, in the categories where general methods can plausibly arrive at scale. Some categories are protected by regulatory moats, data scarcity, or deployment complexity. Most are not.

What the bitter lesson does not say

This piece would be incomplete without a careful accounting of what the bitter lesson does not say. The claims that follow are not part of Sutton’s essay, do not follow from the historical pattern, and should not be inferred from the foregoing argument.

The bitter lesson does not say that human expertise is irrelevant to healthcare AI. It says that human expertise embedded into model architectures tends to be outcompeted by general methods. Human expertise embedded into data labeling, deployment workflows, regulatory strategy, clinical integration, and post deployment monitoring is, if anything, more important in the foundation model era than it was before. A foundation model fine tuned for clinical use without expert input on what to fine tune for, on what data, evaluated by what metrics, will produce confident plausible looking output that fails in ways the developers did not anticipate. The bitter lesson is about model architecture. It is not about ignoring clinicians.

The bitter lesson does not say that foundation models are safe to deploy in clinical contexts as they stand. The documented failure modes of large language models in healthcare include hallucination, confident generation of plausible but incorrect clinical recommendations, bias inherited from training data, performance variability across patient subgroups, and brittleness to prompt changes. A comparative benchmarking study posted to medRxiv in December of 2025 evaluated leading reasoning models on clinical decision making and safety constraints and found that even the highest performing systems incurred significant safety penalties on specific cases, including pharmacological recommendations that violated documented contraindications. The April 2026 frontier models are more capable than the models evaluated in that study, but as the AA-Omniscience hallucination data make clear, more capable does not yet mean more reliable. A model that scores ninety four percent on PhD level science questions and answers confidently when wrong eighty six percent of the time is not, in any straightforward sense, ready to make autonomous clinical decisions. The validation work that healthcare AI requires applies to foundation model based products as fully as to specialized model based products. Possibly more so, given that general models are deployed across a wider range of use cases for which they may not have been specifically validated.

The bitter lesson does not say that the current generation of foundation models will turn out to be the right base for healthcare. The next generation will be better. The generation after that will be better still. A healthcare AI strategy that bets its architecture on the specific structure of any one current model would already be looking outdated. The right level of abstraction for thinking about the lesson is not “use this model” but “build systems that benefit from improvements in general models without being tightly coupled to any particular one.”
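One way to make that abstraction concrete is to put a thin, neutral interface between the clinical application and whatever model currently sits behind it, so that swapping the base model is a configuration change rather than a rearchitecture. The sketch below is illustrative only: the provider class is a stub, not any real vendor SDK, and the interface shape is an assumption about how such a seam might look.

```python
# A minimal sketch of decoupling an application from any particular base model.
# The provider class is an illustrative stub, not a real vendor SDK call; in a
# real system each adapter would wrap that vendor's actual client library.
from typing import Protocol


class ClinicalLanguageModel(Protocol):
    """The only surface the rest of the application is allowed to depend on."""

    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...


class StubFrontierModel:
    """Placeholder adapter; a real adapter would call a vendor API here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return f"[{self.name}] response to: {prompt[:40]}..."


def summarize_discharge_note(model: ClinicalLanguageModel, note: str) -> str:
    # Application logic depends only on the interface, never on a vendor class,
    # so upgrading the underlying model does not touch this function.
    prompt = f"Summarize the following discharge note for a primary care handoff:\n{note}"
    return model.complete(prompt)


if __name__ == "__main__":
    # Swapping the base model is a one line change at the composition root.
    current_model = StubFrontierModel("general-base-model-v1")
    print(summarize_discharge_note(current_model, "Patient admitted with..."))
```

The design choice is the point, not the code: the durable value sits in the prompts, the evaluation harnesses, and the workflow around the interface, all of which survive the next base model upgrade.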

The bitter lesson does not say that all healthcare AI startups are doomed. It says that startups built around the assumption that domain specific architecture is the moat are betting against a strong historical pattern. Startups built around clinical data, deployment infrastructure, regulatory expertise, safety verification, and the orchestration of general models into useful clinical workflows are betting with the pattern. The distribution of outcomes in the field over the next decade will likely reflect this asymmetry.

The reader’s stance

For the reader who wants to evaluate healthcare AI claims with any reliability, the bitter lesson offers a working forecast tool. When a company presents its specialized model as the source of its competitive advantage, ask what happens when a general model fine tuned for the same task arrives. When a company presents its expert encoded clinical reasoning as the source of its accuracy, ask what happens when a general model that learned clinical reasoning as a byproduct of learning everything else is benchmarked against it. When a company says that its system is built on years of medical expertise, ask whether that expertise is embedded in the architecture, where it is fragile to the bitter lesson, or in the data, deployment, and integration layers, where it is durable.

The questions do not yield certain answers. The bitter lesson is a pattern, not a law. Healthcare is a domain with regulatory, ethical, and operational complications that the AI subfields where the lesson has been most clearly demonstrated did not face. The application of the pattern to medicine has now begun in earnest, and the timing of how it plays out in particular categories will remain uncertain for some years. But the direction of the pattern is clear enough to be useful, and the reader who carries the questions into every claim a healthcare AI company makes will be better positioned than the reader who does not.

In The Literature Is a Debate, Not a Record, we wrote that the scientific literature is a debate under varying incentive structures rather than a record of facts. The same framing applies to the healthcare AI market. The pitches a startup makes to investors, the marketing claims it puts on its homepage, the conference presentations it gives, the peer reviewed publications it sponsors, are all moves in an argument the company is making about its own durability. The bitter lesson is a tool for reading those moves. A specialized clinical AI company that does not address the foundation model question, in 2026, is making a particular kind of bet, and the reader who notices the unaddressed question has caught the structural shape of the bet.

A healthier ecosystem would do this work in public. Investors would ask the question. Journalists would ask the question. Hospital procurement teams would ask the question. Patients, indirectly, would benefit. The companies that answered it well would prosper. The companies that did not would fail more cleanly and earlier, with less collateral damage to patients, employees, and the broader credibility of the field. The bitter lesson is not, in itself, bad news. It is the structure of an opportunity. The reader who has absorbed it is operating with information that most of the market does not yet have, and that information is, in this publication’s view, the foundation of the verification intelligence that healthcare discovery exists to build.

The work of separating the durable healthcare AI claims from the ones the bitter lesson predicts will fade is the work of the next decade. We intend to do that work in these pages. The reader who reads alongside us, with the seventy year pattern in mind, will see a different field than the field sees itself, and will be better positioned to recognize, among the noise of an industry whose model releases now arrive every six weeks, the products and the companies that are building healthcare AI on the right side of the curve.

Sources and further reading

Sutton R. The Bitter Lesson. Personal website, incompleteideas.net, March 13, 2019.

Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint. March 2023, arXiv:2303.13375.

Singhal K, Tu T, Gottweis J, et al. Toward expert level medical question answering with large language models. Nature Medicine. 2024.

Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172 to 180.

Wang S, et al. Capabilities of GPT-5 on multimodal medical reasoning. arXiv preprint. August 2025, arXiv:2508.08224.

OpenAI. Introducing GPT-5.5. Company announcement and system card, April 23, 2026.

OpenAI. Introducing GPT-5.2. Company announcement, December 2025.

Anthropic. Claude Opus 4.7 model card and benchmark disclosures, April 16, 2026.

Google DeepMind. Gemini 3.1 Pro release documentation, February 19, 2026.

Artificial Analysis. Intelligence Index aggregate frontier model rankings, accessed May 2026.

Vellum, Build Fast With AI, and other independent third party benchmark aggregators reporting on GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro comparative performance, April through May 2026.

Comparative benchmarking of contemporary language models on clinical reasoning and safety constraints. medRxiv preprint, December 2025.

AA-Omniscience benchmark hallucination rate data for frontier models, April 2026, as reported in independent third party benchmark aggregators.

Comprehensive third party MedQA, GPQA Diamond, Humanity’s Last Exam, FrontierMath, and Terminal-Bench leaderboards including vals.ai and LM Council aggregator, accessed May 2026.

Ross C. IBM’s Watson supercomputer recommended unsafe and incorrect cancer treatments, internal documents show. STAT News. July 25, 2018.

Strickland E. How IBM Watson overpromised and underdelivered on AI health care. IEEE Spectrum. April 2, 2019.

Coverage of Babylon Health’s collapse, including Financial Times, Sifted, STAT News, and compilations of the Chapter 7 filing in August 2023 and UK administration in September 2023.

The Lancet correspondence and analysis of Babylon Health’s diagnostic accuracy claims, 2018 through 2021.

For Sutton’s broader theoretical framework, see Sutton R, Barto AG, Reinforcement Learning: An Introduction, MIT Press, second edition 2018.
