The doctor's blind spot
Artificial intelligence is spreading fast through health care. Making sure it does not inherit medicine's old prejudices will require something decidedly low-tech: human judgment
The numbers are impressive, and they are accelerating. By the end of 2025, America’s Food and Drug Administration had authorized some 1,450 medical devices powered by artificial intelligence, roughly double the count just three years earlier, with a record 295 cleared in 2025 alone. Three-quarters of them sit in radiology departments, reading mammograms and flagging suspicious lesions on CT scans. Others monitor cardiac rhythms, predict which patients are likely to deteriorate, and, increasingly, listen in on doctor-patient conversations to generate clinical notes. The European Union, meanwhile, is beginning to enforce its AI Act, which classifies most health-care algorithms as “high risk” and subjects them to stiff compliance requirements starting in August 2026.
Yet for all this momentum, a nagging question trails behind the technology like a clinical shadow: when an algorithm is trained on data shaped by decades of unequal access, biased referral patterns, and socially determined spending, can it do anything other than reproduce the injustices baked into the system that generated that data? A study published in April in Social Science and Medicine, led by Courtney Lyles of the University of California, Davis, argues that the answer depends on whether anyone bothers to ask the question, and that “anyone” must include far more than data scientists.
The Lyles study, a collaboration among UC researchers, Northeastern University, and Google, proposes what it calls a human-centered approach to evaluating explainable AI (XAI). Explainable AI refers to tools that peel back the workings of a model, showing which variables drove a particular prediction. The premise is straightforward: transparency alone is not enough. An XAI dashboard might reveal that an algorithm weighs certain features heavily, but only a human with relevant contextual knowledge can determine whether those features reflect genuine clinical signal or are merely proxies for structural inequality.
To test this, the researchers assembled an interdisciplinary panel (clinicians, epidemiologists, behavioral scientists, engineers, and data scientists) and tasked them with scrutinizing the outputs of an XAI model applied to medical imaging. The panel’s job was not to audit the code but to interpret the patterns the model surfaced, asking a set of deceptively simple questions. Could a given pattern be an artifact of differences in the dataset? Might a result be linked not to biology but to the way patients from different demographic groups interact with medical devices? Does a finding reflect a social or structural issue masquerading as a medical one?
They found that this kind of scrutiny reliably uncovered what the researchers call “shortcut features”: patterns that look clinically meaningful to an algorithm but actually reflect bias in the underlying data. The panel’s diversity was essential. An engineer might spot a data-pipeline anomaly; a social scientist might recognize that a variable correlated with race or poverty was doing the predictive heavy lifting; a clinician might note that the model’s logic made no sense given how a disease actually progresses. The study also recommends including community members and patient advocates, whose lived experience offers insight that credentialed experts may lack.
The importance of this kind of oversight is not merely theoretical. In 2019 Ziad Obermeyer of UC Berkeley and colleagues published a landmark study in Science that examined a commercial algorithm, built by Optum (a unit of UnitedHealth Group) and used by hospitals and insurers across America to identify patients with complex health needs who should receive additional care. Algorithms of this type are applied to roughly 200 million people in America each year. Obermeyer’s team showed that the one it studied was systematically biased against Black patients.
The mechanism was insidious precisely because it was indirect. The algorithm did not use race as an input variable. Instead, it used past health-care spending as a proxy for health needs, a seemingly reasonable choice. But because Black patients in America historically incur lower health-care costs (owing to barriers of access, insurance gaps, and provider behavior, not to lesser illness), the algorithm consistently assigned them lower risk scores than equally sick white patients. The researchers estimated that this single design choice cut the number of Black patients flagged for extra care by more than half. When the team reformulated the algorithm to predict actual health outcomes rather than spending, the racial bias largely disappeared.
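The proxy mechanism can be made concrete with a toy calculation. The sketch below is an illustration only, not the Optum model or the researchers' code: the patients, the group labels "A" and "B", the spending figures, and the helper names are all invented. It assumes two groups with identical illness levels, one of which spends 40% less per condition, and compares who gets flagged when patients are ranked by spending with who gets flagged when they are ranked by need.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str               # hypothetical label, used only to audit the result
    chronic_conditions: int  # stand-in for true health need
    annual_spending: float   # stand-in for the cost proxy


def flag_top(patients, score, share=0.5):
    """Return the patients whose score falls in the top `share` of the list."""
    ranked = sorted(patients, key=score, reverse=True)
    return ranked[: int(len(ranked) * share)]


def flagged_by_group(flagged):
    """Count flagged patients per group."""
    counts = {}
    for p in flagged:
        counts[p.group] = counts.get(p.group, 0) + 1
    return counts


# Invented toy data: both groups are equally sick, but group B spends
# 40% less per condition, standing in for barriers to access.
patients = []
for conditions in range(1, 5):
    patients.append(Patient("A", conditions, 1000.0 * conditions))
    patients.append(Patient("B", conditions, 600.0 * conditions))

by_cost = flagged_by_group(flag_top(patients, lambda p: p.annual_spending))
by_need = flagged_by_group(flag_top(patients, lambda p: p.chronic_conditions))

print(by_cost)  # {'A': 3, 'B': 1}
print(by_need)  # {'A': 2, 'B': 2}
```

In this toy population, ranking by spending flags three patients from group A and one from group B, while ranking by need flags two from each. Group membership is never an input to either score; the gap comes entirely from the choice of what to predict.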
The Optum case has become a canonical example in the field, cited in virtually every subsequent discussion of algorithmic fairness in medicine. But its lesson is worth restating, because it illuminates the exact problem the Lyles study seeks to address. The bias did not arise from malice or incompetence. It arose from the choice of a convenient proxy, one that encoded the consequences of structural racism without anyone intending it to. Detecting that kind of bias requires understanding not just how the model works, but how the social world that produced the training data works.
A separate strand of evidence highlights an adjacent risk: the creeping automation of clinical documentation. AI-powered scribes, tools that listen to doctor-patient conversations and generate clinical notes, have been adopted by roughly 30% of physician practices, according to a 2025 analysis published in npj Digital Medicine. UC Davis itself launched such a program in 2024 and found, in a pilot study published in the Journal of Medical Informatics Research, that the AI-generated notes were generally of high quality, with about 95% free from significant errors.
But a 5% error rate in clinical documentation is not the same as a 5% error rate on a quiz. A cross-sectional evaluation presented at the 2026 American College of Physicians’ Internal Medicine Meeting in San Francisco found that AI-scribe notes consistently scored lower than human-written notes across several quality domains, including thoroughness, organization, and usefulness. The researchers, from the University of Washington, concluded that while AI scribes can reduce documentation burden, they cannot yet be treated as a substitute for human review. Columbia University researchers have warned that the speed of adoption has outpaced validation and regulatory oversight.
Here the parallel with the bias question is instructive. The appeal of AI scribes is efficiency: freeing physicians from the administrative burden that consumes nearly half their working hours. But efficiency purchased at the cost of accuracy introduces risk. A misplaced decimal in a medication dose, an omitted allergy, a hallucinated family history: any of these could cascade into real patient harm. And unlike a biased risk-prediction algorithm, whose errors accumulate silently across populations, a documentation error can detonate in a single clinical encounter.
What, then, is to be done? The Lyles study offers a procedural answer: assemble interdisciplinary teams, not as an afterthought but as a structural feature of AI development and deployment. The study’s practical recommendation is that health systems establish standing expert panels, comprising clinicians, data scientists, social scientists, and community representatives, to review XAI outputs before algorithms are rolled into clinical workflows. This is not an audit conducted once before launch and then forgotten; it is a continuous process of interpretation and recalibration, grounded in the recognition that the context in which an algorithm operates is never static.
UC Davis Health provides something of a test case. Its AI governance committee, led by Jason Adams, has been reviewing AI models for several years. A separate initiative, led by Reshma Gupta, has developed a process for systematically evaluating bias in predictive models used for hospital readmission, an approach that considers patient subgroups at each stage of development and deployment. And the UC S.O.L.V.E. Health Tech initiative, which Lyles co-directs, brings together UC researchers from Davis, Berkeley, and San Francisco with private digital-health companies to build equity into product design from the outset.
These are encouraging institutional responses, but they remain voluntary and localized. Regulators are beginning to move. The FDA’s evolving framework for AI-enabled medical devices now emphasizes lifecycle management, bias transparency, and the principle that manufacturers should demonstrate their products are “secure by design.” The EU AI Act goes further, imposing mandatory risk management, technical documentation, data-quality requirements, and human-oversight obligations on providers of high-risk AI systems, a category that encompasses most clinical AI. From August 2027, full compliance will be required for AI systems embedded in medical devices marketed in the European Union.
Yet regulatory frameworks, however well-designed, are inherently retrospective: they set minimum standards and punish failures after the fact. The deeper challenge is cultural. The technology industry’s instinct is to move fast and optimize for scale. Medicine’s ethic, at least in principle, is primum non nocere: first, do no harm. These two instincts collide whenever a hospital purchases a commercial algorithm, integrates it into its electronic health record, and begins using it to triage patients or flag risks. The algorithm arrives as a black box; the institution deploys it under time pressure; clinicians, already stretched thin, lack the bandwidth to interrogate its assumptions.
These tensions are not confined to American research hospitals or European regulators. They surface, often in starker form, in emerging markets where health-tech companies are building AI into clinical operations from scratch. Axenya, a Brazilian health-technology company that operates in occupational health, benefits management, and insurance, offers a case in point. The company uses predictive analytics and AI across its operations, from clinical navigation to claims-cost forecasting. Its leadership has spent considerable time wrestling with precisely the proxy problem that Obermeyer’s Optum study exposed.
Consider the question Axenya faced when designing its population-health dashboards for corporate clients. Employers want granular data on their workforce’s health to manage costs and design better programs. But under Brazil’s LGPD (its data-protection law, broadly analogous to Europe’s GDPR), individual health data is classified as sensitive and subject to severe restrictions. More to the point, sharing individual-level health information with HR departments, even in the name of “health management,” creates the risk that the data will migrate from clinical context to employment decisions: hiring, promotion, termination. The power asymmetry between employer and employee means that consent, even when formally obtained, is legally suspect.
Axenya’s working answer was to position itself as a trusted intermediary: its clinical team accesses the granular data necessary to operate care and feed predictive models, but delivers to employers only aggregated intelligence (risk stratifications, trend lines, and program recommendations), never individual diagnoses or treatment records. The output is actionable for management (contract a mental-health partner, redesign the benefits plan, launch a prevention campaign) without exposing any single employee to scrutiny. In regulatory terms, this is a data-governance architecture. In practical terms, it is a decision about what kind of proxy to offer, and a recognition that the wrong proxy can cause harm even when no one intends it.
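One way to picture an intermediary of this kind is as a reporting function that takes individual records in and lets only group-level counts out. The sketch below is a guess at the general shape, not a description of Axenya's system. The field names are hypothetical, and the minimum group size of five is an added assumption (a common safeguard against singling out members of a small team) that the article does not mention.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # hypothetical threshold, not taken from the article


def employer_report(records, min_group_size=MIN_GROUP_SIZE):
    """Collapse individual records into per-department risk-tier counts.

    Departments with fewer than `min_group_size` people are suppressed so
    that a small team cannot be used to infer one person's status.
    """
    by_department = defaultdict(list)
    for record in records:
        by_department[record["department"]].append(record["risk_tier"])

    report = {}
    for department, tiers in by_department.items():
        if len(tiers) < min_group_size:
            report[department] = "suppressed"
            continue
        counts = {}
        for tier in tiers:
            counts[tier] = counts.get(tier, 0) + 1
        report[department] = {"headcount": len(tiers), "tiers": counts}
    return report


# Invented records: six people in operations, two in legal.
records = (
    [{"employee_id": i, "department": "operations", "risk_tier": "low"} for i in range(4)]
    + [{"employee_id": i, "department": "operations", "risk_tier": "high"} for i in range(4, 6)]
    + [{"employee_id": i, "department": "legal", "risk_tier": "high"} for i in range(6, 8)]
)

report = employer_report(records)
print(report)
```

The employer sees that operations has six people, two of them high-risk, and that legal is too small to report. No employee identifier or individual record appears in the output.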
The company has also adopted a tiered framework for AI decision-making that echoes the Lyles study’s emphasis on human oversight. Low-risk outputs, such as educational nudges and routine alerts, are fully automated. Mid-tier interventions, like therapy-adherence recommendations, are proposed by AI but reviewed by a clinician before reaching the patient. High-stakes decisions (diagnoses, medication changes, and major contract terms) remain human-led, with AI serving only as support. Each tier has an explicit delegation charter defining what the algorithm can decide on its own and where a human must intervene.
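A delegation charter of this sort amounts to a lookup table from kinds of output to permitted actions. The sketch below shows one possible encoding, using the examples named above. The identifiers, the action names, and the rule that unrecognized outputs fall into the strictest tier are all assumptions made for illustration, not details of Axenya's charter.

```python
from enum import Enum


class Tier(Enum):
    LOW = "low"    # e.g. educational nudges, routine alerts
    MID = "mid"    # e.g. therapy-adherence recommendations
    HIGH = "high"  # e.g. diagnoses, medication changes, major contract terms


# Hypothetical charter: what may happen to an AI output in each tier.
CHARTER = {
    Tier.LOW: "send_automatically",
    Tier.MID: "queue_for_clinician_review",
    Tier.HIGH: "human_decides_ai_supports",
}

# Hypothetical mapping from output type to tier.
OUTPUT_TIERS = {
    "educational_nudge": Tier.LOW,
    "routine_alert": Tier.LOW,
    "therapy_adherence_recommendation": Tier.MID,
    "diagnosis": Tier.HIGH,
    "medication_change": Tier.HIGH,
    "contract_terms": Tier.HIGH,
}


def route(output_type):
    """Return the action the charter allows for an AI output.

    Unknown output types default to the most restrictive tier, which is an
    assumption of this sketch rather than something the article describes.
    """
    tier = OUTPUT_TIERS.get(output_type, Tier.HIGH)
    return CHARTER[tier]


print(route("routine_alert"))                     # send_automatically
print(route("therapy_adherence_recommendation"))  # queue_for_clinician_review
print(route("medication_change"))                 # human_decides_ai_supports
```

Writing the charter down as data rather than scattering it through application logic has one practical advantage: a review panel can read and amend it without reading the code that enforces it.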
This is not a framework imposed by regulation (Brazil has no equivalent of the EU AI Act for health-care AI, at least not yet). It is a self-imposed discipline, motivated partly by liability awareness and partly by the recognition that in a market where structured longitudinal health data barely exists at population scale, the quality of the data feeding the models is uneven, and the consequences of a biased output can be both clinically and legally severe. When Axenya evaluated a third-party vendor offering AI-powered mental-health monitoring through facial-expression analysis, its internal assessment concluded that the underlying science was too weak and the ethical risks too high to proceed, precisely the kind of human judgment call that no algorithm can make on its own behalf.
What the Lyles study ultimately argues for is a form of institutional humility, a recognition that AI systems, however sophisticated their architecture, are only as good as the data they ingest and the questions humans think to ask of them. The Obermeyer study showed what happens when nobody asks why an algorithm uses spending as a proxy for sickness. The AI-scribe literature shows what happens when efficiency is prized over accuracy. The common thread is not that the technology is flawed by nature, but that deploying it without sustained, interdisciplinary human oversight converts latent risk into realized harm.
This is, in one sense, an old lesson dressed in new clothing. Medicine has always relied on the judgment of trained humans to interpret data, weigh competing evidence, and make decisions under uncertainty. The arrival of powerful algorithms does not eliminate that need; it intensifies it. An algorithm can process a chest X-ray in milliseconds, but it cannot know that the patient in front of it lives in a neighborhood without a pharmacy, or that the training dataset underrepresented people who look like her, or that the variable most predictive of her outcome is one the model was never taught to consider.
The question for health systems, regulators, and the technology companies building these tools is whether they are willing to invest in the unglamorous, resource-intensive work of human review (the interdisciplinary panels, the community engagement, the continuous monitoring) or whether they will treat AI as a labor-saving device and move on to the next deployment. The Lyles study makes a persuasive case that the former path, though slower and more expensive, is the only one consistent with safe and equitable care.
The machines are getting smarter. The question is whether the institutions deploying them will be wise enough to keep watching.


