Precision and Ethics: The Dual Mandate for AI in Healthcare
AI varies from simple algorithms to complex learning systems. In healthcare, recognizing these differences is key to leveraging its benefits and addressing its limitations effectively.
The current landscape of artificial intelligence (AI) is often compared to a modern-day gold rush, characterized by a surge in the emergence of AI startups and companies. With rapid advancements in machine learning, natural language processing, and robotics, investors and entrepreneurs are eager to capitalize on the transformative potential of AI technologies. This has resulted in a crowded field, with new companies appearing in sectors as diverse as healthcare, finance, transportation, and customer service. The race to harness AI is driven by the promise of efficiency gains, novel solutions to complex problems, and the potential for significant economic returns. However, this rapid expansion also raises concerns about ethical implications and the long-term impacts on employment and privacy. As such, while the AI gold rush offers tremendous opportunities, it also requires careful navigation and regulation.
AI companies can generally be categorized into two distinct types based on their approach to innovation and application. The first type focuses on creating novel AI technologies to enhance or improve existing purposes. This often involves deep technical work by engineers who aim to push the boundaries of what AI can achieve within traditional domains, such as improving algorithmic efficiency or developing more sophisticated data processing techniques. The second type of AI company applies existing AI technologies in novel and unexpected ways. Here, product-oriented individuals excel, finding unique applications and markets where AI can offer new solutions or disrupt established practices. This dichotomy can be summarized as "I didn't realize AI could do that!" versus "I didn't realize you could use AI for that!" While engineers often found the first type of company, aiming to advance the technology itself, product people often found the second type, focusing on leveraging technology in innovative ways. Both approaches are invaluable; the first drives technological progress, while the second expands the practical applications and accessibility of AI across various industries.
The first approach is self-explanatory, but the second requires further explanation. While using existing AI for novel purposes might sound straightforward, it often involves a deep understanding of both the technology and the new domain it is being applied to. Product people excel in this area by identifying gaps or needs in markets not traditionally associated with AI. They creatively repurpose AI tools to solve unique problems, introduce efficiencies, or even create entirely new user experiences. This can include everything from using natural language processing to enhance interactive customer support in retail to deploying machine learning models to predict maintenance needs in manufacturing. This type of innovation doesn't just adapt AI into different settings; it often redefines what we expect from both the technology and the industries it invades, opening up numerous opportunities for businesses and consumers alike.
One particular application that warrants a closer look is the use of generative AI in healthcare as a "health co-pilot." This tool assists both healthcare providers and patients in enhancing decision-making, diagnostics, and patient management through AI-driven insights and recommendations. Here are several ways in which generative AI serves as a co-pilot in healthcare settings:
AI-Powered Medical Consultation Apps - These apps provide medical consultations based on a user's personal medical history and general medical knowledge. They act as health assistants, offering personalized feedback and guidance on symptoms and suggesting when professional care is needed.
Mental Health Chatbots - Many mental health platforms integrate generative AI to power chatbots that provide conversational support. These AI chatbots serve as therapeutic aids, offering initial counseling, helping to manage anxiety and depression, and guiding users through cognitive behavioral therapy exercises.
Symptom Analysis and Guidance Apps - These apps use AI to help users understand their symptoms and suggest next steps, such as self-care strategies or visiting a doctor. The AI interacts with users to collect health information and generates reports that can be used during clinical consultations, enhancing their efficiency.
AI Virtual Assistants in Clinical Settings - These AI systems assist healthcare providers by automating clinical workflows. They can understand spoken commands and manage tasks such as documenting medical records, ordering tests, and handling patient data, freeing up time for direct patient care.
However, what happens when the tools you are using are not designed for the purpose you are applying them to? This mismatch can lead to several challenges. Firstly, there might be inefficiencies or inaccuracies in results because the AI's original algorithms aren't optimized for the new context. For instance, an AI developed for recognizing patterns in financial data might struggle with interpreting medical imaging, leading to errors or slower processing times. Moreover, adapting these tools often requires additional customization or even significant modifications to the underlying AI models, which can be costly and technically demanding. There's also a risk of unexpected behavior from the AI when applied outside its intended scope, which can raise ethical and safety concerns. Thus, while repurposing AI offers exciting opportunities, it also demands careful consideration of the tool’s capabilities and limitations to ensure it can effectively and safely meet the new demands.
The application of generative AI (GenAI) technologies in healthcare illustrates this common challenge: using tools that were not originally designed for specific medical purposes. GenAI systems, typically developed for broad applications such as language processing, image recognition, or data analysis, are adapted for healthcare tasks like diagnosing diseases, interpreting medical images, or predicting patient outcomes. This repurposing can lead to issues of fit and functionality. For instance, AI models trained on general datasets may not account for the nuanced variations in medical data or fail to meet the high accuracy requirements critical in medical diagnostics. There are also concerns about the ethical implications, as these general models may not have been developed with the privacy-sensitive designs that healthcare data demands. Additionally, without specific tailoring, GenAI can perpetuate existing biases or introduce new ones, affecting the fairness and efficacy of medical care. Hence, while there are advantages to leveraging GenAI in healthcare due to its versatility and capacity to handle large datasets, it requires significant modifications and vigilant regulatory oversight to ensure it is adapted safely and effectively for medical applications.
As the frontier of artificial intelligence continues to expand, startups are innovating with AI-driven applications designed to aid clinicians and streamline medical documentation processes. But despite the rapid integration of these technologies, there is a palpable sense of caution and skepticism regarding the readiness of generative AI for healthcare applications. The industry faces significant hurdles, including AI's current limitations in managing complex medical queries and emergencies.
Moreover, the application of generative AI raises ethical concerns, particularly the risk of perpetuating existing biases. Research indicates that AI systems, like ChatGPT, can inadvertently reinforce inaccurate and harmful stereotypes, potentially leading to misdiagnosis and unequal healthcare delivery. These issues are compounded in populations that traditionally suffer from healthcare disparities and might rely more heavily on AI for healthcare services.
Andrew Borkowski, chief AI officer at the VA Sunshine Healthcare Network, the U.S. Department of Veterans Affairs’ largest health system, thinks the skepticism is warranted. Borkowski warned that generative AI’s deployment could be premature due to its “significant” limitations and the concerns around its efficacy. “One of the key issues with generative AI is its inability to handle complex medical queries or emergencies,” he told the publication TechCrunch in 2024. “Its finite knowledge base — that is, the absence of up-to-date clinical information — and lack of human expertise make it unsuitable for providing comprehensive medical advice or treatment recommendations.”
Numerous datapoints lend weight to these assertions.
In a study conducted by Long Island University (LIU) in Brooklyn, New York, roughly 75% of drug-related responses from ChatGPT, the GenAI chatbot released by OpenAI of San Francisco in late 2022, were deemed incomplete or incorrect when reviewed by pharmacists. In some instances, ChatGPT provided "inaccurate responses that could endanger patients," according to a press release from the American Society of Health-System Pharmacists (ASHP) in Bethesda, Maryland. The study also found that ChatGPT generated "fake citations" when asked for references to back up certain responses. Of the 39 questions posed to ChatGPT, the research team judged only 10 responses "satisfactory" based on their criteria.
Models tend to perform better when they are trained on cases, given images, and asked closed, discrete questions rather than general open-ended ones. A Research Letter published in JAMA in early 2024 compared the results of several prevalent LLMs on two tests, one designed by the New England Journal of Medicine (NEJM) and the other by JAMA itself. Both tests comprised a case description, a medical image, and a question such as "What would you do next?" (JAMA) or "What is the diagnosis?" (NEJM), with 4 (JAMA) or 5 (NEJM) answer choices. Even in this constrained setting, the models' scores ranged from the high 40s to the low 80s percent, with most models scoring around 50%. That is better than mere guessing but not yet reliable enough to trust with one's health.
One particularly harmful way generative AI in healthcare can get things wrong is by perpetuating stereotypes. In a 2023 study out of Stanford Medicine, a team of researchers tested ChatGPT and other generative AI–powered chatbots on questions about kidney function, lung capacity and skin thickness. Not only were ChatGPT’s answers frequently wrong, the co-authors found, but some of them reinforced long-held, untrue beliefs that there are biological differences between Black and white people — untruths that are known to have led medical providers to misdiagnose health problems. The irony is that the patients most likely to be discriminated against by generative AI for healthcare are also those most likely to use it. People who lack healthcare coverage — people of color, by and large, according to a KFF study — are more willing to try generative AI for things like finding a doctor or mental health support, a Deloitte survey showed. If the AI’s recommendations are marred by bias, it could exacerbate inequalities in treatment.
Moreover, there exists a fundamental challenge with generative AI (GenAI) that might render it perpetually unsuitable for certain healthcare applications: its inherent lack of auditability. Neural networks, the backbone of many AI systems, are trained through exposure to vast datasets, adjusting their internal parameters in ways that are not transparent or easily understandable. This process effectively makes them "black boxes"—systems where the internal decision-making logic is obscured and inaccessible.
In practice, this means that while we can input data into these systems and receive outputs, there is no straightforward way to understand the exact reasoning behind these outputs. The decisions made by neural networks are based on complex, layered interactions within the network that are not visible to users or developers. Consequently, even if the outputs are consistently accurate or reasonable, there is no guarantee that this performance will sustain under different conditions or with new types of data. This poses a significant risk in healthcare, where decisions often have life-or-death consequences.
The reliance on outputs that appear correct based on past performance can be misleading. Over time, as the AI is exposed to new and varied scenarios, there is no assurance that the patterns it has learned will apply universally. This limitation is critical in a field like healthcare, which requires not only precision but also the ability to adapt and respond to unique individual circumstances.
Moreover, the lack of transparency in AI decision-making complicates regulatory compliance and accountability. Healthcare decisions require rigorous validation and traceability to ensure they meet established ethical and safety standards. In situations where AI fails or produces erroneous results, the inability to audit the decision-making process hinders efforts to determine the cause and rectify the system.
This opaqueness necessitates the development of more interpretable AI models or the integration of mechanisms that can enhance transparency and auditability. Until such advancements are realized, the suitability of GenAI for critical healthcare applications remains a contentious issue.
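To see concretely what auditability means at the model level, the short Python sketch below (scikit-learn on synthetic data; the feature names and numbers are invented, so this is an illustration rather than a clinical model) contrasts an interpretable logistic regression, whose per-feature weights a reviewer can inspect and challenge, with a small neural network that offers no comparable human-readable account of its decisions.

```python
# Illustrative sketch only: synthetic data and invented feature names,
# not a clinical model. Shows why interpretability matters for audit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
features = ["age", "systolic_bp", "hba1c"]          # hypothetical inputs
X = rng.normal(size=(500, 3))
y = (0.8 * X[:, 0] + 1.2 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

glass_box = LogisticRegression().fit(X, y)
black_box = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)

# The linear model yields one coefficient per named feature: a reviewer can
# see why a prediction leans one way and compare it with clinical knowledge.
for name, coef in zip(features, glass_box.coef_[0]):
    print(f"{name}: weight {coef:+.2f}")

# The neural network exposes only raw weight matrices; the path from input
# to output is spread across thousands of parameters with no feature-level
# explanation to audit.
print("MLP weight parameters:", sum(w.size for w in black_box.coefs_))
```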
These limitations of large language models are chiefly due to their architecture. Essentially, these models produce statistical continuations of their training data, drawing on a vector representation that is built up and implicitly compressed during the learning process; they serve, in effect, as optimized interfaces around this compressed store. However, as with all statistics, some information is discarded, making the representation lossy rather than lossless. Details degrade, much as a photo does when it is repeatedly compressed in the JPEG format. While this may not be a problem for certain applications, different architectures may be needed to overcome these limitations for healthcare uses.
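The repeatedly-compressed-photo analogy can be made tangible in a few lines of Python (a sketch using Pillow and NumPy on a synthetic image; the quality setting and iteration counts are arbitrary): every lossy re-encoding discards detail that no later step can restore.

```python
# Sketch of JPEG "generation loss": re-encoding a lossy representation
# repeatedly degrades it; the discarded detail never comes back.
import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)

for generations in (1, 5, 20):
    current = Image.fromarray(original)
    for _ in range(generations):
        buf = io.BytesIO()
        current.save(buf, format="JPEG", quality=60)  # lossy re-encode
        buf.seek(0)
        current = Image.open(buf).convert("RGB")
    error = np.abs(np.asarray(current).astype(int) - original.astype(int)).mean()
    print(f"after {generations:2d} re-encodings, mean pixel error: {error:.1f}")
```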
The conversation around the application of generative AI in healthcare is complex. While there are promising uses, particularly in automating routine tasks and analyzing vast datasets, the technology's full potential is hamstrung by technical, ethical, and regulatory challenges. Ensuring patient safety, improving model accuracy, and maintaining rigorous scientific validation remain paramount as the healthcare industry cautiously navigates this new technological era. As the World Health Organization suggests, implementing robust auditing, transparency, and impact assessments are crucial steps toward responsible AI integration in healthcare.
So how do we use AI safely and responsibly? To answer this question, we first need to understand what we mean when we talk about AI. Artificial Intelligence (AI), Machine Learning, and Neural Networks are terms frequently used in tech discussions, but do they all refer to the same thing?
Artificial Intelligence (AI) has become an integral part of our daily lives. Broadly, AI refers to the capability of machines or computer programs to mimic human intelligence. However, what exactly is Machine Learning? What is a neural network? And how does Deep Learning fit into all of this? Are they synonyms?
These terms are interconnected yet refer to specific aspects of the same field. From the general to the specific, Artificial Intelligence encompasses Machine Learning, which in turn includes neural networks, within which lies Deep Learning. Let’s delve deeper.
Artificial Intelligence is the broad concept of creating innovative and intelligent machines. Machine Learning is how computers learn from data. It represents the intersection of computer science and statistics, employing algorithms to perform specific tasks without explicit programming. Instead, these algorithms recognize patterns in data and make predictions when new data is presented. The learning process can be supervised or unsupervised, depending on the data used to train the algorithms.
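To make the supervised/unsupervised distinction concrete, the short Python sketch below (scikit-learn on toy synthetic data; all numbers are invented) shows a supervised classifier learning from labeled examples next to an unsupervised clustering step that must find structure on its own.

```python
# Toy contrast between supervised and unsupervised learning on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),      # group A
               rng.normal(4, 1, size=(100, 2))])     # group B
y = np.array([0] * 100 + [1] * 100)                  # labels (known outcomes)

# Supervised: the algorithm is shown inputs *and* the correct answers.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: the algorithm sees only the inputs and must find structure itself.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```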
Deep Learning is perhaps the most intricate concept. The algorithms used in Deep Learning are generally a type of Machine Learning (which, in turn, is a type of AI). Rather than focusing on the deep learning vs. machine learning dichotomy, it is more useful to focus on the unique features of deep learning within the context of machine learning. These features include its neural network algorithmic structure, reduced need for human intervention, and the need for more extensive data.
Traditional Machine Learning algorithms have a relatively simple structure, such as a linear regression or a decision tree model. In contrast, deep learning models are based on an artificial neural network. These neural networks have multiple layers and, like human brains, are complex and interconnected through nodes (loosely equivalent to human neurons). Additionally, with deep learning models, feature extraction is automatic, and the algorithm learns from its own errors instead of relying on a software engineer to adjust and manually extract features. These models also require a much larger amount of data. For instance, ChatGPT was trained using internet text databases, including some 570 GB of data from books, web texts, Wikipedia, and more; roughly 300 billion words were fed into the system. Through training, the model arrived at about 175 billion parameters on its own, something clearly impossible for a data engineer to set by hand.
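The difference in scale is visible even in miniature. The arithmetic below (plain Python; the layer sizes are invented for illustration) counts the learnable parameters of a single linear model against those of a small multi-layer network on the same inputs; scaling the same idea to many wide layers is how models reach the billions of parameters mentioned above.

```python
# Counting learnable parameters: a linear model vs. a small multi-layer network.
# Layer sizes are arbitrary, chosen only to show how depth multiplies parameters.

n_inputs, n_outputs = 100, 1

# Linear/logistic regression: one weight per input, plus a bias.
linear_params = n_inputs * n_outputs + n_outputs
print("linear model parameters:", linear_params)          # 101

# A small "deep" network: 100 -> 512 -> 512 -> 512 -> 1, weights + biases per layer.
layers = [n_inputs, 512, 512, 512, n_outputs]
deep_params = sum(a * b + b for a, b in zip(layers[:-1], layers[1:]))
print("small deep network parameters:", deep_params)      # roughly 577,000
```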
As for Neural Networks, they refer to networks of nodes, or artificial neurons, loosely inspired by the neural networks that make up the human brain. A neural network functions somewhat like a chain of biological neurons that receive, transform, and pass along information. It operates on numeric patterns, typically represented as vectors, and its primary function is to classify and categorize data based on similarities. A significant advantage of a neural network is that it can adapt to changing patterns without having to be manually re-tuned for every new input; its training can be supervised or unsupervised. In other words, neural networks can underpin either conventional Machine Learning or Deep Learning, depending on their depth and the level of human intervention in setting parameters.
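To ground that picture, here is a minimal NumPy forward pass (the weights are random placeholders, not a trained model): an input vector flows through one layer of hidden nodes and comes out as scores over two made-up categories.

```python
# Minimal forward pass through a one-hidden-layer network. Weights are random
# placeholders; in a real network they would be adjusted during training.
import numpy as np

rng = np.random.default_rng(7)
x = np.array([0.2, -1.3, 0.7, 1.1])          # an input pattern as a vector

W1 = rng.normal(size=(4, 8))                  # input layer -> 8 hidden "neurons"
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 2))                  # hidden layer -> 2 category scores
b2 = np.zeros(2)

hidden = np.maximum(0, x @ W1 + b1)           # each node sums its inputs, then ReLU
scores = hidden @ W2 + b2
probs = np.exp(scores) / np.exp(scores).sum() # softmax: scores -> probabilities
print("category probabilities:", probs.round(3))
```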
To better understand, imagine these concepts as a series of Matryoshka dolls, the traditional Russian nesting dolls that fit one inside the other. The largest doll, containing all the others, represents Artificial Intelligence (AI). It is the broadest field encompassing all other concepts. The next smaller doll is Machine Learning. This fits perfectly within AI, as it is a subcategory that describes how computers can learn from data without necessarily being explicitly programmed to perform a specific task. The third doll, even smaller, represents Neural Networks. These are computing models inspired by the human brain, and are one of the key components that make Machine Learning possible. Finally, the smallest and youngest doll in the group is Deep Learning. This is a type of Machine Learning that utilizes deep neural networks, i.e., with many layers, to learn from large amounts of data.
There’s no silver bullet for creating AI applications appropriate for a domain: every field of application has its own technical, legal, and even cultural challenges and expectations. Blindly applying a solution that worked in one domain to another without understanding their differences can be a shortcut to disaster.
This is especially true of healthcare. The archetypal domain for which most contemporary AI solutions were developed is:
Fault-tolerant: errors are acceptable as long as average performance is good enough
Well instrumented: the available data describes everything you need to know about the situation
Poorly understood: systems were developed to optimize a black box process, or one for which you didn’t lose much by seeing it as such
Unregulated or lightly regulated: spam filtering, a historically significant early application, faced no legal regulation or audit requirements
Healthcare is a domain where the opposite requirements and assumptions apply:
In many healthcare contexts, getting things wrong can get somebody killed. Relative to the complexity of the task, medicine might be the domain in which faults are least tolerated by regulators, the industry, and the public. The fundamental oath of medical practitioners is "first, do no harm." That is not how most AI solutions are engineered.
The human body is enormously complex, and even in the ideal setting of a hospital with advanced diagnostic equipment, doctors have to do a lot of educated guessing as to what is going on. The kind of data available for healthcare at scale is simultaneously absolutely necessary and laughably insufficient on its own to understand, model, or predict health trajectories as if they were generic black boxes.
At the same time we do know a lot about medicine. Not remotely as much as we would want to — there are basic aspects of function and disease that are still under active research — but there are literally thousands of years of cumulative observation and experimentation that have, with an uneven pace, blind spots, and dead ends, built a corpus of knowledge that no feasible amount of data and processing can hope to recapitulate.
There are few domains of human activity more tightly regulated and audited than healthcare. It’s not enough to solve a problem: it must be solved in ways that comply with multiple overlapping, always-shifting, location-dependent institutional and legal requirements, and that compliance must be demonstrable.
None of these peculiarities of the healthcare domain imply that we can’t or shouldn’t use AI: an activity as important and complex as medicine demands that we do. It does mean that we cannot simply reuse existing technology and processes from other domains with different constraints and requirements. Healthcare is difficult, but what has killed, and will continue to kill, healthcare AI projects is a failure to appreciate just how different it is.
It’s a process without shortcuts or simple solutions. There are general rules that must be followed, though — not heuristics to win but the minimal rules to play:
Build as much of the technology and the product as you can. There isn’t at the moment an ecosystem of modern AI components well suited for healthcare, so, for now, to get it right you have to do it yourself.
The foundation of your product’s intelligence must be existing medical practice, not data. This doesn’t forbid the use of data-first AI technologies, but it does put constraints on the overall architecture of a system. The usual meta-architecture of “use data to train a black box and then push new data through it to choose what to do” doesn’t work.
Auditability must be a design principle, not just a feature. Clear and understandable explanations of everything that was done are a requirement for gaining the trust of regulators, practitioners, and the community.
Privacy, too, is an absolute requirement. Most jurisdictions have legal requirements to protect health information, and hacks and leaks can have far more serious repercussions for both the company and the people under its care than in other industries.
Always be wary of how generalizable your data sets, and even your medical algorithms, are. Different human groups have different contexts, constraints, and needs, and in a domain where failures carry such a large human and organizational cost, it’s important to build technological and process guardrails to monitor for biases of all sorts (including the well-intentioned bias of applying a solution that is fair and effective in one context to another in which it isn’t); see the sketch after this list for one such guardrail.
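One concrete guardrail of this kind, sketched below in Python (all data, subgroup labels, and thresholds are synthetic placeholders for whatever a real deployment would monitor), is to routinely break model performance out by population subgroup and flag gaps before they reach patients.

```python
# Sketch of a subgroup performance audit. All data and group labels here are
# synthetic placeholders; a real audit would use held-out clinical data and
# clinically meaningful subgroups and metrics.
import numpy as np

rng = np.random.default_rng(1)
groups = rng.choice(["group_a", "group_b", "group_c"], size=2000)
y_true = rng.integers(0, 2, size=2000)
# Simulate a model that is systematically worse on one subgroup.
error_rate = np.where(groups == "group_c", 0.35, 0.10)
y_pred = np.where(rng.random(2000) < error_rate, 1 - y_true, y_true)

MAX_GAP = 0.05  # arbitrary tolerance, for illustration only
overall = (y_pred == y_true).mean()

for g in np.unique(groups):
    acc = (y_pred[groups == g] == y_true[groups == g]).mean()
    flag = "  <-- investigate" if overall - acc > MAX_GAP else ""
    print(f"{g}: accuracy {acc:.2f}{flag}")
```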
One implication of those requirements is that AI solutions will by necessity have to be hybrid. There’s no single AI technology that offers simultaneously all the cognitive power we want to deploy and all the careful constraints that we need to respect. Architectures need to mix technologies with an eye towards maximizing power within the constraints of safety, auditability, and privacy - this is different from the usual One Big Model approach in AI. The order of preference among technologies also runs opposite to their “coolness factor:” Well-known explicit rules are preferable to trained models unless the latter are demonstrably better and safer, hand-crafted and fitted models are better than large black boxes, and so on. In practice, there’s always a place for even the most cutting-edge AI technology: specific sub-problems and activities in healthcare where powerful AI models can be safely unleashed to their fullest capabilities. The challenge in getting AI right in healthcare is to build an architecture, and just as importantly, an organizational culture, that makes it possible to safely exploit them as part of an overall system that leverages and respects the knowledge and rules proper to the practice of medicine.
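As a rough sketch of what such a hybrid, audit-first architecture can look like in code (Python; every rule, threshold, and the stub model below are invented placeholders, not medical guidance), explicit clinical rules take precedence, a statistical model is consulted only within a narrow, validated scope, and every step is written to an audit trail:

```python
# Illustrative hybrid pipeline: explicit rules first, a constrained model second,
# and an audit record for every decision. All rules, thresholds, and the model
# itself are invented placeholders, not clinical guidance.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    action: str
    reason: str
    audit_trail: list = field(default_factory=list)

    def log(self, step: str) -> None:
        self.audit_trail.append(f"{datetime.now(timezone.utc).isoformat()} {step}")

def risk_model_score(patient: dict) -> float:
    """Placeholder for a trained model used only on a narrow sub-problem."""
    return 0.3  # stub value for the sketch

def triage(patient: dict) -> Decision:
    decision = Decision(action="", reason="")
    decision.log(f"input received: {sorted(patient)}")

    # 1. Explicit, reviewable rules always take precedence over the model.
    if patient["systolic_bp"] < 90:                      # hypothetical rule
        decision.action, decision.reason = "escalate_to_clinician", "rule: hypotension"
        decision.log("matched explicit rule: systolic_bp < 90")
        return decision

    # 2. The model is consulted only inside its validated scope.
    score = risk_model_score(patient)
    decision.log(f"risk model consulted, score={score:.2f}")
    decision.action = "routine_follow_up" if score < 0.5 else "flag_for_review"
    decision.reason = f"model score {score:.2f} (within validated scope)"
    return decision

result = triage({"systolic_bp": 84, "age": 71})
print(result.action, "|", result.reason)
print(*result.audit_trail, sep="\n")
```

The point is not the specific rules but the shape: the model is boxed in by explicit medical knowledge, and the audit trail exists by construction rather than as an afterthought.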