Setting the Rules of the Road for AI Healthcare

We wouldn't license a physician who could diagnose accurately but couldn't communicate with patients. So why deploy AI systems that way? Our new Nature Medicine paper proposes a framework for evaluating clinical AI that draws on lessons from autonomous vehicles and goes beyond diagnostic accuracy to assess both 'cure' and 'care' behaviours.

Dr Nick de Pennington, Founder & CEO
Assisted by AI: Claude Opus 4.5

Should artificial clinical intelligence systems meet the same professional standards as human clinicians? If so, who sets them?

Until recently, this wasn't an issue. Clinical AI systems were software: code, algorithms. They sat squarely under the purview of medical device regulations. That's not to say those involved in developing, assessing, and implementing them weren't actively debating best approaches, but those debates tended to focus on the technicalities of training and testing datasets.

Then ChatGPT happened.

Suddenly, people around the world saw how abstract machine learning models could communicate with them directly in their daily lives. They also saw how easily they could be deceived by, influenced by, and become reliant on the results of billions of matrix multiplications occurring in data centres thousands of miles away.

In healthcare this has profound implications. Of course we're concerned with the technical performance of large language models: do they correctly diagnose a disease from a given set of symptoms? (Google's AMIE system claimed greater diagnostic accuracy than primary care physicians.) But I'm sure all clinicians will agree that how an AI communicates a diagnosis, and what a patient feels as a result of sharing their symptoms, are just as important.

New AI models are released every week, each with new claims of performance. The headline evaluations are narrow: excellence in academic reasoning, advanced mathematics, or software coding. OpenAI's recent healthcare announcements showcased performance in specific scenarios. But when AI moves from analysing data to conducting clinical visits, existing evaluation frameworks don't quite fit.

It's a problem we think about a lot. Last week, my colleague Dr Ernest Lim, other members of the Ufonia team, and collaborators from the University of York, Moorfields Eye Hospital, the London School of Hygiene and Tropical Medicine, and the Singapore Eye Research Institute published a paper in Nature Medicine. It proposes a framework for thinking about this problem, one that borrows from extensive prior research in another high-reliability industry: transportation.

What Autonomous Vehicles Can Teach Healthcare

In the early days of self-driving cars, that industry faced a similar challenge. Everyone saw the opportunity to increase safety, and knew crashing was bad, but no one agreed on what "safe driving" actually meant. Without standardised definitions for core behaviours (lane-keeping, merging, obstacle avoidance), comparing systems or setting regulatory requirements was nearly impossible.

The breakthrough came when the industry developed behavioural taxonomies: shared classifications of what a vehicle needs to do, regardless of which company built it. The Society of Automotive Engineers created SAE J3016, which categorises levels of driving automation. ISO 26262 addresses hazards from component failures. ISO 21448 tackles the harder problem of functional insufficiencies: what happens when the system works as designed but the design doesn't account for reality.

These frameworks give manufacturers such as Waymo and Wayve a common language. Regulators can set consistent benchmarks. Consumers can understand what they're trusting.

Healthcare has no equivalent for AI consultations.

The Missing Framework

Current evaluations of clinical AI focus on what's easy to measure: diagnostic accuracy, answer relevance, alignment with clinical guidelines. These matter. But they capture only part of what makes a consultation effective.

Anyone who has practised medicine knows that clinical encounters involve two intertwined dimensions. There's the technical work: gathering information, generating diagnoses, providing advice. And there's the relational work: recognising distress, building trust, ensuring understanding, responding to emotion.

Medical education has long recognised this distinction. The Calgary-Cambridge model breaks consultations into stages while emphasising relationship-building throughout. Observational systems like the Roter Interaction Analysis System code dozens of discrete behaviours spanning both dimensions. We train human clinicians to do both. We assess them on both. Our professional societies, accreditors, and legal jurisdictions hold us to both.

Yet when AI systems are evaluated today, it is typically only on the technical dimension.

Cure and Care

Our paper proposes a two-layer behavioural classification for AI consultations.

The first layer comprises what we call "cure" behaviours: the instrumental, technical tasks of clinical work. Information gathering. Diagnostic reasoning. Information provision. Emergency detection.

The second layer comprises "care" behaviours: the relational skills that make consultations therapeutic rather than merely transactional. Establishing rapport. Demonstrating empathy. Ensuring clarity. Responding appropriately to emotional cues. These behaviours are harder to specify and measure, but they're not optional extras. Most clinicians are now aware that research confirms the link between patient-centred communication and health outcomes.

The framework we propose isn't just descriptive. We suggest concrete evaluation scenarios and metrics for each category. A simulated patient phones in for post-operative symptom assessment; the AI must recall predefined clinical red flags while also eliciting the patient's ideas, concerns, and expectations. A patient expresses worry; the AI must respond to the emotional cue before continuing with clinical questions. A patient with low health literacy needs a different type of explanation; we measure comprehension through clarification and 'teach-back'.
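
To make that concrete, here is a minimal sketch of how such a scenario could be encoded for automated evaluation. It is purely illustrative: the class names, behaviour labels, and scoring are assumptions for this post, not the metrics defined in the paper.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the classes, behaviours, and scoring are
# hypothetical assumptions, not the taxonomy specified in the paper.

@dataclass
class BehaviourCheck:
    dimension: str   # "cure" or "care"
    behaviour: str   # e.g. "red-flag recall", "responding to worry"
    passed: bool
    notes: str = ""

@dataclass
class ConsultationScenario:
    name: str
    clinical_context: str    # e.g. "routine post-operative follow-up"
    patient_profile: dict    # e.g. health literacy, age, language
    required_behaviours: list = field(default_factory=list)

def evaluate(checks):
    """Summarise cure and care performance separately, not as one blended score."""
    summary = {}
    for dimension in ("cure", "care"):
        relevant = [c for c in checks if c.dimension == dimension]
        summary[dimension] = sum(c.passed for c in relevant) / max(len(relevant), 1)
    return summary

# A simulated post-operative call, scored on both dimensions
scenario = ConsultationScenario(
    name="post-op cataract symptom check",
    clinical_context="routine recovery with rare but serious complications",
    patient_profile={"health_literacy": "low", "age": 78},
    required_behaviours=["red-flag recall", "teach-back", "responding to worry"],
)
checks = [
    BehaviourCheck("cure", "red-flag recall", passed=True),
    BehaviourCheck("care", "teach-back", passed=True),
    BehaviourCheck("care", "responding to worry", passed=False,
                   notes="continued questioning without acknowledging distress"),
]
print(scenario.name, evaluate(checks))  # {'cure': 1.0, 'care': 0.5}
```

Keeping the two dimensions as separate scores, rather than blending them into one number, is the point: a system can excel at 'cure' while failing at 'care', and the evaluation should surface that.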

Context Matters

Not every consultation is the same, and the framework accounts for this through what we call "contextual modulators."

Clinical context determines which behaviours matter most. A routine post-operative checkup might seem simple, but its safety hinges on detecting subtle signs of serious complications against a background of expected recovery. A new cancer diagnosis places enormous demands on care behaviours, but its cure requirements (the clinical information to convey) may be highly structured and predictable.

This challenges simplistic assumptions about which consultations are "easy" to automate. The answer isn't determined by emotional intensity alone.

Patient factors also shape requirements. How an AI demonstrates empathy or explains a care plan must be effective across different cultural backgrounds, health literacy levels, and age groups. An AI that performs well with one patient population may fail with another.
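
As a rough illustration of how contextual modulators might work in practice, the sketch below maps a clinical context and patient profile to the behaviours a scenario should test. The contexts, behaviour lists, and the health-literacy rule are hypothetical examples, not prescriptions from the paper.

```python
# Hypothetical sketch: contexts, behaviours, and rules are illustrative only.

CONTEXT_PRIORITIES = {
    "post_operative_follow_up": {
        "cure": ["red-flag detection", "structured symptom review"],
        "care": ["reassurance about expected recovery"],
    },
    "new_cancer_diagnosis": {
        "cure": ["structured information provision"],
        "care": ["responding to distress", "checking understanding", "pacing disclosure"],
    },
}

def required_behaviours(context: str, patient_profile: dict) -> dict:
    """Select behaviours to test, adjusted for patient factors such as health literacy."""
    priorities = {dim: list(items) for dim, items in CONTEXT_PRIORITIES[context].items()}
    if patient_profile.get("health_literacy") == "low":
        priorities["care"].append("teach-back to confirm comprehension")
    return priorities

print(required_behaviours("new_cancer_diagnosis", {"health_literacy": "low"}))
```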

The Road Ahead

At Ufonia, we've conducted over 150,000 autonomous patient conversations across the NHS. The framework we propose isn't theoretical for us. It's a formalisation of principles we've been applying operationally.

Care behaviours are genuinely hard to specify and measure. What counts as appropriate empathy varies by context and culture. Rapport is easier to recognise than to operationalise. We need insights from disciplines beyond medicine and engineering, which is why we have active research collaborations spanning ethics and linguistics as well as computer science.

When implementing these technologies, we need to bring together insights from not only the IT department but also the professional credentialing committee. We need to consider how these systems should be validated against professional standards of communication, not just algorithmic performance. This may mean combining oversight from medical device regulators (the FDA and MHRA) with professional regulators (state medical boards and the GMC). We wouldn't license a physician who could diagnose accurately but couldn't communicate with patients. We shouldn't deploy AI systems that way either.

But we can't simply add another layer of bureaucracy that slows adoption. Technology is moving fast. AI systems already outperform doctors on some narrow tasks; we need them implemented so that physicians can meet the growing demand for care, and so that access and experience improve for patients.

The automotive industry's safety standards took time to develop. We should build on that experience rather than reinvent the wheel! This paper is our attempt to kickstart that conversation.


The full paper, "Building a code of conduct for AI-driven clinical consultations", is available in Nature Medicine. For more on how Ufonia approaches clinical AI safety, explore our Science page. If you work in the NHS, please contact us for more information on how we can deploy with you. In the US, sign up to our Launch Partner Programme to lead the way in using AI to transform your practice.