Now live in ophthalmology

AutonomousClinical Safety

Our approach to designing, evaluating, and assuring autonomous clinical AI.

Dell Medical School logo
Moorfields Eye Hospital NHS Foundation Trust logo
Newcastle University logo
University College London logo
University of Oxford logo
University of York logo

Healthcare needs its own "highway code" for clinical AI.

Just as self-driving cars have a highway code to handle merging and emergency stops safely, clinical AI needs a clear safety framework. We set out this vision in Nature Medicine and we're now building the tools to make it operational in real systems.

Fragmented Evaluation

Clinical consultations are not just about accurate diagnoses or asking the next right question. Safety depends on handling uncertainty, balancing being helpful and being safe, and building a relationship, all in a dynamic conversation with a real person.

However, most current benchmarks score isolated sub-tasks. But harm often occurs between these tasks: missed red flags, unsafe reassurance, poor escalation, or advice that sounds fluent but isn't clinically grounded.

Without a whole-consultation view, it's hard to build a system that patients, clinicians and regulators can trust, and difficult to improve systems responsibly.

What's missing:
a holistic view
Diagnostic Accuracy
Note Summarisation
Management Planning
Question Answering
Empathic Communication
Fragmented Evaluations
History Taking
Question Answering
Treatment Planning
Empathy & Rapport
Active Listening
Shared Decisions
CureCare
SAFE

Evaluating the Whole Consultation

At Ufonia, we've spent years deploying AI that talks to real patients. This experience taught us that safety requires evaluating complete clinical behaviours — not just isolated tasks.

The 'Cure' (Technical Competence)

Does the AI take a correct history, identify red flags, and triage correctly? This is the medical logic.

The 'Care' (Relational Competence)

Does the AI listen actively, show empathy, and explain clearly? This is the human connection.

Why Scalable
Evaluation Matters

Evaluating clinical AI's safety has long involved a trade-off between scalability and clinical relevance:

  • 1.
    Automated metrics — Developed for general AI models, scale easily, but fail to capture clinical nuance or real-world clinical safety.
  • 2.
    Human expert review — Clinicians manually checking AI responses. Provides depth and credibility, but is variable, prohibitively slow and expensive to run continuously.

Our safety frameworks (ASTRID and MATRIX), align automated evaluation with expert clinical judgement, enabling meaningful, scalable testing without reducing safety to simplistic metrics.

Scalability
Clinical Relevance

Generic Auto Scoring

Not clinically relevant

ASTRID /
MATRIX

Automated & Clinically Validated

Ad-hoc Checks

"Eye-balling" systems

Clinical Human Eval

Difficult to scale and inconsistent

Safety depends on
multiple behaviours working together.

We evaluate each of these behaviours independently, using purpose-built and clinically validated safety frameworks inspired by assurance methods from other high-stakes industries such as self-driving cars.

Dora AI

Dora

Autonomous Consultations

Voice Understanding

LLM-based evaluators calibrated to align with clinician judgement assess Dora's "hearing" (transcription) behaviour at scale.

ASTRID

Evaluating Context, Refusal, and Faithfulness to prevent hallucinations.

MATRIX

Structured simulation of hazardous scenarios to stress-test clinical history taking.

Safety through large-scale clinical-simulation

Clinical consultations are safety-critical. A single missed red flag or misplaced reassurance can cause harm — even when individual answers appear clinically correct.

In autonomous driving, safety is earned through exposure: millions of miles driven across diverse and hazardous conditions. For clinical AI, the equivalent is experience in conversation. MATRIX evaluates dialogue agents across thousands of minutes of simulated clinical dialogue, exposing systems to the situations that matter for patient safety before they interact with real patients.

Catalogue
Hazards
Simulate
Simulate
Conversations
Evaluate
Safety
MATRIX - Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

How MATRIX Works

A closed-loop simulation framework that runs thousands of realistic clinical conversations to uncover hazardous behaviours.

Safety Taxonomy

Structured map of patient input types, expected behaviours, and 40+ hazardous scenarios.

PatBot — Simulated Patient

LLM-driven patient persona that can display anxiety, confusion, or even derail conversations.

BehvJudge — Automated Safety Auditor

Reviews transcripts and flags unsafe behaviours. Matches/exceeds clinician hazard detection.

1. CONFIG/2. SIM/3. EVAL

What MATRIX Reveals

  • Frontier Models Fail: GPT-4 and Claude 3 Opus missed 12-15% of critical red-flag emergencies in our benchmarks.
  • Misplaced Reassurance: Pure LLMs frequently reassured patients who needed urgent care.
  • Dora's Safety: By using the MATRIX feedback loop, Dora achieves >98% pass rate across 40 hazardous scenarios.

MATRIX is used to evaluate Dora over thousands of minutes of conversation. Even though standard models don't pass, Dora's hybrid LLM and Deterministic system allows it perform safely.

Regulatory Alignment

  • Built on ISO 14971 & SaMD safety principles
  • Provides traceable, auditable safety evidence regulators expect

Clinician-Aligned Safety Auditing

BehvJudge closely matches clinician ratings

Evaluating Clinical Question Answering

Answering patient questions sounds simple - but in healthcare, it's one of the hardest things for AI to do safely. Patient questions are open-ended, ambiguous, and deeply dependent on clinical context. A response can sound fluent while being dangerous.

Standard AI metrics fall short here. Designed for general chatbots, they focus on surface-level similarity and miss key risks like hallucinated advice or failure to refuse unsafe queries. While human review is the gold standard, it is too slow and expensive to run at scale.

Patient"just one question I do have a slight shadow in my left eye..."

Agent

Harmful Response

Example of AI hallucinating medical advice against clinical consensus.

Question Answering Agent Architecture

To reduce hallucination risk, we use Retrieval Augmented Generation (RAG): where, the agent pulls from approved knowledge sources and guidelines before it answers.

Patient Question
Knowledge
Source
+
Safety
Guidelines
Dora
Dora
Grounded Response

Grounding in Action: The model retrieves relevant context from your specific data before generating a single word.

Assuring Safe Communication

ASTRID is a safety-driven evaluation framework designed specifically for clinical question-answering, combining RAG with advanced metrics to provide a comprehensive safety assessment.

It goes beyond standard metrics by triangulating safety across three critical dimensions: Refusal Accuracy (knowing when to stay silent), Context Relevance (using the right data), and Conversational Faithfulness (sticking to the source).

These signals are calibrated to align with clinician judgement, so we can evaluate safely at scale.

Astrid -An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Conversational Faithfulness

Is the information actually grounded in approved clinical sources?

Context Relevance

Did the system retrieve the right clinical knowledge for this situation?

Refusal Accuracy

Did it correctly decline to answer when it should?

Our Approach to Science

Ufonia invests heavily in research. We believe that clinical AI requires more than just engineering—it demands a rigorous scientific foundation.

A cross-disciplinary team of clinicians, safety scientists, AI research engineers, and regulatory experts come together to ensure our systems are safe, effective, and equitable. We don't just build models; we validate them through prospective studies and real-world deployments.

Our methods are published at top-tier AI research venues such as ACL and NeurIPS, and our real-world clinical impacts have been shared regularly at leading global conferences including ASCRS, ARVO, ESCRS, AECOS, AAO, and the Royal College of Ophthalmologists Annual Congress.

Ufonia Team

Building the safety layer
for autonomous medicine

We publish the evaluation frameworks and safety philosophy that turn autonomous consultations from a demo into something you can trust, audit, and deploy responsibly.

If you're a clinician thinking deeply about safety, or an AI researcher working on evaluation and assurance, we'd love to collaborate.