Curious Now

TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Health & Medicine · Mind & Behavior

Key takeaway

Researchers developed an evaluation system to ensure mental health chatbots provide safe and clinically accurate support, addressing concerns about the risks of relying on AI for sensitive healthcare needs.

Quick Explainer

TherapyGym addresses the need for clinically grounded evaluation and optimization of therapy chatbots. It uses an LLM-based judge, validated against expert-annotated dialogues, to assess how faithfully a chatbot applies therapy techniques and how safe its responses are. The judge then serves as a reward model for fine-tuning the therapist language model, improving its adherence to evidence-based therapy practices while reducing the risk of harmful behaviors. This approach enables scalable, continuous improvement of therapy chatbots, moving beyond generic fluency metrics to the clinical dimensions critical for effective psychotherapy.

Deep Dive

Technical Deep Dive: TherapyGym

Overview

TherapyGym is a framework for evaluating and improving the fidelity and safety of therapy chatbots powered by large language models (LLMs). It addresses the challenge that prevailing chatbot evaluation metrics fail to capture the clinical dimensions central to effective psychotherapy.

Problem & Context

  • Current chatbot evaluation methods focus on generic fluency and coherence, but do not assess the clinically critical dimensions of psychotherapy, such as adherence to evidence-based techniques (fidelity) and avoidance of harmful behaviors (safety).
  • Therapy chatbots are increasingly used for mental health support, but their quality and safety are far less rigorously assessed than those of human therapists.
  • Effective therapy involves complex, multi-turn interactions, making continuous human scoring of chatbots infeasible at scale.

Methodology

TherapyGym has two main components:

Evaluator Suite

  • TherapyJudgeBench: A set of 116 expert-annotated dialogues used to validate an LLM-based evaluator, TherapyJudge, that assesses CBT fidelity (via the Cognitive Therapy Rating Scale, CTRS) and safety (via a taxonomy of therapy-specific risks).
  • TherapyJudge achieves moderate-to-strong agreement with expert raters (Spearman ρ ≈ 0.56), demonstrating its ability to recover clinician-level signal.
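Validating a judge against expert raters, as TherapyJudgeBench does, amounts to computing the Spearman rank correlation between the judge's scores and the experts' scores on the same dialogues. A minimal self-contained sketch (the toy scores below are illustrative, not the paper's data; Spearman ρ is implemented from scratch to avoid external dependencies):

```python
def rank(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy expert vs. judge CTRS scores for a handful of dialogues (illustrative).
expert = [4, 2, 5, 1, 3, 4]
judge = [3, 2, 5, 2, 3, 4]
print(f"Spearman rho = {spearman_rho(expert, judge):.2f}")
```

In the paper's setup, the same computation is run over the 116 expert-annotated dialogues, yielding the reported ρ ≈ 0.56.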

Alignment Module

  • Uses TherapyJudge as a reward model to fine-tune LLM therapists via reinforcement learning (Group Relative Policy Optimization, GRPO).
  • Trains on 13,093 diverse patient profiles, generated and augmented using TreeSynth.
  • Optimizes for both CTRS-based skill scores and safety violation rates.
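The core of this loop can be sketched as follows. The `combined_reward` weighting and the toy judge scores are assumptions for illustration, not the paper's actual reward configuration; only the group-relative advantage normalization follows GRPO's standard form.

```python
def combined_reward(ctrs_score, safety_violation, penalty=1.0):
    """Illustrative reward: CTRS skill score minus a penalty
    whenever the judge flags a safety violation."""
    return ctrs_score - (penalty if safety_violation else 0.0)

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one prompt's
    group of sampled responses (zero mean, unit std)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# One simulated-patient prompt, four sampled therapist responses, each
# scored by the judge as (CTRS skill in [0, 1], safety-violation flag).
group = [(0.6, False), (0.2, True), (0.5, False), (0.1, False)]
rewards = [combined_reward(c, v) for c, v in group]
for r, a in zip(rewards, grpo_advantages(rewards)):
    print(f"reward={r:+.2f}  advantage={a:+.2f}")
```

Responses that score well on CTRS without safety flags get positive advantages and are reinforced; the flagged response receives a strongly negative advantage, which is how the policy learns to suppress unsafe behavior.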

Results

  • GRPO fine-tuning improves average CTRS skill scores from 0.10 to 0.60 (human raters) and 0.16 to 0.59 (LLM judge).
  • Safety violations decrease from 0.38 to 0.20 (human raters).
  • Qualitative analysis shows the trained therapist exhibits more CTRS-aligned behaviors than the untrained model, such as agenda setting, identification of automatic thoughts, and concrete homework assignment.
  • Performance improves across LLM scales, with larger models (Qwen3-4B) scoring higher than smaller ones (Qwen3-1.7B).

Limitations & Uncertainties

  • Focus is on CBT; expansion to other therapy modalities (ACT, DBT) is future work.
  • Evaluation and training based on simulated patients; need to validate on real-world outcomes.
  • LLM-based judges may have biases; further work is needed to fully characterize and mitigate these.

What Comes Next

  • Longitudinal studies to assess real-world impacts of TherapyGym-trained chatbots.
  • Expansion to multilingual settings and diverse patient populations.
  • Incorporation of additional safety dimensions beyond the current taxonomy.
  • Exploration of multi-modal (text, voice, video) therapy chatbot capabilities.
