CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Health & Medicine · Artificial Intelligence

Key takeaway

A new AI-based tool called CRIMSON evaluates the quality of automatically generated radiology reports, scoring how clinically accurate and relevant to patient care they are.

Quick Explainer

CRIMSON is a framework for evaluating the quality and accuracy of machine-generated radiology reports. It goes beyond prior metrics by incorporating detailed clinical context, classifying errors based on their potential impact on patient care, and assigning severity-weighted scores. CRIMSON first extracts findings from the candidate and reference reports, assigning clinical significance to each finding. It then identifies different types of errors, such as false findings and missing or inaccurate attributes, and computes a final score that reflects the balance of correct findings and clinically consequential mistakes. This severity-aware approach allows CRIMSON to better align automated evaluation with real-world radiologist reasoning, providing a more meaningful assessment of report quality.

Deep Dive

Technical Deep Dive: CRIMSON - A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Overview

CRIMSON is a new evaluation framework for assessing the quality and accuracy of machine-generated radiology reports. Unlike previous metrics, CRIMSON incorporates detailed clinical context and models the severity of different types of errors based on their impact on patient care.

The key innovations of CRIMSON are:

  • Clinical Context Modeling: CRIMSON evaluates reports in the context of patient age, indication, and radiological guidelines to determine the clinical significance of each finding.
  • Comprehensive Error Taxonomy: CRIMSON categorizes errors into a detailed taxonomy covering false findings, missing findings, and attribute-level errors (e.g., location, severity, measurement).
  • Severity-Aware Scoring: CRIMSON assigns severity weights to errors based on their potential clinical impact, prioritizing clinically consequential mistakes over benign discrepancies.
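The severity-aware idea can be sketched as a simple weight table. The four significance labels below come from the paper's taxonomy; the numeric weights and the `error_penalty` helper are illustrative assumptions, not the authors' calibrated values.

```python
# Illustrative mapping from CRIMSON's clinical-significance labels to
# numeric severity weights. The labels are from the paper; the weight
# values themselves are made-up placeholders.
SEVERITY_WEIGHTS = {
    "urgent": 1.0,                 # e.g. a missed pneumothorax
    "actionable_non_urgent": 0.6,
    "non_actionable": 0.3,
    "expected_benign": 0.1,        # e.g. age-related aortic calcification
}

def error_penalty(significance: str) -> float:
    """Penalty weight for an error on a finding with this significance label."""
    return SEVERITY_WEIGHTS[significance]
```

Under any such weighting, an urgent miss dominates a benign discrepancy, which is exactly the prioritization described above.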

Problem & Context

Automated radiology report generation has advanced rapidly, but reliable evaluation remains a fundamental challenge. Prior metrics have either treated all errors as equally important or relied on a coarse binary categorization of significance. In practice, the clinical consequences of errors vary substantially: missing a life-threatening pneumothorax is far more serious than missing age-related aortic calcification.

Methodology

CRIMSON operates in three main stages:

  1. Finding Extraction and Clinical Significance Assignment: CRIMSON extracts findings from the candidate and reference reports, and assigns each finding a clinical significance label (urgent, actionable non-urgent, non-actionable, or expected/benign) based on patient context and radiological guidelines.
  2. Error Detection and Classification: CRIMSON identifies three primary error types: false findings, missing findings, and attribute-level errors (e.g., location, severity, measurement). Attribute errors are further classified as significant or negligible based on their potential clinical impact.
  3. Severity-Aware Scoring: CRIMSON computes a final score between -1 and 1 that reflects the balance of correct findings and severity-weighted errors. The score is normalized such that a perfect report scores 1, a report with no more value than a normal template scores 0, and reports with more errors than correct findings score negative.
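The scoring stage can be sketched with a small function. The normalization below is a hypothetical formula chosen to match the anchor points described above (a perfect report scores 1, a report whose weighted errors equal its weighted correct findings scores 0, and error-dominated reports score negative); the paper's exact formula may differ.

```python
def crimson_style_score(correct_weights, error_weights):
    """Hypothetical severity-weighted score in [-1, 1].

    correct_weights: severity weights of findings correctly reported.
    error_weights: severity weights of errors (false findings, missing
        findings, or attribute-level mistakes).

    This normalization is an assumption, not the paper's published formula.
    """
    c = sum(correct_weights)
    e = sum(error_weights)
    if c + e == 0:
        # Nothing to report and nothing wrong: template-level value.
        return 0.0
    return (c - e) / (c + e)
```

For example, `crimson_style_score([1.0], [1.0])` returns 0.0, matching the "no more value than a normal template" anchor, while a report with only errors scores -1.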

Data & Experimental Setup

CRIMSON was validated in three ways:

  1. Correlation with Radiologist-Annotated Errors: CRIMSON was evaluated on 50 cases from the ReXVal dataset, which were annotated by six board-certified radiologists for clinically significant errors. CRIMSON showed strong correlation with the radiologist annotations.
  2. RadJudge Clinical Judgment Test: The authors developed a suite of 30 curated, clinically challenging test cases, where multiple candidate reports were ranked by three radiologists. CRIMSON was the only metric to correctly solve all 30 cases, aligning with expert judgment.
  3. RadPref Radiologist Preference Benchmark: The authors created a benchmark of 100 cases, where radiologists provided 1-5 quality ratings for pairs of candidate reports. CRIMSON demonstrated the strongest alignment with radiologist preferences compared to prior metrics.
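The RadPref comparison can be sketched as a pairwise agreement rate: for each pair of candidate reports, does a metric prefer the report the radiologists rated higher? The function and the toy data below are hypothetical illustrations, not the paper's evaluation code.

```python
def preference_agreement(pairs):
    """Fraction of report pairs where a metric's ordering matches the
    radiologists' ordering.

    pairs: iterable of (metric_a, metric_b, rating_a, rating_b) tuples,
        where ratings are e.g. 1-5 radiologist quality scores.
    Pairs tied in either the ratings or the metric are skipped.
    """
    agree = total = 0
    for ma, mb, ra, rb in pairs:
        if ra == rb or ma == mb:
            continue  # no preference to compare
        total += 1
        if (ma > mb) == (ra > rb):
            agree += 1
    return agree / total if total else 0.0

# Toy example with made-up scores: the metric agrees with the
# radiologists on 2 of the 3 decisive pairs.
toy = [(0.8, 0.2, 5, 2), (0.1, 0.6, 2, 4), (0.9, 0.3, 1, 4)]
```

A stronger metric drives this agreement rate toward 1.0, which is the sense in which CRIMSON "aligns" with radiologist preferences on RadPref.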

Results

The key results of the CRIMSON evaluation are:

  • CRIMSON showed strong correlation (Kendall's τ = 0.61–0.71, Pearson's r = 0.71–0.84) with radiologist-annotated clinically significant error counts on the ReXVal dataset.
  • On the RadJudge clinical judgment test, CRIMSON correctly solved all 30 cases, significantly outperforming prior metrics.
  • On the RadPref radiologist preference benchmark, CRIMSON achieved the strongest alignment with radiologist quality ratings compared to existing approaches.
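The Kendall's τ figures above measure rank agreement between metric outputs and radiologist-annotated significant-error counts per case. A self-contained tau-a implementation (no tie correction, run here on synthetic data) shows the form of that check; the actual ReXVal computation may use a tie-aware variant.

```python
def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two equal-length sequences:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    assert n == len(y) and n > 1
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Note the sign convention: a severity-aware quality score should fall as the annotated error count rises, so τ against raw error counts is strongly negative, and the score is typically negated (or errors counted instead) to report a positive correlation as above.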

The authors also demonstrated that a fine-tuned version of the open-source MedGemma model, called MedGemmaCRIMSON, can closely approximate the performance of the GPT-5.2-based CRIMSON, enabling privacy-preserving local deployment.

Interpretation

CRIMSON's key innovations are its incorporation of detailed clinical context and severity-aware scoring, which allow it to better align automated evaluation with real-world radiologist reasoning. By prioritizing clinically consequential errors over benign discrepancies, CRIMSON provides a more meaningful and actionable assessment of report quality.

The strong performance of CRIMSON on the RadJudge and RadPref benchmarks suggests that it can capture nuanced clinical judgment in ways that prior metrics cannot. This is crucial for developing reliable, clinically relevant evaluation frameworks for radiology report generation systems.

Limitations & Uncertainties

A limitation of the current work is that the CRIMSON framework was developed and evaluated specifically for chest X-ray reports. Applying CRIMSON to other imaging modalities would require adapting the finding ontologies, severity criteria, and evaluation prompts to align with modality-specific clinical standards and reporting conventions.

Additionally, while CRIMSON demonstrates strong correlation with radiologist preferences, there may still be some level of subjectivity and variability in how radiologists assess report quality. The authors note that further research is needed to fully understand the extent to which automated metrics can capture the nuances of expert clinical judgment.

What Comes Next

The authors plan to extend the CRIMSON framework to support evaluation of radiology reports across a broader range of imaging modalities, beyond just chest X-rays. This will involve developing modality-specific finding taxonomies, severity rubrics, and evaluation prompts in collaboration with domain experts.

Additionally, the authors are interested in exploring how CRIMSON could be used to provide real-time feedback and guidance to radiology report generation models during training, potentially improving the clinical relevance and safety of the generated output.
