ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

Health & Medicine, Computing

Key takeaway

A new AI system can automatically categorize tumor types in cancer registries, saving hundreds of hours of manual labor annually and helping improve cancer care and tracking at the population level.

Quick Explainer

ELM is a hybrid system that combines the strengths of multiple small language models and a larger language model to accurately classify tumor types from cancer pathology reports. The small models analyze report text and make initial predictions, while the larger model is selectively used to refine challenging or uncertain classifications. This strategic combination of models allows for both high accuracy and computational efficiency, addressing the limitations of rule-based approaches that struggle with the linguistic complexity of clinical text. The key innovations are routing ambiguous cases to the more powerful language model and using prompt engineering to control its behavior, representing a novel approach to leveraging both specialized and general language understanding.

Deep Dive

Technical Deep Dive: ELM - A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

Overview

This work presents ELM (Ensemble of Language Models), a novel hybrid architecture that combines small, encoder-only language models and a large language model (LLM) to automate the classification of tumor groups from unstructured pathology reports. This task is a critical step in population-based cancer registry workflows, but current rule-based approaches struggle with the linguistic complexity of clinical text.

ELM leverages an ensemble of six fine-tuned encoder-only models to first analyze reports and make initial tumor group predictions. Cases where the ensemble lacks confidence or predicts historically challenging tumor groups are then routed to the LLM for final arbitration, guided by a carefully engineered prompt. This hybrid strategy achieves high accuracy while maintaining computational efficiency.

Methodology

The ELM architecture consists of:

An ensemble of six small, encoder-only language models:
- Three models analyze the first 512 tokens of each report, three analyze the last 512 tokens
- This captures both the beginning and the end of each report despite the encoders' 512-token input limit
An ensemble voting mechanism with a high agreement threshold to make confident predictions on straightforward cases
An LLM (Mistral Nemo Instruct-2407) that arbitrates cases where the ensemble lacks confidence or predicts a historically challenging tumor group
A prompt engineered with domain experts to constrain the LLM's output space and provide structured reasoning
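
The windowing and voting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the `min_agreement` value of 5 out of 6 are assumptions, since the summary only says the agreement threshold is "high".

```python
from collections import Counter

MAX_TOKENS = 512  # per-model input window mentioned in the article


def head_tail_windows(tokens, max_tokens=MAX_TOKENS):
    """Return the first and the last `max_tokens` tokens of a report."""
    head = tokens[:max_tokens]
    tail = tokens[-max_tokens:]  # whole report if it is shorter than the window
    return head, tail


def ensemble_vote(predictions, min_agreement=5):
    """Return (majority_label, is_confident) for one report.

    `predictions` holds one predicted label per model. `min_agreement`
    is a hypothetical threshold chosen for illustration only.
    """
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes >= min_agreement


# Example: five of six models agree, so the ensemble is treated as confident.
print(ensemble_vote(["leukemia"] * 5 + ["lymphoma"]))
```

For a report longer than 1,024 tokens, the two windows leave the middle unread. In a 1,300-token report, for instance, tokens 512 through 787 fall in neither window.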

The key innovations are:

Combining the strengths of encoder-only models and LLMs
Strategically routing only ambiguous cases to the more expensive LLM
Leveraging prompt engineering to control LLM behavior
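
To make the routing and prompt-constraint ideas concrete, here is a hedged sketch. The label list is an illustrative subset of the 19 tumor groups, the `HARD_GROUPS` set, the prompt wording, and the fallback to the ensemble label when the LLM answers outside the list are all assumptions, and `ask_llm` is a hypothetical callable standing in for whatever client calls the LLM.

```python
# Illustrative subset only; the article reports 19 tumor groups.
TUMOR_GROUPS = [
    "leukemia",
    "lymphoma",
    "non-melanoma skin cancer",
    "cervix",
    "multiple myeloma",
]
# Hypothetical set of "historically challenging" groups.
HARD_GROUPS = {"cervix", "multiple myeloma"}


def needs_llm(label, confident, hard_groups=HARD_GROUPS):
    """Route to the LLM when the ensemble is unsure or the group is hard."""
    return (not confident) or (label in hard_groups)


def build_prompt(report_text, allowed=TUMOR_GROUPS):
    """Build a prompt that restricts the answer to the allowed labels."""
    options = "\n".join(f"- {group}" for group in allowed)
    return (
        "Choose exactly one tumor group from this list:\n"
        f"{options}\n\n"
        f"Pathology report:\n{report_text}\n\n"
        "Answer with the tumor group name only."
    )


def parse_answer(raw, allowed=TUMOR_GROUPS):
    """Return the answer if it is an allowed label, otherwise None."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed else None


def classify(report_text, ensemble_label, confident, ask_llm):
    """Keep confident easy cases; send the rest to the LLM."""
    if not needs_llm(ensemble_label, confident):
        return ensemble_label
    answer = parse_answer(ask_llm(build_prompt(report_text)))
    # Assumed fallback: keep the ensemble label if the LLM leaves the list.
    return answer if answer is not None else ensemble_label
```

Validating the LLM's reply against the allowed list is one simple way to enforce a constrained output space on the application side, whatever the prompt itself manages to achieve.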

Data and Experimental Setup

Training data: 16,000 pathology reports from 2020, labeled with 19 tumor groups at the patient level
Test data: 2,058 reports from 2023-2024, labeled at the report level for unbiased evaluation
Compared to:
- eMaRC, a rule-based system currently used in many cancer registries
- LLM-only approach without the encoder-only ensemble

Results

ELM achieves a weighted average F1-score of 0.94, a statistically significant improvement over the encoder-only ensemble (0.91) and the LLM-only approach (0.88)
Particularly strong gains for challenging categories like leukemia (F1 increased from 0.76 to 0.88), lymphoma (0.76 to 0.89), and non-melanoma skin cancer (0.44 to 0.58)
Strategic routing of hard-to-classify tumor groups to the LLM further boosts performance, e.g., cervix (0.90 to 0.98) and multiple myeloma (0.87 to 0.91)
Smaller 3B LLMs provide viable alternatives when computational resources are limited, at a modest 2-3 point F1 reduction
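
For readers unfamiliar with the headline metric, a weighted average F1-score is conventionally the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below assumes that convention (the summary does not spell out the exact formula) and uses made-up toy labels, not data from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of reports
    return score


# Toy example with two classes "a" and "b".
print(weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

Because the weights follow class frequency, common tumor groups dominate this score, which is why per-group F1 values for rarer categories are worth reading alongside it.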

Interpretation

ELM represents the first successful deployment of a hybrid encoder-only/LLM architecture for tumor group classification in a real-world cancer registry setting. The approach demonstrates how a strategic combination of language models can achieve both high accuracy and operational efficiency.

Compared to rule-based systems, ELM's transformer-based models are better equipped to handle the linguistic complexity of pathology reports, capturing long-range dependencies and contextual nuances. By routing only ambiguous cases to the LLM, ELM maintains computational feasibility for high-volume registry workflows.

Limitations and Uncertainties

Generalizability to other cancer registries or healthcare systems requires external validation
Performance remains suboptimal for rare tumor types due to limited training data
The encoder-only ensemble operates as a relative black box, requiring manual review for cases where predictions are unexpected

What Comes Next

Multi-registry validation to assess generalizability
Extensions for multi-label classification to handle reports describing multiple primary cancers
Active learning strategies to continuously improve performance on rare tumor types
Exploration of advanced prompting techniques and smaller LLMs for better cost-performance trade-offs

Source

ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries
Preprint, arXiv cs.CL, 3/20/2026