Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Health & Medicine, Artificial Intelligence

Key takeaway

Researchers found that a single AI model trained on many ultrasound tasks can perform worse than specialized models, depending on how the tasks are grouped and how much training data each group has. This matters because it affects the accuracy and usefulness of AI-powered medical imaging tools.

Quick Explainer

The researchers developed a framework called M2DINO that extends the DINOv3 architecture to enable multi-organ, multi-task ultrasound analysis. M2DINO uses a shared encoder with task-conditioned Mixture-of-Experts blocks to adaptively allocate model capacity across diverse prediction tasks, such as segmentation, classification, regression, and detection. With this framework, they compare three ways of training: each task on its own, tasks trained jointly within clinical groups, and all tasks trained together. Aggregation improves performance in the data-rich obstetrics group. In the smaller breast and lung groups, clinically grouped training falls below task-specific models on average, while all-task training stays more stable.

Deep Dive

Technical Deep Dive: Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Overview

This work investigates the impact of task aggregation strategies on the performance of unified multi-organ, multi-task ultrasound foundation models. The authors introduce the M2DINO framework, which extends the DINOv3 architecture with task-conditioned Mixture-of-Experts (MoE) blocks to adaptively allocate model capacity across heterogeneous prediction tasks.
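To make task-conditioned routing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a softmax gate driven by one logit per expert for each task, and toy experts that simply rescale their input. The names `TaskConditionedMoE`, `gate_logits`, and the two example tasks are hypothetical, and this summary does not describe how M2DINO actually designs or places its experts inside DINOv3.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TaskConditionedMoE:
    """Toy MoE block whose gate depends on the task id, not on the input."""

    def __init__(self, experts: List[Callable[[Vector], Vector]],
                 gate_logits: Dict[str, List[float]]) -> None:
        # Assumed parameterization: one logit per expert for each task.
        for task, logits in gate_logits.items():
            if len(logits) != len(experts):
                raise ValueError(f"task {task!r} needs {len(experts)} logits")
        self.experts = experts
        self.gate_logits = gate_logits

    def gate(self, task: str) -> List[float]:
        """Mixing weights over experts for one task (they sum to 1)."""
        return softmax(self.gate_logits[task])

    def forward(self, features: Vector, task: str) -> Vector:
        """Weighted sum of expert outputs for shared-encoder features."""
        weights = self.gate(task)
        outputs = [expert(features) for expert in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(features))
        ]


# Two toy experts standing in for learned feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0: doubles each feature
    lambda x: [-1.0 * v for v in x],  # expert 1: negates each feature
]
moe = TaskConditionedMoE(
    experts,
    gate_logits={
        "segmentation": [5.0, -5.0],   # routes almost entirely to expert 0
        "classification": [0.0, 0.0],  # splits capacity evenly
    },
)
```

Calling `moe.forward(features, "segmentation")` and `moe.forward(features, "classification")` on the same features gives different outputs, which is the sense in which capacity is allocated per task while the encoder stays shared.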

Problem & Context

Ultrasound imaging is widely used in clinical care, but models typically focus on isolated tasks or limited combinations rather than enabling comprehensive multi-organ, multi-task analysis.
Foundation models aim to unify multiple clinical tasks, but recent studies report that unified models can underperform task-specific baselines.
It remains unclear which tasks can be effectively unified without inducing negative transfer, and how training data scale modulates these interactions.

Methodology

The authors evaluate three training paradigms:
- Task-specific (TS): Each task is trained independently.
- Clinically-grouped (CG): Tasks within predefined clinical groups (e.g., obstetrics, breast, lung) are trained jointly.
- All-task unified (AU): All 27 tasks are trained simultaneously.
M2DINO uses a shared DINOv3 encoder with task-conditioned MoE blocks to enable adaptive capacity allocation.
Experiments cover 27 ultrasound tasks spanning segmentation, classification, regression, and detection across diverse anatomical regions.
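The three paradigms differ only in how tasks are partitioned into joint training runs. The sketch below illustrates that partitioning under stated assumptions: the task names in `TASK_TO_GROUP` are hypothetical placeholders (this summary does not list the 27 actual tasks), and `training_runs` is an illustrative helper, not part of the paper's code.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical task names; the real 27 tasks are not listed in this summary.
TASK_TO_GROUP: Dict[str, str] = {
    "fetal_head_seg": "obstetrics",
    "fetal_plane_cls": "obstetrics",
    "breast_lesion_seg": "breast",
    "breast_lesion_cls": "breast",
    "lung_pattern_cls": "lung",
}


def training_runs(task_to_group: Dict[str, str], paradigm: str) -> List[List[str]]:
    """Return the sets of tasks that are trained jointly under a paradigm."""
    tasks = sorted(task_to_group)
    if paradigm == "TS":  # task-specific: one run per task
        return [[t] for t in tasks]
    if paradigm == "CG":  # clinically grouped: one run per clinical group
        by_group = defaultdict(list)
        for t in tasks:
            by_group[task_to_group[t]].append(t)
        return [by_group[g] for g in sorted(by_group)]
    if paradigm == "AU":  # all-task unified: a single joint run
        return [tasks]
    raise ValueError(f"unknown paradigm: {paradigm!r}")
```

With the five placeholder tasks, TS yields five runs, CG yields three, and AU yields one. The model and the tasks stay the same across paradigms; only the set of tasks sharing each training run changes.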

Data & Experimental Setup

The dataset contains 32,311 training and 8,077 validation samples across 27 tasks.
Tasks are divided into three clinical groups: obstetrics (OB, 11,910 samples), breast (2,920 samples), and lung (2,254 samples).
Evaluation metrics include Dice Similarity Coefficient (DSC) for segmentation, AUC/F1/MCC for classification, IoU for detection, and Mean Radial Error (MRE) for regression.
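For readers unfamiliar with these metrics, the sketch below gives textbook definitions of four of them in plain Python. It assumes flattened binary masks for DSC, axis-aligned boxes for IoU, confusion-matrix counts for MCC, and 2D landmark coordinates for MRE. Whether the paper uses exactly these variants (for example, its handling of empty masks) is not stated in this summary, and AUC and F1 are omitted for brevity.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient for flattened binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mean_radial_error(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and true landmarks."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)
```

DSC, IoU, AUC, F1, and MCC are higher-is-better, while MRE is an error and lower-is-better, which matters when averaging changes across tasks of different types.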

Results

In the data-rich OB group, both CG and AU training improve over TS on most tasks.
However, in the smaller Breast and Lung groups, CG falls below TS on average, while AU stays closer to the TS baseline.
The impact of CG vs. AU depends strongly on data scale:
- OB group (11,910 samples): CG +2.93%, AU +3.76% relative to TS.
- Breast group (2,920 samples): CG -0.29%, AU -0.02%.
- Lung group (2,254 samples): CG -0.07%, AU +0.07%.
Segmentation tasks show the largest performance drops and negative transfer, while regression and classification remain more stable.
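The group-level figures can be tabulated to make the comparison explicit. The short sketch below copies the reported average changes relative to TS and, for each group, computes the AU-minus-CG gap and flags which paradigms fall below TS. The helper `summarize` and the use of "below TS" as a working definition of negative transfer are assumptions made for illustration.

```python
from typing import Dict, List

# Reported average changes relative to TS, in percent, per clinical group.
REPORTED: Dict[str, Dict[str, float]] = {
    "OB":     {"samples": 11910, "CG": 2.93,  "AU": 3.76},
    "Breast": {"samples": 2920,  "CG": -0.29, "AU": -0.02},
    "Lung":   {"samples": 2254,  "CG": -0.07, "AU": 0.07},
}


def summarize(reported: Dict[str, Dict[str, float]]) -> List[dict]:
    """One row per group, ordered from largest to smallest group."""
    rows = []
    ordered = sorted(reported.items(), key=lambda kv: -kv[1]["samples"])
    for group, r in ordered:
        rows.append({
            "group": group,
            "samples": int(r["samples"]),
            # Positive gap means AU did better than CG in this group.
            "au_minus_cg": round(r["AU"] - r["CG"], 2),
            # Working definition: a paradigm below TS shows negative transfer.
            "below_ts": [p for p in ("CG", "AU") if r[p] < 0],
        })
    return rows


for row in summarize(REPORTED):
    print(row)
```

On these figures AU is at or above CG in all three groups, and only the two smaller groups show any paradigm below TS. Three groups are too few to establish a trend in group size on their own.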

Interpretation

Task aggregation effectiveness is governed by data scale and task characteristics, not just clinical taxonomy.
CG improves performance in data-rich settings but can induce negative transfer in low-data settings.
AU provides more stable cross-task transfer, suggesting a regularizing effect that reduces overfitting.
Segmentation tasks are particularly sensitive to aggregation design, underscoring the need for principled task selection.

Limitations & Uncertainties

The analysis is limited to a single backbone (DINOv3) and predefined clinical grouping strategies.
The findings are specific to ultrasound imaging, and it is unclear whether similar patterns generalize to other 2D or 3D medical imaging modalities.

What Comes Next

Explore alternative backbones, such as ultrasound-specific foundation models, and data-driven grouping schemes in place of predefined clinical groups.
Investigate whether the observed patterns generalize to other medical imaging domains beyond ultrasound.
Develop principled guidelines for task selection and aggregation in unified foundation models.

Sources:

Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Preprint, arXiv cs.AI, 3/20/2026