Story
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
Key takeaway
Researchers created a large, verified math dataset to help AI systems better understand visual mathematical reasoning. This could lead to AI assistants that can provide more useful support for math education and problem-solving.
Quick Explainer
The core idea behind DeepVision-103K is to create a large, diverse dataset of multimodal mathematical problems for training models with reinforcement learning from verifiable rewards. The dataset spans a broad range of visual elements, including geometry, analytic plots, and real-world objects in mathematical contexts, and covers diverse mathematical disciplines along with visual logic problems. A three-stage curation pipeline transforms real-world K12 problems into structured, verifiable question-answer pairs, ensuring the dataset's validity and broad coverage. Training on this dataset yields large multimodal models with enhanced visual perception, reflection, and reasoning capabilities that outperform models trained on other open-source datasets as well as official thinking variants.
Deep Dive
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
Overview
DeepVision-103K is a large-scale multimodal mathematical dataset designed for training reinforcement learning with verifiable rewards (RLVR). It features:
- Visual Diversity: DeepVision-103K covers major visual categories including geometry, analytic plots, charts, and real-world items in mathematical contexts, with richer element types than existing datasets.
- Broad Coverage: The dataset incorporates wide-ranging multimodal mathematical problems and visual logic problems, jointly enhancing mathematical and visual logic reasoning.
- Automatic Data Curation Pipeline: DeepVision-103K employs a three-stage pipeline of validity filtering, pass-rate-based difficulty calibration, and query correctness verification to transform diverse real-world K12 problems into structured and verifiable QA pairs.
Models trained on DeepVision-103K achieve state-of-the-art performance on both mathematical and general multimodal reasoning benchmarks, outperforming models trained on other open-source datasets as well as official thinking variants.
Problem & Context
Large language models (LLMs) trained with RLVR demonstrate remarkable reasoning capabilities. A key insight is that RLVR incentivizes thinking behaviors: the ability to decompose problems and self-correct during step-by-step reasoning. Recent work extends this paradigm to large multimodal models (LMMs), achieving enhanced visual reflection and reasoning abilities.
Central to this progress is high-quality training data, but existing datasets for multimodal RLVR exhibit several limitations:
- Synthetically Constructed Datasets: Fully generated with professional tools, these lack real-world mathematical scenarios, limiting robust generalization.
- Human-Annotated K12 Datasets: These depend on expert annotation, which limits scalability.
- Recombinations of Existing Datasets: These create no novel problems, producing overlap across datasets and a narrow data distribution.
Methodology
To address these limitations, we introduce the DeepVision-103K dataset, which consists of the following key components:
Visual Diversity
DeepVision-103K covers diverse visual elements across six categories: planar geometry, solid geometry, analytic plots, data charts, schematic diagrams, and real-world items. It also captures cross-category visual combinations, including real-world items placed in mathematical contexts.
Broad Coverage
DeepVision-103K spans four major mathematical disciplines: geometry, algebra, probability and statistics, and fundamental mathematical skills. It also incorporates visual logic problems such as mazes, chess, and Tetris.
Data Curation Pipeline
We employed a three-stage curation pipeline to transform diverse real-world K12 problems into structured and verifiable QA pairs (a code sketch follows the list):
- Validity Filtering: Remove inherently unsuitable problems, such as proof-based, descriptive, and multi-answer questions.
- Difficulty Filtering: Calibrate sample difficulty to model capability using rollout pass rates.
- Query Correctness Verification: Validate the correctness of image-question pairs and answers to eliminate corrupted samples.
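To make the three stages concrete, here is a minimal Python sketch of how such a curation loop might be wired together. The `Sample` fields, the `model.generate` and `verifier.judge` interfaces, the set of unsuitable question types, and the exact-match answer check are all illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str
    answer: str
    question_type: str  # e.g. "numeric", "proof", "descriptive" (assumed labels)

# Stage 1: validity filtering — drop problem types that cannot yield a
# single verifiable final answer.
UNSUITABLE_TYPES = {"proof", "descriptive", "multi_answer"}

def is_valid(sample: Sample) -> bool:
    return sample.question_type not in UNSUITABLE_TYPES

# Stage 2: pass-rate-based difficulty calibration — roll out the policy
# model several times and measure how often it solves the problem.
def pass_rate(sample: Sample, model, n_rollouts: int = 8) -> float:
    correct = sum(
        model.generate(sample.image_path, sample.question) == sample.answer
        for _ in range(n_rollouts)
    )
    return correct / n_rollouts

# Stage 3: query correctness verification — ask an external verifier model
# whether the image-question pair is coherent and the reference answer holds.
def query_is_correct(sample: Sample, verifier) -> bool:
    return verifier.judge(sample.image_path, sample.question, sample.answer)

def curate(samples, model, verifier, lo: float = 0.0, hi: float = 1.0):
    """Run all three stages; keep samples whose pass rate falls in (lo, hi)."""
    kept = []
    for s in samples:
        if not is_valid(s):
            continue
        if not (lo < pass_rate(s, model) < hi):  # drop always/never solved
            continue
        if not query_is_correct(s, verifier):
            continue
        kept.append(s)
    return kept
```

The open pass-rate band `(lo, hi)` mirrors the idea of discarding problems the current model always or never solves; the paper's actual thresholds and verifier prompts are not specified here.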
Data & Experimental Setup
We trained LMMs that already possess thinking capabilities, including MiMo-VL-7B-SFT-2508 and Qwen3-VL-8B-Instruct. Both models were exposed to visual reasoning data during pretraining or mid-training, and thus exhibit native visual thinking abilities.
We employed GSPO for RL training with rule-based rewards derived from answer correctness. We compared against closed-source models, official thinking variants, and models trained on other open-source datasets.
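To illustrate what a rule-based reward over answer correctness can look like, here is a minimal sketch. It assumes the model is prompted to emit its final answer in a \boxed{...} span and scores exact string matches; the paper's actual reward implementation and GSPO training loop are not reproduced here.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the last \\boxed{...} span from a model response.

    The \\boxed{} convention is an assumption about the prompt format,
    not something the paper specifies.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted answer matches."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    # Exact match after whitespace trimming; real pipelines often add a
    # symbolic checker so equivalent forms (e.g. 0.5 vs 1/2) also score.
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example: rule_based_reward("... so the area is \\boxed{42}.", "42") -> 1.0
```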
Results
Training on DeepVision-103K yields strong results in mathematical reasoning:
- Consistent Gains Across Benchmarks: Compared to their respective Instruct/SFT baselines, Qwen3-VL-8B-DeepVision and MiMo-VL-7B-DeepVision achieve uniform improvements across all evaluated benchmarks.
- Substantial Improvements: On WeMath and LogicVista, DeepVision models surpass their official thinking variants and closed-source models, reaching state-of-the-art results.
- Superiority Over Existing Open-Source Datasets: Compared to models trained on other open-source datasets, MiMo-VL-7B-DeepVision demonstrates clear advantages.
DeepVision models also generalize effectively to general-purpose multimodal tasks, achieving consistent improvements over foundation models and surpassing official thinking variants across all benchmarks. In contrast, models trained on other open-source datasets show limited improvements in general domains.
Interpretation
Our analyses reveal that training on DeepVision-103K enhances three key capabilities:
- Enhanced Visual Perception: DeepVision models demonstrate improved "one-shot perception", correctly identifying geometric shapes, numerical values, and spatial relationships in their initial observation, without requiring iterative re-examination.
- Enhanced Visual Reflection: When initial perceptual errors occur, DeepVision models exhibit a stronger capacity for genuine visual re-examination, actively re-counting elements, re-measuring angles, and re-inspecting spatial relationships.
- Enhanced Mathematical Reasoning: Beyond visual capabilities, RL fine-tuning also strengthens pure mathematical reasoning, with trained models demonstrating more rigorous problem-solving than the base model.
Furthermore, our ablation studies show that incorporating both multimodal math data and visual logic data is crucial for achieving the best performance, as the two domains contribute complementary reasoning skills.
Limitations
While DeepVision-103K substantially increases visual diversity, the distribution is imbalanced, and some rare element types remain underrepresented. The dataset relies on external models for query correctness verification, which may introduce bias and adds cost. It also focuses on K12-level problems with unique final answers, so it does not fully cover open-ended mathematical tasks that require richer evaluation signals.
What Comes Next
Future work could explore methods to further diversify the visual distribution, reduce reliance on external models, and expand the dataset to encompass more open-ended mathematical reasoning challenges. Investigating techniques to bridge the gap between mathematical and general multimodal reasoning could also be a promising direction.
