Story

ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly

Artificial Intelligence, Materials & Engineering

Key takeaway

Researchers created a robot assembly system that uses touch sensors to make precise adjustments when visual feedback is blocked, enabling more reliable and accurate assembly of complex products.

Quick Explainer

ReTac-ACT is a multimodal transformer-based policy that fuses visual and tactile inputs for high-precision peg-in-hole assembly. It has three key innovations: bidirectional cross-attention that integrates the visual and tactile signals, a state-gated dynamic fusion mechanism that weights the two modalities according to the task phase, and an auxiliary tactile reconstruction objective that pushes the tactile encoder to capture relevant contact geometry. Together these let ReTac-ACT shift from vision-dominant to tactile-dominant control as the end-effector makes contact, addressing the main weakness of pure-vision approaches, which struggle with severe occlusion in the critical "last-millimeter" phase.

Deep Dive

Technical Deep Dive: ReTac-ACT for Precision Assembly

Overview

ReTac-ACT is a vision-tactile policy for high-precision peg-in-hole assembly tasks. It uses a Transformer-based architecture with three key innovations, each described under Methodology: bidirectional cross-attention between visual and tactile features, state-gated dynamic fusion conditioned on proprioception, and an auxiliary tactile reconstruction objective.

Evaluated on the NIST Assembly Task Board (ATB) M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success at 3 mm clearance, substantially outperforming vision-only and generalist baseline methods. It maintains 80% success even at the challenging 0.1 mm clearance level.

Problem & Context

Precision assembly tasks, such as high-precision peg-in-hole insertion, require sub-millimeter corrections during the "last-millimeter" contact phase. However, vision-only methods struggle in this regime due to severe occlusion from the end-effector and workpiece. Tactile sensing can provide the critical real-time contact feedback needed for these fine adjustments, but integrating tactile signals into existing robotic manipulation frameworks remains a key challenge.

Methodology

ReTac-ACT extends the Action Chunking with Transformers (ACT) framework to natively process both visual and tactile inputs. The key innovations are:

Modality-Specific Encoders: Visual and tactile inputs are separately encoded to allow specialization in each sensory domain.
Bidirectional Cross-Attention: Visual and tactile features are enhanced through reciprocal cross-attention to enable effective multimodal integration.
State-Gated Dynamic Fusion: A proprioception-conditioned gating mechanism adaptively weights the visual and tactile modalities based on the task state, smoothly transitioning from vision-dominant to tactile-dominant modes.
Tactile Representation Learning: An auxiliary tactile reconstruction objective ensures the tactile encoder captures manipulation-relevant contact geometry rather than just generic textures.
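
To make the reciprocal attention step concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention, a residual connection, and randomly initialized projection matrices; the token counts, feature width, and every function and parameter name are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention in which `queries` tokens attend over `context` tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # shape (n_queries, n_context), rows sum to 1
    # Residual connection (an assumption): each query keeps its own features
    # and adds what it gathered from the other modality.
    return queries + weights @ v, weights


def bidirectional_cross_attention(visual, tactile, params):
    """Enhance each modality with information from the other one."""
    vis_out, vis_w = cross_attention(visual, tactile, *params["v2t"])
    tac_out, tac_w = cross_attention(tactile, visual, *params["t2v"])
    return vis_out, tac_out, vis_w, tac_w


# Hypothetical sizes: 12 visual tokens, 4 tactile tokens, 16-dim features.
rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(12, d))
tactile = rng.normal(size=(4, d))
params = {
    name: tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for name in ("v2t", "t2v")
}

vis_out, tac_out, vis_w, tac_w = bidirectional_cross_attention(visual, tactile, params)
print(vis_out.shape, tac_out.shape)  # (12, 16) (4, 16)
```

Both directions use separate projection matrices here, so the visual-to-tactile and tactile-to-visual passes can learn different things; whether the paper shares or separates these weights is not stated in this summary.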

Data & Experimental Setup

The evaluation is conducted on the NIST Assembly Task Board (ATB) M1 benchmark, which requires inserting cylindrical pegs into holes at three clearance levels: 3 mm, 1 mm, and 0.1 mm. The hardware setup includes a bimanual robot with high-resolution visual (three RGB cameras) and tactile (four optical tactile sensors) sensing.

The authors also release a comprehensive vision-tactile dataset comprising over 5,000 expert demonstration trajectories covering 5 geometric shapes and 4 clearance levels.

Results

ReTac-ACT achieves the following performance:

At 3 mm clearance, a 90% peg-in-hole success rate, compared with:
- ACT (vision-only): 40%
- Diffusion Policy (vision-only): 20%
- pi05 (generalist vision-language-action): 20%
At 0.1 mm clearance, an 80% success rate, while ACT drops to 15% and Diffusion Policy fails completely (0%).
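
The gaps behind these figures are easy to tabulate. The short script below only restates the success rates quoted above and computes percentage-point differences; the helper names are made up for this example, and pi05 is left out of the 0.1 mm row because this summary gives no figure for it there.

```python
# Success rates (%) as quoted in the Results list above.
SUCCESS = {
    "3 mm": {"ReTac-ACT": 90, "ACT": 40, "Diffusion Policy": 20, "pi05": 20},
    "0.1 mm": {"ReTac-ACT": 80, "ACT": 15, "Diffusion Policy": 0},
}


def margins(rates, reference="ReTac-ACT"):
    """Percentage-point lead of `reference` over every other method."""
    return {m: rates[reference] - r for m, r in rates.items() if m != reference}


def drop(method, loose="3 mm", tight="0.1 mm"):
    """Percentage points a method loses when moving from loose to tight clearance."""
    return SUCCESS[loose][method] - SUCCESS[tight][method]


for clearance, rates in SUCCESS.items():
    print(clearance, margins(rates))

print({m: drop(m) for m in SUCCESS["0.1 mm"]})
```

Read this way, ReTac-ACT's lead over ACT widens from 50 points at 3 mm to 65 points at 0.1 mm, and ReTac-ACT itself loses only 10 points between the two clearances.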

An ablation study confirms the importance of each architectural component:

Removing bidirectional cross-attention or reciprocal fusion reduces insertion success to 5%.
Removing the tactile reconstruction objective degrades performance across all phases.
Removing the state-gated fusion mechanism reduces insertion success from 90% to 35%.
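
The ablation figures can be put on the same footing. This sketch assumes the 90% full-model insertion success is the reference for every ablation, which the summary states explicitly only for the state-gated fusion case; the tactile reconstruction ablation is kept as unreported because no number is given for it.

```python
FULL_MODEL = 90  # insertion success (%) of the full model

# Insertion success (%) with one component removed.
# None means this summary gives no figure for that ablation.
ABLATIONS = {
    "bidirectional cross-attention": 5,
    "state-gated fusion": 35,
    "tactile reconstruction objective": None,
}


def ablation_drops(full, ablated):
    """Percentage points lost per removed component (None if unreported)."""
    return {
        name: (None if rate is None else full - rate)
        for name, rate in ablated.items()
    }


drops = ablation_drops(FULL_MODEL, ABLATIONS)
for name, lost in drops.items():
    detail = "not reported" if lost is None else f"-{lost} points"
    print(f"without {name}: {detail}")
```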

Interpretation

ReTac-ACT's advantage, which is largest at the tightest clearance, indicates that tactile sensing is essential for precision assembly tasks. As clearance tightens, pure-vision policies face increasingly severe difficulties from occlusion and limited image resolution, while tactile feedback enables real-time contact-based corrections.

The state-gated fusion mechanism allows ReTac-ACT to dynamically transition from vision-dominant to tactile-dominant modes, seamlessly adapting to the task phase. Visualization of the attention heatmaps shows that ReTac-ACT can instantly focus on task-relevant regions upon contact, in contrast to vision-only baselines.
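
The handover from vision to touch can be pictured with a toy gate. The sketch below is an illustration under strong assumptions, not the paper's mechanism: the gate is a single scalar, proprioception is reduced to one hypothetical number (height above the hole in mm), and the gate weights are hand-picked here rather than learned.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def state_gated_fusion(visual_feat, tactile_feat, proprio, w_gate, b_gate):
    """Blend two feature vectors with a scalar gate computed from proprioception."""
    g = sigmoid(proprio @ w_gate + b_gate)  # tactile weight, strictly between 0 and 1
    fused = (1.0 - g) * visual_feat + g * tactile_feat
    return fused, g


# Dummy features so the blend is easy to read: vision is all ones, touch all zeros.
visual_feat = np.ones(8)
tactile_feat = np.zeros(8)

# Hand-picked gate parameters (hypothetical): the tactile weight rises as height falls.
w_gate = np.array([-2.0])
b_gate = 2.0

for height_mm in (10.0, 2.0, 1.0, 0.0):
    _, g = state_gated_fusion(
        visual_feat, tactile_feat, np.array([height_mm]), w_gate, b_gate
    )
    print(f"height {height_mm:4.1f} mm -> tactile weight {g:.3f}")
```

With these hand-picked values the tactile weight is close to 0 far from the hole and about 0.88 at contact, so the fused feature moves smoothly from the visual vector toward the tactile one. In a trained policy the gate parameters would be learned from data and conditioned on the full proprioceptive state.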

Limitations & Uncertainties

The current evaluation is limited to cylindrical pegs. Extending to other NIST ATB shapes, especially non-axially-symmetric ones, remains future work, as tactile geometry may play an even more critical role in orientation alignment for those cases.

The authors also plan to investigate real-to-sim and sim-to-real transfer of tactile representations, as well as integration with large-scale vision-language-action pre-training to combine generalist knowledge with specialized tactile reasoning.

Source

ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly
Preprint, arXiv cs.RO, 3/19/2026