Story
ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly
Key takeaway
Researchers built a robot assembly policy that uses tactile sensors to make fine corrections when the camera view is occluded, making tight-tolerance peg-in-hole insertion markedly more reliable.
Quick Explainer
ReTac-ACT is a multimodal transformer-based policy that fuses visual and tactile inputs for high-precision peg-in-hole assembly. It has three key innovations: bidirectional cross-attention to integrate the visual and tactile signals, a state-gated dynamic fusion mechanism that adaptively weights the modalities based on the task phase, and an auxiliary tactile reconstruction objective that pushes the tactile encoder to capture relevant contact geometry. Together these let ReTac-ACT shift from vision-dominant to tactile-dominant control as the end-effector makes contact. This addresses a limitation of pure-vision approaches, which struggle with severe occlusion in the critical "last-millimeter" phase.
Deep Dive
Technical Deep Dive: ReTac-ACT for Precision Assembly
Overview
ReTac-ACT is a vision-tactile policy for high-precision peg-in-hole assembly tasks. It is built on a Transformer architecture and adds three components, each described under Methodology: bidirectional cross-attention between visual and tactile features, state-gated dynamic fusion conditioned on proprioception, and an auxiliary tactile reconstruction objective.
Evaluated on the NIST Assembly Task Board (ATB) M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success at 3 mm clearance, compared with 20% to 40% for the vision-only and generalist baselines. It maintains 80% success at the most demanding clearance level, 0.1 mm.
Problem & Context
Precision assembly tasks such as tight-clearance peg-in-hole insertion require sub-millimeter corrections during the "last-millimeter" contact phase. Vision-only methods struggle in this regime because the end-effector and workpiece severely occlude the contact region. Tactile sensing can provide the real-time contact feedback needed for these fine adjustments, but integrating tactile signals into existing robotic manipulation frameworks remains a key challenge.
Methodology
ReTac-ACT extends the Action Chunking with Transformers (ACT) framework to natively process both visual and tactile inputs. The key innovations are:
- Modality-Specific Encoders: Visual and tactile inputs are separately encoded to allow specialization in each sensory domain.
- Bidirectional Cross-Attention: Visual and tactile features are enhanced through reciprocal cross-attention to enable effective multimodal integration.
- State-Gated Dynamic Fusion: A proprioception-conditioned gating mechanism adaptively weights the visual and tactile modalities based on the task state, smoothly transitioning from vision-dominant to tactile-dominant modes.
- Tactile Representation Learning: An auxiliary tactile reconstruction objective ensures the tactile encoder captures manipulation-relevant contact geometry rather than just generic textures.
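To make the components above concrete, here is a minimal NumPy sketch of how they could fit together. It is not the authors' implementation: the single-head attention, the mean pooling, the scalar sigmoid gate, the linear decoder, and every dimension and parameter name are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are hypothetical; the summary does not give dimensions.
D = 32           # shared feature width
N_VIS = 48       # number of visual tokens
N_TAC = 16       # number of tactile tokens
N_PROP = 14      # proprioceptive state size
TAC_PIXELS = 64  # flattened size of a toy tactile reconstruction target


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return queries + softmax(scores, axis=-1) @ v


def init(*shape):
    return rng.normal(0.0, 0.1, size=shape)


params = {
    "vis_from_tac": [init(D, D) for _ in range(3)],  # W_q, W_k, W_v
    "tac_from_vis": [init(D, D) for _ in range(3)],
    "gate_w": init(N_PROP),
    "gate_b": 0.0,
    "decoder_w": init(D, TAC_PIXELS),
}


def fuse(vis, tac, proprio, p):
    # Bidirectional cross-attention: each modality queries the other.
    vis_enh = cross_attend(vis, tac, *p["vis_from_tac"])
    tac_enh = cross_attend(tac, vis, *p["tac_from_vis"])
    # State gate: a scalar in (0, 1) computed from proprioception alone.
    gate = float(sigmoid(proprio @ p["gate_w"] + p["gate_b"]))
    # gate near 0 is vision-dominant, gate near 1 is tactile-dominant.
    fused = (1.0 - gate) * vis_enh.mean(axis=0) + gate * tac_enh.mean(axis=0)
    return fused, gate


def reconstruction_loss(tac, tac_target, decoder_w):
    """Auxiliary loss: decode pooled tactile features back to the reading."""
    recon = tac.mean(axis=0) @ decoder_w
    return float(np.mean((recon - tac_target) ** 2))


vis = rng.normal(size=(N_VIS, D))
tac = rng.normal(size=(N_TAC, D))
proprio = rng.normal(size=N_PROP)
tac_target = rng.uniform(size=TAC_PIXELS)

fused, gate = fuse(vis, tac, proprio, params)
aux_loss = reconstruction_loss(tac, tac_target, params["decoder_w"])
print(fused.shape, round(gate, 3), round(aux_loss, 3))
```

In a trained model the gate would presumably rise as the proprioceptive state approaches contact, and the auxiliary loss would be added to the action-prediction loss with some weighting; both details are left out here because the summary does not specify them.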
Data & Experimental Setup
The evaluation is conducted on the NIST Assembly Task Board (ATB) M1 benchmark, which requires inserting cylindrical pegs into holes at three clearance levels: 3 mm, 1 mm, and 0.1 mm. The hardware setup includes a bimanual robot with high-resolution visual (three RGB cameras) and tactile (four optical tactile sensors) sensing.
The authors also release a comprehensive vision-tactile dataset comprising over 5,000 expert demonstration trajectories covering 5 geometric shapes and 4 clearance levels.
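As a rough illustration of what one timestep from such a setup might look like in code, the sketch below defines an observation container with three camera images and four tactile readings. Only the sensor counts come from the description above. The image sizes, the proprioception dimension, the assumption that each optical tactile sensor yields an RGB image, and all names are hypothetical and do not describe the released dataset's actual format.

```python
from dataclasses import dataclass

import numpy as np

NUM_CAMERAS = 3          # three RGB cameras, per the setup description
NUM_TACTILE_SENSORS = 4  # four optical tactile sensors, per the description


@dataclass
class Observation:
    rgb: np.ndarray      # (NUM_CAMERAS, H, W, 3) uint8
    tactile: np.ndarray  # (NUM_TACTILE_SENSORS, h, w, 3) uint8, assumed image-like
    proprio: np.ndarray  # (P,) float32 robot state

    def validate(self):
        """Check the sensor counts and array ranks, then return self."""
        if self.rgb.ndim != 4 or self.rgb.shape[0] != NUM_CAMERAS:
            raise ValueError(f"expected {NUM_CAMERAS} camera images")
        if self.tactile.ndim != 4 or self.tactile.shape[0] != NUM_TACTILE_SENSORS:
            raise ValueError(f"expected {NUM_TACTILE_SENSORS} tactile readings")
        if self.proprio.ndim != 1:
            raise ValueError("proprio must be a flat vector")
        return self


def dummy_observation(cam_hw=(96, 128), tac_hw=(32, 32), proprio_dim=14):
    """Build an all-zeros observation; every default size is a placeholder."""
    return Observation(
        rgb=np.zeros((NUM_CAMERAS, *cam_hw, 3), dtype=np.uint8),
        tactile=np.zeros((NUM_TACTILE_SENSORS, *tac_hw, 3), dtype=np.uint8),
        proprio=np.zeros(proprio_dim, dtype=np.float32),
    )


obs = dummy_observation().validate()
print(obs.rgb.shape, obs.tactile.shape, obs.proprio.shape)
```

Keeping the modalities in separate arrays mirrors the modality-specific encoders described under Methodology, since each encoder can then consume its own input without slicing a shared tensor.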
Results
ReTac-ACT achieves the following performance:
- 90% peg-in-hole success rate at 3 mm clearance, outperforming:
  - ACT (vision-only): 40%
  - Diffusion Policy (vision-only): 20%
  - pi05 (generalist vision-language-action): 20%
- Maintains 80% success rate even at the challenging 0.1 mm clearance level, while ACT drops to 15% and Diffusion Policy fails completely (0%).
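When reading success rates like these, it helps to know how much uncertainty a finite number of trials leaves. The sketch below computes a 95% Wilson score interval for a few of the reported rates. This summary does not state the number of trials per condition; the value of 20 used here is purely an assumption for illustration (rates such as 15% and 5% are consistent with it, but do not confirm it), so the resulting intervals are indicative only.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


ASSUMED_TRIALS = 20  # hypothetical; the trial count is not given in the text

# Success rates as reported in the results above.
reported = {
    "ReTac-ACT, 3 mm": 0.90,
    "ACT, 3 mm": 0.40,
    "ReTac-ACT, 0.1 mm": 0.80,
    "ACT, 0.1 mm": 0.15,
    "Diffusion Policy, 0.1 mm": 0.00,
}

for name, rate in reported.items():
    successes = round(rate * ASSUMED_TRIALS)
    low, high = wilson_interval(successes, ASSUMED_TRIALS)
    print(f"{name}: {rate:.0%} with interval [{low:.2f}, {high:.2f}]")
```

The Wilson interval is used rather than the simpler normal approximation because it stays within [0, 1] and behaves sensibly for rates at or near 0% and 100%, which occur in these results.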
An ablation study confirms the importance of each architectural component:
- Removing bidirectional cross-attention or reciprocal fusion reduces insertion success to 5%.
- Removing the tactile reconstruction objective degrades performance across all phases.
- Removing the state-gated fusion mechanism reduces insertion success from 90% to 35%.
Interpretation
ReTac-ACT's advantage, which is largest at tight clearances, underscores how much precision assembly depends on tactile sensing. As clearance tightens, pure-vision policies degrade sharply because of severe occlusion and limited image resolution, while tactile feedback still supports real-time, contact-based corrections.
The state-gated fusion mechanism lets ReTac-ACT shift from vision-dominant to tactile-dominant modes as the task phase changes. Visualizations of the attention heatmaps show ReTac-ACT focusing on task-relevant regions as soon as contact occurs, in contrast to the vision-only baselines.
Limitations & Uncertainties
The current evaluation is limited to cylindrical pegs. Extending to other NIST ATB shapes, especially non-axially-symmetric ones, remains future work, as tactile geometry may play an even more critical role in orientation alignment for those cases.
The authors also plan to investigate real-to-sim and sim-to-real transfer of tactile representations, as well as integration with large-scale vision-language-action pre-training to combine generalist knowledge with specialized tactile reasoning.
