Story
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Key takeaway
A new AI system can automatically translate instruction manuals into robotic assembly plans that properly account for the physical constraints of connectors, enabling more reliable robotic assembly.
Quick Explainer
The core idea of Manual2Skill++ is to enable robots to assemble complex objects from instruction manuals by explicitly modeling the physical connections between parts. It does this by extracting a hierarchical graph representation from the manuals, encoding the assembly structure, component relationships, and detailed connector information. A vision-language model is used to parse the manuals and extract this structured knowledge, which is then used to compute precise part poses and enable reliable assembly execution. This novel connector-aware approach addresses a critical gap in existing assembly methods, which have focused more on part geometry and dynamics while overlooking the fundamental role of connectors in successful assembly.
Deep Dive
Overview
This paper introduces a framework called Manual2Skill++ that enables robotic assembly from instruction manuals by explicitly modeling the physical connections between parts. The key contributions are:
- A hierarchical graph representation that encodes the assembly structure, component relationships, and detailed connector information.
- A vision-language model (VLM) framework that extracts this structured representation from assembly manuals, parsing symbolic diagrams and annotations.
- A simulation benchmark with four complex assembly tasks featuring diverse connector types (mortise-tenon, dowels, screws) to evaluate the end-to-end assembly pipeline.
The proposed approach bridges the gap between high-level task understanding and low-level execution by directly computing precise part poses from the extracted connection constraints, enabling robust assembly across a range of realistic scenarios.
Problem & Context
Assembly tasks fundamentally concern the reliable connection of individual components into a coherent whole. While significant progress has been made in robotic execution and spatial reasoning, existing methods predominantly focus on part geometry and dynamics, treating the underlying connection relationships as secondary.
In practical assembly scenarios, the success of a task depends not just on spatial alignment, but on the informed selection and application of specific connectors (e.g., adhesives, mortise-tenon joints, screws). There remains a critical gap in interpreting and utilizing structured connection knowledge, which severely limits the applicability of existing methods to real-world assembly.
Humans naturally learn assembly procedures from instruction manuals, which encode connection-level information that is essential for task understanding and execution. However, prior work has largely overlooked this connector-level information when extracting task structure from manuals.
Methodology
Hierarchical Graph Representation
The authors propose a hierarchical graph representation for assembly tasks, where:
- Leaf nodes represent individual parts and intermediate nodes represent sub-assemblies
- Composition edges connect parent-child nodes
- Equivalence edges link identical parts
- Connection edges encode connector types, quantities, and spatial constraints between sibling nodes
This abstraction bridges high-level task understanding and low-level robotic execution.
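The four edge and node types above can be sketched as a small data structure. The class and field names below are illustrative assumptions for exposition, not the authors' actual schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the hierarchical assembly graph; all names here
# (Node, ConnectionEdge, "seat_panel", ...) are hypothetical examples.

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)  # composition edges

@dataclass
class ConnectionEdge:
    src: str        # attachment point on one sibling, e.g. "seat_panel/hole_1"
    dst: str        # mating attachment point on the other sibling
    connector: str  # connector type: "dowel", "screw", "mortise-tenon", ...
    quantity: int = 1

# Leaf nodes are individual parts; an intermediate node is a sub-assembly.
seat = Node("seat_panel")
leg = Node("leg_front_left")
sub = Node("seat_with_leg", children=[seat, leg])

# Equivalence edges link interchangeable parts (e.g. four identical legs).
equivalent_parts = [("leg_front_left", "leg_front_right")]

# A connection edge between the two siblings of `sub`.
connections = [
    ConnectionEdge("seat_panel/hole_1", "leg_front_left/peg_1", "dowel", 2)
]
```

Keeping connection edges between sibling nodes (rather than globally) ties each connector to the assembly step that uses it.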
VLM-Guided Graph Extraction
The authors introduce Manual2Skill++, a framework that leverages a vision-language model (VLM) to parse instruction manuals and extract the hierarchical graph representation:
- The VLM first estimates the number and types of connectors required for each assembly step.
- It then identifies the precise attachment point pairs that should be connected by each connector.
By extracting this connection-level information, Manual2Skill++ enables the direct computation of relative part poses via geometric constraint optimization, achieving significantly higher accuracy than prior pose estimation methods.
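A minimal sketch of how the two-stage output might be consumed downstream. The JSON schema below is a hypothetical illustration; the paper's actual prompts and response formats are not specified in this summary:

```python
import json

# Hedged sketch: two-stage VLM output for one assembly step, with an
# assumed JSON schema (field names here are illustrative, not the paper's).

# Stage 1: connector counts estimated from the manual step.
stage1 = json.loads('{"step": 3, "connectors": {"dowel": 2}}')

# Stage 2: the precise attachment-point pairs each connector should join.
stage2 = json.loads("""
{"pairs": [
  {"connector": "dowel", "from": "seat_panel/hole_1", "to": "leg/peg_1"},
  {"connector": "dowel", "from": "seat_panel/hole_2", "to": "leg/peg_2"}
]}
""")

# Cross-check: stage-2 pairs must be consistent with stage-1 counts,
# per connector type, before the pairs are passed to pose optimization.
counts = stage1["connectors"]
pairs = stage2["pairs"]
for ctype, n in counts.items():
    assert sum(p["connector"] == ctype for p in pairs) == n
```

Validating the second stage against the first gives a cheap consistency check before committing to a pose computation.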
Simulation Benchmark
The authors develop a simulation benchmark in Isaac Lab featuring four complex assembly tasks with diverse connector types (mortise-tenon, dowels, screws). Unlike previous work, this benchmark incorporates full-scale assets and realistic connection mechanics, providing a robust testbed for evaluating task planning and contact-rich control under authentic assembly constraints.
Results
Hierarchical Graph Generation
The authors curate a dataset of 21 diverse assembly tasks and evaluate their VLM-based approach for extracting the hierarchical graph representation:
- Their method substantially outperforms random-sampling and geometric-heuristic baselines, indicating that recovering the assembly structure is non-trivial.
- Reasoning VLMs exceed non-reasoning models by over 20% in performance, demonstrating the dataset's demand for advanced visual-spatial reasoning.
- The approach remains robust even when a manual step is omitted, correctly inferring the missing connection information.
Pose Alignment
The authors' constraint-matching approach consistently achieves millimeter-level alignment accuracy, significantly outperforming baseline methods that incur centimeter-level errors.
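Once attachment-point pairs are matched, the relative part pose can be recovered with a standard rigid-alignment (Kabsch/Procrustes) solve. The sketch below is a generic instance of that technique, consistent with constraint-based alignment but not necessarily the authors' exact solver:

```python
import numpy as np

def align_attachment_points(src, dst):
    """Rigid transform (R, t) minimizing sum ||R @ src_i + t - dst_i||^2
    over matched attachment-point pairs, via the Kabsch/Procrustes SVD."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t

# Hypothetical example: dowel pegs on a leg and their mating holes on a
# seat panel, related by a 90-degree rotation plus a translation.
leg_pts = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
seat_pts = leg_pts @ R_true.T + np.array([0.1, 0.2, 0.3])

R, t = align_attachment_points(leg_pts, seat_pts)
# The recovered transform maps the leg's pegs exactly onto the seat's holes.
assert np.allclose(leg_pts @ R.T + t, seat_pts, atol=1e-8)
```

Because the transform is computed in closed form from the connection constraints, its error is bounded by the attachment-point coordinates themselves, which is consistent with the millimeter-level accuracy reported.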
Connection Execution
Evaluating the end-to-end assembly pipeline, the authors' force-position hybrid strategy achieves success rates ranging from 73.7% to 81.6% across the four benchmark tasks, demonstrating the reliability provided by the connection-enriched hierarchical representation and precise pose alignment.
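To illustrate the control idea behind a force-position hybrid, here is a toy one-dimensional insertion loop with an assumed linear spring contact model. It is a sketch of the general strategy only; the authors' controller operates on full contact dynamics in Isaac Lab:

```python
# Toy 1-D force-position hybrid insertion. All parameters and the contact
# model are assumptions for illustration, not the authors' controller.

def hybrid_insert(target_depth=0.03, contact_at=0.02, stiffness=500.0,
                  f_desired=10.0, v_free=0.05, kf=5e-4, dt=1e-3,
                  max_steps=100_000):
    z = 0.0                                          # insertion depth (m)
    f = 0.0                                          # sensed contact force (N)
    for _ in range(max_steps):
        f = max(0.0, stiffness * (z - contact_at))   # linear contact model
        if f < 0.5:
            z += v_free * dt                 # position mode: free approach
        else:
            z += kf * (f_desired - f) * dt   # force mode: regulate force
        if z >= target_depth:
            break
    return z, f

depth, force = hybrid_insert()
```

The switch from position to force control on contact is what makes the insertion tolerant to small pose errors: the regulated force, not a blind trajectory, drives the connector home.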
Interpretation
The authors' key contributions are:
- A hierarchical graph representation that explicitly models connection relationships, bridging high-level task understanding and low-level execution.
- A VLM-based framework that extracts this structured representation from assembly manuals, leveraging the rich connector-level information embedded in human-designed instructions.
- A simulation benchmark featuring realistic, connector-aware assembly scenarios to evaluate automated assembly systems.
By focusing on connection modeling, the authors address a critical gap in existing assembly approaches, which have predominantly focused on part geometry and dynamics while overlooking the fundamental role of connectors. The presented framework provides a robust foundation for developing truly generalizable robotic assembly systems.
Limitations & Uncertainties
The authors acknowledge that their approach assumes access to precise 3D models of the parts, which may not always be available in real-world settings. Additionally, the performance of the VLM-based graph extraction may be sensitive to the quality and diversity of the training data.
What Comes Next
The authors suggest that their work lays the groundwork for future advances in connector-aware manipulation, multimodal task understanding, and the development of truly generalizable assembly systems. Potential next steps could include exploring the use of lower-fidelity or partial 3D models, as well as investigating techniques to improve the robustness and generalization of the VLM-based graph extraction.
