TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning

Artificial Intelligence · Materials & Engineering

Key takeaway

Researchers developed a system called TrajBooster that allows robots to quickly learn new manipulation tasks, even with limited training data. This could make humanoid robots more versatile and useful in real-world environments.


Quick Explainer

TrajBooster tackles the challenge of wide-range, whole-body manipulation for bipedal humanoids. It extracts end-effector trajectories from a large-scale dataset of wheeled humanoids and transfers them to the target bipedal platform through a real-to-sim-to-real pipeline: a hierarchical retargeting model is trained in simulation to track the trajectories with whole-body control, and the retargeted data is then used to fine-tune a pre-trained vision-language-action (VLA) model on the real humanoid. This mitigates the data scarcity that has limited prior work, aligns the model with the new platform's action space, and yields improved robustness and zero-shot transfer.

Deep Dive

Technical Deep Dive: TrajBooster for Bipedal Whole-Body Manipulation

Overview

This work presents TrajBooster, a cross-embodiment framework that leverages end-effector trajectories to boost the performance of vision-language-action (VLA) models for bipedal humanoid whole-body manipulation. By transferring large-scale demonstrations from wheeled humanoids to the target bipedal Unitree G1 platform, TrajBooster mitigates the data scarcity challenge that has limited previous VLA research in this domain.

Problem & Context

Recent advancements in humanoid manipulation have enabled autonomous household task execution with improved reliability and generalization. However, a critical gap remains in enabling wide-range, whole-body manipulation for bipedal humanoids. This capability requires large-scale demonstrations, yet existing teleoperation pipelines yield datasets that are small and limited in diversity.

As a result, VLA models struggle to align with the action spaces of new humanoid platforms during post-training. While pretraining on heterogeneous robot corpora helps, it cannot replace high-quality, humanoid-relevant, whole-body demonstrations with sufficient coverage.

Methodology

TrajBooster addresses this problem with a real-to-sim-to-real pipeline:

  1. Real Trajectory Extraction: End-effector trajectories are extracted from the Agibot-World dataset, a large-scale wheeled humanoid dataset.
  2. Retargeting in Simulation: A hierarchical retargeting model is trained in the Isaac Gym simulator to track the extracted trajectories using whole-body control for the target Unitree G1 humanoid.
  3. Fine-tuning on the Real Humanoid: The retargeted data drives an intermediate "post-pre-training" stage on top of a pre-trained VLA model, which is then fine-tuned with just 10 minutes of real-world teleoperation data on the Unitree G1.
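As a rough illustration of the data flow in these three stages, the pipeline could be sketched as below. This is not the authors' code; every function name and data shape here is hypothetical.

```python
# Illustrative sketch of TrajBooster's real-to-sim-to-real data flow.
# All names and data shapes are hypothetical, not from the paper's codebase.

def extract_ee_trajectories(dataset):
    """Stage 1: pull end-effector (EE) pose sequences out of the
    wheeled-humanoid source demonstrations."""
    return [episode["ee_poses"] for episode in dataset]

def retarget_in_sim(ee_trajectories, whole_body_controller):
    """Stage 2: in simulation, a whole-body controller tracks each EE
    pose on the target bipedal humanoid, producing actions compatible
    with the new platform."""
    return [[whole_body_controller(pose) for pose in traj]
            for traj in ee_trajectories]

def build_training_stages(retargeted_data, teleop_data):
    """Stage 3: an ordered curriculum — the retargeted sim data drives a
    'post-pre-training' pass over the pre-trained VLA, followed by
    fine-tuning on a small amount of real teleoperation data."""
    return [("post-pre-train", retargeted_data),
            ("fine-tune", teleop_data)]
```

The point of the sketch is the ordering: action-space alignment happens in bulk on cheap retargeted data before any scarce real-robot data is touched.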

Data & Experimental Setup

  • The Agibot-World dataset is used as the source of real-world manipulation data, containing over 1 million trajectories.
  • The retargeted data is generated in the Isaac Gym simulator, creating a large-scale dataset compatible with the Unitree G1 humanoid.
  • For real-world fine-tuning, 28 episodes of Unitree G1 teleoperation data across 4 different height configurations are collected.

Results

  1. Improved Trajectory Retargeting: The proposed hierarchical model with harmonized online DAgger outperforms baselines like PPO and standard DAgger in tracking performance.
  2. Accelerated Adaptation to Humanoid Action Space: The post-pre-trained VLA model achieves higher success rates on real-world tasks compared to models trained solely on the limited teleoperation data.
  3. Enhanced Trajectory Generalization: The post-pre-trained model demonstrates improved robustness to variations in object placement, outperforming the model trained only on real-world data.
  4. Unlocking Zero-shot Skill Generalization: The post-pre-trained VLA successfully executes a task (Pass the Water) that was included in the simulation dataset but not the teleoperation data, showcasing zero-shot transfer capabilities.
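The first result compares the retargeting model's training scheme against PPO and standard DAgger. As a hedged illustration of what a generic online DAgger iteration looks like (the paper's "harmonized" variant adds details not reproduced here, and all names below are hypothetical):

```python
def dagger(expert, learner_fit, env_rollout, n_iters=3):
    """Generic online DAgger loop (illustrative only).
    - expert: maps a visited state to a target action
      (e.g. a privileged trajectory tracker in simulation)
    - learner_fit: trains a policy on the aggregated (state, action) set
    - env_rollout: runs a policy and returns the states it visits
    """
    data = []
    policy = learner_fit(data)  # start from an initial (empty) policy
    for _ in range(n_iters):
        states = env_rollout(policy)                # visit states under the learner
        data += [(s, expert(s)) for s in states]    # relabel them with expert actions
        policy = learner_fit(data)                  # retrain on the aggregated dataset
    return policy
```

The key idea DAgger brings over plain behavior cloning is that the expert labels states the *learner* actually visits, which counters compounding tracking error.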

Limitations & Uncertainties

  • The Unitree Dex-3 end-effector limits tasks to simple pick-and-place due to limited precision. Future work will employ dexterous hands with tactile sensing.
  • The method replaces only the action space while keeping the source visual observations, so improved embodiment alignment in the visual domain is still needed.
  • The lack of large-scale loco-manipulation data confines the study to mostly stationary tasks. Future work will extend to richer mobile scenarios.
  • The experiments are limited by the scale of the dataset and computational resources. Incorporating more heterogeneous data beyond Agibot G1 is planned.

What Comes Next

The authors intend to address the limitations identified in this work, including:

  • Exploring dexterous hands with tactile sensing for advanced manipulation tasks
  • Aligning visual observations with the target embodiment to improve perception-action consistency
  • Expanding the framework to richer mobile scenarios with larger-scale loco-manipulation data
  • Scaling the experiments by incorporating more heterogeneous robot data beyond the Agibot G1
