Curious Now

ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning

Computing · Artificial Intelligence

Key takeaway

Researchers developed a new technique that learns efficient policy representations while relying less on expensive expert demonstrations. This could lead to more accessible AI systems that better imitate human decision-making.


Quick Explainer

ACT-JEPA is a novel architecture that learns efficient policy representations by unifying imitation learning and self-supervised world model learning. The key idea is to jointly predict both action sequences and latent observation sequences, so the model learns executable actions alongside a robust understanding of the environment's dynamics. This joint optimization of the two objectives, rather than the conventional two-stage approach, is what distinguishes ACT-JEPA. The architecture consists of context and target encoders, a predictor, and a decoder, which together filter out irrelevant details and improve computational efficiency compared to approaches that predict in raw input space.

Deep Dive

Technical Deep Dive: ACT-JEPA

Overview

ACT-JEPA is a novel end-to-end architecture that unifies imitation learning (IL) and self-supervised learning (SSL) objectives to enhance policy representation learning. It is designed to jointly predict action sequences and latent observation sequences, enabling it to learn both executable actions and a robust world model.

Problem & Context

  • Learning efficient representations for decision-making policies is a challenge in imitation learning (IL): current IL methods require expensive expert demonstrations and are not explicitly trained to model the environment.
  • Self-supervised learning (SSL) can learn a world model from diverse, unlabeled data, but most SSL methods operate in raw input space, which is computationally inefficient.

Methodology

  • ACT-JEPA utilizes the Joint-Embedding Predictive Architecture (JEPA) to learn in latent space, filtering out irrelevant details and improving computational efficiency.
  • The architecture consists of four main components: context encoder, target encoder, predictor, and decoder.
  • ACT-JEPA is trained end-to-end to jointly optimize two objectives: 1) predicting action sequences using IL, and 2) predicting latent observation sequences using SSL.
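The joint objective above can be sketched in a few lines. This is a minimal numpy toy, not the paper's implementation: the linear stand-ins for the four components, the dimensions, and the loss weighting are all illustrative assumptions. It only shows how one training step combines an IL action loss with an SSL latent-prediction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; the paper's architecture is far larger).
OBS_DIM, LATENT_DIM, ACT_DIM, HORIZON = 16, 8, 4, 5

# Hypothetical linear stand-ins for the four components.
W_ctx = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1      # context encoder
W_tgt = W_ctx.copy()                                      # target encoder (e.g. a frozen/EMA copy)
W_pred = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1  # predictor in latent space
W_dec = rng.normal(size=(LATENT_DIM, ACT_DIM)) * 0.1      # action decoder

def joint_loss(obs, next_obs, expert_actions, ssl_weight=1.0):
    """Combine the IL action loss with the SSL latent-prediction loss."""
    z = obs @ W_ctx            # encode current observations
    z_pred = z @ W_pred        # predict the next latent states
    z_next = next_obs @ W_tgt  # target latents (no gradient flows here in practice)
    a_pred = z_pred @ W_dec    # decode an action sequence

    il_loss = np.mean((a_pred - expert_actions) ** 2)  # imitation objective
    ssl_loss = np.mean((z_pred - z_next) ** 2)         # world-model objective
    return il_loss + ssl_weight * ssl_loss

obs = rng.normal(size=(HORIZON, OBS_DIM))
next_obs = rng.normal(size=(HORIZON, OBS_DIM))
expert_actions = rng.normal(size=(HORIZON, ACT_DIM))
loss = joint_loss(obs, next_obs, expert_actions)
```

Because both terms are summed into one scalar, a single gradient step updates the encoder, predictor, and decoder together, which is the end-to-end property the paper emphasizes.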

Data & Experimental Setup

  • Evaluated on three simulated environments: Push-T, Meta-World, and ManiSkill.
  • Compared to two behavior cloning baselines: an autoregressive transformer policy and the ACT policy.

Results

  1. World Model Evaluation:
    • ACT-JEPA outperforms the ACT baseline by 29-40% in world model understanding, as measured by lower error in predicting future proprioceptive states.
    • ACT-JEPA's world model representations become increasingly useful for downstream policy learning during pretraining.
  2. Decision-Making Performance:
    • ACT-JEPA achieves up to 10% higher task success rate compared to the baselines across the evaluated environments.
    • The joint optimization of IL and SSL objectives outperforms the conventional two-stage approach, demonstrating the importance of aligning both objectives.
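The world-model evaluation above measures how well predicted states match real future states. A hedged sketch of that kind of metric, under the assumption that prediction error is averaged over an open-loop latent rollout (the encoder and predictor here are toy linear maps, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(1)
OBS_DIM, LATENT_DIM, STEPS = 16, 8, 4

# Illustrative stand-ins for a trained encoder and latent predictor.
W_enc = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1
W_pred = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1

def rollout_error(trajectory):
    """Roll the predictor forward from the first observation and compare
    each predicted latent against the encoding of the true future state."""
    z = trajectory[0] @ W_enc
    errors = []
    for obs_true in trajectory[1:]:
        z = z @ W_pred              # open-loop latent prediction
        z_true = obs_true @ W_enc   # encode the ground-truth future state
        errors.append(np.mean((z - z_true) ** 2))
    return float(np.mean(errors))

traj = rng.normal(size=(STEPS + 1, OBS_DIM))
err = rollout_error(traj)
```

A lower average error indicates the learned dynamics track the environment more faithfully, which is the sense in which ACT-JEPA's 29-40% improvement over ACT is reported.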

Interpretation

  • Integrating IL and JEPA-based SSL allows ACT-JEPA to learn efficient representations that capture both executable actions and environment dynamics.
  • Learning a JEPA-based world model in addition to IL enhances policy robustness and generalization, especially in environments where performance is not saturated.
  • The end-to-end joint optimization of IL and SSL objectives is more effective than the conventional two-stage approach, particularly under constrained data and computational resources.
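The contrast between joint and two-stage training comes down to which objectives receive gradients at each step. A plain-Python illustration of that difference (the function name and step counts are hypothetical, chosen only to make the schedules explicit):

```python
def schedule(mode, total_steps, pretrain_steps):
    """Return, per training step, which objectives would receive gradients.

    mode="joint":     both losses are optimized at every step (ACT-JEPA style).
    mode="two_stage": SSL pretraining first, then IL fine-tuning on top.
    """
    steps = []
    for t in range(total_steps):
        if mode == "joint":
            steps.append(("il", "ssl"))    # both objectives, every step
        elif t < pretrain_steps:
            steps.append(("ssl",))         # stage 1: world model only
        else:
            steps.append(("il",))          # stage 2: policy objective only
    return steps
```

Under a fixed step budget, the joint schedule gives the IL objective twice as many gradient updates as the two-stage one, which is one intuition for why joint optimization helps when data and compute are constrained.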

Limitations & Uncertainties

  • Evaluation is limited to simulated environments, without testing in real-world robotic settings.
  • Focused on proprioceptive states as the target modality for the JEPA world model, leaving other modalities (e.g., images, touch) unexplored.
  • Dataset diversity and size are limited compared to large-scale real-world datasets, which may be necessary to fully harness the advantages of JEPA.

What Comes Next

  • Scaling ACT-JEPA to larger, more diverse datasets and evaluating in real-world robotic environments.
  • Exploring the integration of additional modalities (e.g., images, depth, tactile feedback) into the JEPA world model.
  • Investigating the potential of ACT-JEPA as a foundation policy model that can be adapted to a wide range of tasks and domains.
