Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Computing / Artificial Intelligence

Key takeaway

Researchers built a benchmark that tests whether web agents can resolve ambiguous user queries by drawing on the user's browsing history, a capability that could make web assistance more personalized and context-aware.

Quick Explainer

Persona2Web aims to benchmark the capability of personalized web agents to infer user context and preferences from their browsing histories, rather than relying on explicit instructions. It constructs realistic user profiles and ambiguous query sets that challenge agents to personalize their responses. The key components are a personalization module with planner, retriever, and generator sub-modules, and a reasoning-aware evaluation that distinguishes personalization failures from navigation failures. Persona2Web is the first benchmark of its kind, highlighting fundamental gaps in current personalization capabilities and providing a foundation for advancing personalized web agents that can effectively leverage user context.

Deep Dive

Introduction

Large language models have advanced web agents, yet current agents lack personalization capabilities
Users rarely specify every detail of their intent, so practical web agents must infer user preferences and contexts
Existing benchmarks fail to provide realistic user context or handle ambiguous queries that require personalization

Methodology

User History

Rigorously constructed user histories reveal preferences implicitly over long time spans, rather than providing them explicitly
Profiles include demographic information and domain-specific preferences across 21 web domains
Event seeds define recurring activity patterns based on user preferences and routines
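
As a rough illustration of how event seeds could unroll into a long history that only implies preferences, here is a minimal sketch. The class names, fields, example sites, and the fixed-interval recurrence are all assumptions made for illustration; the source does not describe the benchmark's actual generation procedure.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EventSeed:
    """A recurring activity pattern (hypothetical fields, not the paper's schema)."""
    domain: str      # web domain category, e.g. "groceries"
    website: str     # site the user habitually visits
    action: str      # what the user does there
    every_days: int  # assumed fixed recurrence interval


@dataclass(frozen=True)
class HistoryEvent:
    day: date
    domain: str
    website: str
    action: str


def expand_seeds(seeds, start, days):
    """Unroll each seed into dated events across a time span of `days` days."""
    events = []
    for seed in seeds:
        for offset in range(0, days, seed.every_days):
            events.append(
                HistoryEvent(start + timedelta(days=offset),
                             seed.domain, seed.website, seed.action)
            )
    # Chronological order; the preference is never stated, only implied by repetition.
    return sorted(events, key=lambda e: e.day)


seeds = [
    EventSeed("groceries", "freshmart.example", "ordered oat milk", 7),
    EventSeed("travel", "skyfare.example", "searched window-seat flights", 30),
]
history = expand_seeds(seeds, date(2025, 1, 1), 60)
```

In this toy, the weekly grocery seed dominates the history, so a preference for oat milk has to be read off the repeated events rather than from any explicit statement.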

Ambiguous Query Sets

Follows a "clarify-to-personalize" principle: explicit details are intentionally masked, so agents must infer the missing context from user history
Query sets have 3 levels: Level 0 (clear), Level 1 (preference explicit, website masked), Level 2 (both masked)
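
The three masking levels can be sketched as a small rendering function. The `Query` fields, the example task, and the output phrasing are hypothetical; only the level definitions (what is explicit and what is masked) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical fully specified task before any masking."""
    task: str
    preference: str
    website: str


def render(query, level):
    """Return the query text an agent would see at a given ambiguity level."""
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    text = query.task
    if level <= 1:   # Levels 0 and 1 keep the preference explicit
        text += f" ({query.preference})"
    if level == 0:   # only Level 0 names the website
        text += f" on {query.website}"
    return text


q = Query("Order milk", "oat milk, 1 litre", "freshmart.example")
print(render(q, 0))  # Order milk (oat milk, 1 litre) on freshmart.example
print(render(q, 1))  # Order milk (oat milk, 1 litre)
print(render(q, 2))  # Order milk
```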

Reasoning-aware Evaluation

Evaluates personalization, intent satisfaction, and success rate using structured rubrics
Distinguishes personalization failures from navigation failures by examining reasoning traces
Uses GPT-5-mini as an LLM judge to score agent trajectories
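
To make the distinction between personalization and navigation failures concrete, the sketch below assumes the judge has already reduced each trajectory to two boolean verdicts. That reduction, the field names, and the labels are illustrative assumptions; the benchmark's actual rubrics and judge prompts are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedTrajectory:
    """Hypothetical judge output for one agent trajectory."""
    personalization_ok: bool  # reasoning trace used the right user context
    navigation_ok: bool       # the agent carried out the web actions correctly


def diagnose(t):
    """Attribute the outcome of a trajectory to personalization, navigation, or both."""
    if t.personalization_ok and t.navigation_ok:
        return "success"
    if t.personalization_ok:
        return "navigation failure"
    if t.navigation_ok:
        return "personalization failure"
    return "personalization and navigation failure"


def success_rate(trajectories):
    """Fraction of trajectories that succeed on both dimensions."""
    if not trajectories:
        return 0.0
    wins = sum(1 for t in trajectories if diagnose(t) == "success")
    return wins / len(trajectories)


judged = [
    JudgedTrajectory(True, True),
    JudgedTrajectory(True, False),
    JudgedTrajectory(False, True),
    JudgedTrajectory(False, False),
]
```

A plain success rate would score the last three trajectories identically as failures; keeping the two verdicts separate shows which capability broke down.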

Agent Architecture

Augments generic web agent architectures (AgentOccam, Browser-Use) with a personalization module
Personalization module has 3 components: planner, retriever, generator
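
The source names the three components but does not spell out their division of labor. The sketch below assumes one plausible split: the planner turns the task into a personalization query, the retriever matches that query against history entries, and the generator folds the most frequent match back into the instruction. The keyword matching, entry format, and example sites are stand-ins chosen for illustration, not the authors' implementation.

```python
from collections import Counter


def plan(task):
    """Planner (assumed role): derive a keyword query for the user history."""
    words = (w.strip(".,!?").lower() for w in task.split())
    return {w for w in words if len(w) > 3}


def retrieve(query, history):
    """Retriever (assumed role): keep entries sharing a keyword with the query."""
    return [e for e in history if query & set(e["text"].lower().split())]


def generate(task, entries):
    """Generator (assumed role): attach the most frequent matching habit to the task."""
    if not entries:
        return task  # nothing retrieved, so leave the task unchanged
    (website, text), _ = Counter(
        (e["website"], e["text"]) for e in entries
    ).most_common(1)[0]
    return f"{task} [context: user previously {text} on {website}]"


history = [
    {"website": "freshmart.example", "text": "ordered oat milk"},
    {"website": "freshmart.example", "text": "ordered oat milk"},
    {"website": "skyfare.example", "text": "searched flights to Lisbon"},
]
task = "Order my usual milk"
personalized = generate(task, retrieve(plan(task), history))
print(personalized)
# Order my usual milk [context: user previously ordered oat milk on freshmart.example]
```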

History Access Schemes

On-demand: Agent accesses user history dynamically during execution
Pre-execution: Agent retrieves all relevant histories before execution
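
The two schemes differ mainly in when the history lookup happens. The following sketch contrasts them under simplifying assumptions: steps are known in advance, each step declares a hypothetical `needs` keyword, and retrieval is a substring match. None of these details come from the source.

```python
def retrieve(history, keyword):
    """Toy retrieval: history entries that mention the keyword."""
    return [h for h in history if keyword in h]


def run_pre_execution(steps, history):
    """Fetch everything that might be relevant once, then execute with fixed context."""
    keywords = {s["needs"] for s in steps if s["needs"]}
    context = {k: retrieve(history, k) for k in keywords}
    return [(s["action"], context.get(s["needs"], [])) for s in steps]


def run_on_demand(steps, history):
    """Query the history only at the step where the need arises."""
    trace = []
    for s in steps:
        found = retrieve(history, s["needs"]) if s["needs"] else []
        trace.append((s["action"], found))
    return trace


history = [
    "ordered oat milk on freshmart.example",
    "searched flights to Lisbon",
]
steps = [
    {"action": "open grocery site", "needs": "milk"},
    {"action": "checkout", "needs": None},
]
pre = run_pre_execution(steps, history)
live = run_on_demand(steps, history)
```

In this toy both traces are identical because every need is known up front. With a real agent, pre-execution has to anticipate what will be needed before seeing any page, while on-demand access can react to what the agent encounters during execution.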

Results

Without user history, all agents fail completely on ambiguous queries (0% success rate)
Even with user history, performance improves only marginally (13% success rate at best)
Proprietary models (o3, GPT-4.1) outperform open-source models (Qwen3-80B, Llama-3.3-70B)
Task completion alone cannot capture personalization capability: agents can succeed at personalization but fail at navigation, or vice versa

Error Analysis

Four key error types:

Redundant History Access: Agent generates underspecified personalization queries, leading to unsuccessful retrieval and repeated attempts.
Personalization Hallucination: Agent fabricates user information without accessing relevant histories.
History Retrieval Failure: Agent fails to identify essential information in the user histories.
History Utilization Failure: Agent fails to properly apply the retrieved user histories during execution.
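
The four error types can be read as a decision sequence over a failed trajectory. The sketch below is one hedged interpretation: the `TraceSummary` fields, the query threshold, and the order of the checks are assumptions for illustration, not the paper's annotation procedure.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    REDUNDANT_HISTORY_ACCESS = "Redundant History Access"
    PERSONALIZATION_HALLUCINATION = "Personalization Hallucination"
    HISTORY_RETRIEVAL_FAILURE = "History Retrieval Failure"
    HISTORY_UTILIZATION_FAILURE = "History Utilization Failure"


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical summary of a trajectory that failed at personalization."""
    history_queries: int      # how many times the agent queried the history
    relevant_retrieved: bool  # whether the essential entries were found
    applied_in_actions: bool  # whether retrieved entries shaped the actions


def classify(trace, max_queries=3):
    """Map a failed trajectory to an error type (check order is an assumption)."""
    if trace.history_queries == 0:
        # Personal details were supplied without consulting the history.
        return ErrorType.PERSONALIZATION_HALLUCINATION
    if not trace.relevant_retrieved:
        if trace.history_queries > max_queries:
            # Repeated unsuccessful attempts, e.g. from underspecified queries.
            return ErrorType.REDUNDANT_HISTORY_ACCESS
        return ErrorType.HISTORY_RETRIEVAL_FAILURE
    if not trace.applied_in_actions:
        return ErrorType.HISTORY_UTILIZATION_FAILURE
    return None  # no personalization error detected by these checks
```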

Conclusion

Persona2Web is the first benchmark for evaluating personalized web agents on the real open web
Findings reveal fundamental gaps in current personalization capabilities, highlighting the need for methods that can effectively leverage user context
Provides a strong foundation for advancing personalization in web agents

Impact Statement

This work relies on synthetically generated user data; personalized agents deployed in practice raise privacy considerations
Encourag

Source

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Preprint, arXiv (cs.AI), 2/20/2026