Story
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Key takeaway
Researchers introduced a benchmark for web agents that must interpret ambiguous user queries by drawing on the user's browsing history, a capability that could make web agents more personalized and context-aware.
Quick Explainer
Persona2Web aims to benchmark the capability of personalized web agents to infer user context and preferences from their browsing histories, rather than relying on explicit instructions. It constructs realistic user profiles and ambiguous query sets that challenge agents to personalize their responses. The key components are a personalization module with planner, retriever, and generator sub-modules, and a reasoning-aware evaluation that distinguishes personalization failures from navigation failures. Persona2Web is the first benchmark of its kind, highlighting fundamental gaps in current personalization capabilities and providing a foundation for advancing personalized web agents that can effectively leverage user context.
Deep Dive
Introduction
- Large language models have advanced web agents, yet current agents lack personalization capabilities
- Users rarely specify every detail of their intent, so practical web agents must infer user preferences and contexts
- Existing benchmarks fail to provide realistic user context or handle ambiguous queries that require personalization
Methodology
User History
- Rigorously constructed user histories reveal preferences implicitly over long time spans, rather than providing them explicitly
- Profiles include demographic information and domain-specific preferences across 21 web domains
- Event seeds define recurring activity patterns based on user preferences and routines
Ambiguous Query Sets
- Follow the "clarify-to-personalize" principle: explicit details are intentionally masked, so agents must infer the missing context from user history
- Query sets have 3 levels: Level 0 (clear), Level 1 (preference explicit, website masked), Level 2 (both masked)
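The three levels can be pictured as progressive masking of one fully specified query. The following is a minimal sketch under stated assumptions: the `Query` fields, the `mask_query` helper, and the example values are hypothetical illustrations, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical query record; None marks a masked detail."""
    action: str
    preference: Optional[str]
    website: Optional[str]


def mask_query(query: Query, level: int) -> Query:
    """Return the query as it would appear at the given ambiguity level."""
    if level == 0:
        # Level 0: clear query, nothing masked.
        return query
    if level == 1:
        # Level 1: preference stays explicit, website is masked.
        return Query(query.action, query.preference, None)
    if level == 2:
        # Level 2: both preference and website are masked.
        return Query(query.action, None, None)
    raise ValueError(f"unknown level: {level}")


def missing_fields(query: Query) -> List[str]:
    """List the details an agent would have to infer from user history."""
    return [name for name in ("preference", "website")
            if getattr(query, name) is None]
```

For example, masking a query to Level 2 leaves only the action, so `missing_fields` reports both the preference and the website as details to recover from history.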
Reasoning-aware Evaluation
- Evaluates personalization, intent satisfaction, and success rate using structured rubrics
- Distinguishes personalization failures from navigation failures by examining reasoning traces
- Uses GPT-5-mini as an LLM judge to score agent trajectories
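To illustrate why separating the two failure modes matters, here is a small sketch that assumes the judge's rubric output has already been reduced to two pass/fail flags. The `JudgedTrajectory` fields and the label names are hypothetical; the actual rubrics are structured and more fine-grained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedTrajectory:
    """Hypothetical reduction of a judge's verdict to two flags."""
    personalization_ok: bool  # reasoning trace resolved the right user context
    navigation_ok: bool       # the agent carried out the actions on the site


def diagnose(t: JudgedTrajectory) -> str:
    """Attribute the outcome to personalization, navigation, or both."""
    if t.personalization_ok and t.navigation_ok:
        return "success"
    if t.personalization_ok:
        return "navigation_failure"
    if t.navigation_ok:
        return "personalization_failure"
    return "both_failed"


def success_rate(trajectories: List[JudgedTrajectory]) -> float:
    """Fraction of trajectories in which both aspects passed."""
    if not trajectories:
        return 0.0
    wins = sum(diagnose(t) == "success" for t in trajectories)
    return wins / len(trajectories)
```

A success-rate number alone would collapse the three non-success labels into one, which is the distinction the reasoning-aware evaluation is meant to preserve.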
Agent Architecture
- Augments generic web agent architectures (AgentOccam, Browser-Use) with a personalization module
- Personalization module has 3 components: planner, retriever, generator
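One way to picture how the three components might fit together is sketched below. The division of labor shown here is an assumption inferred from the component names, and the word-overlap retriever is a stand-in chosen for simplicity; none of the function names or behaviors come from the paper.

```python
from typing import List


def plan(task: str, missing: List[str]) -> List[str]:
    """Planner (assumed role): turn each missing detail into a history query."""
    return [f"{field} {task}" for field in missing]


def retrieve(query: str, history: List[str], top_k: int = 2) -> List[str]:
    """Retriever (assumed role): rank history entries by word overlap."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(entry.lower().split())), entry)
              for entry in history]
    # Keep only entries that share at least one word with the query.
    scored = [(score, entry) for score, entry in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def generate(task: str, evidence: List[str]) -> str:
    """Generator (assumed role): attach retrieved evidence to the task."""
    if not evidence:
        return task
    return task + " | context: " + "; ".join(evidence)


def personalize(task: str, missing: List[str], history: List[str]) -> str:
    """Run planner, retriever, and generator in sequence."""
    evidence: List[str] = []
    for query in plan(task, missing):
        for entry in retrieve(query, history):
            if entry not in evidence:
                evidence.append(entry)
    return generate(task, evidence)
```

In a real system the planner and generator would be LLM calls and the retriever would likely use embeddings; the sketch only shows the data flow between the three stages.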
History Access Schemes
- On-demand: Agent accesses user history dynamically during execution
- Pre-execution: Agent retrieves all relevant histories before execution
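The difference between the two schemes is mainly when history lookups happen. The sketch below assumes a toy `HistoryStore` keyed by topic and a task expressed as a list of steps; both are hypothetical simplifications, not the benchmark's interface.

```python
from typing import Dict, List, Optional, Tuple

# A step is (action name, history key it depends on, or None).
Step = Tuple[str, Optional[str]]


class HistoryStore:
    """Toy user-history store that counts how often it is queried."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = entries
        self.lookups = 0

    def lookup(self, key: str) -> str:
        self.lookups += 1
        return self._entries.get(key, "")


def run_pre_execution(steps: List[Step], store: HistoryStore) -> List[str]:
    """Retrieve all needed history up front, then execute every step."""
    context = {key: store.lookup(key) for _, key in steps if key is not None}
    return [action if key is None else f"{action}({context[key]})"
            for action, key in steps]


def run_on_demand(steps: List[Step], store: HistoryStore) -> List[str]:
    """Query history only at the moment a step needs it."""
    log = []
    for action, key in steps:
        if key is None:
            log.append(action)
        else:
            log.append(f"{action}({store.lookup(key)})")
    return log
```

In this toy setting both schemes produce the same action log. In practice they can diverge: an on-demand agent can adapt its history queries to what it sees on the page, while a pre-execution agent must anticipate everything before it starts.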
Results
- Without user history, all agents fail completely on ambiguous queries (0% success rate)
- Even with user history, performance improves only marginally (13% success rate at best)
- Proprietary models (o3, GPT-4.1) outperform open-source models (Qwen3-80B, Llama-3.3-70B)
- Task completion alone cannot capture personalization capability: agents can succeed at personalization but fail at navigation, or vice versa
Error Analysis
Four key error types:
- Redundant History Access: Agent generates underspecified personalization queries, leading to unsuccessful retrieval and repeated attempts.
- Personalization Hallucination: Agent fabricates user information without accessing relevant histories.
- History Retrieval Failure: Agent fails to identify essential information in the user histories.
- History Utilization Failure: Agent fails to properly apply the retrieved user histories during execution.
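The four error types can be read as a decision procedure over an agent's trace. The sketch below is one possible encoding under stated assumptions: the `Trace` flags, the `max_accesses` threshold, and the order of the checks are hypothetical and do not reflect how the authors annotated errors.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    REDUNDANT_HISTORY_ACCESS = "redundant_history_access"
    PERSONALIZATION_HALLUCINATION = "personalization_hallucination"
    HISTORY_RETRIEVAL_FAILURE = "history_retrieval_failure"
    HISTORY_UTILIZATION_FAILURE = "history_utilization_failure"


@dataclass
class Trace:
    """Hypothetical summary of one agent run."""
    history_accesses: int     # number of history queries issued
    retrieved_relevant: bool  # essential information was found
    applied_retrieved: bool   # retrieved information was used in execution
    used_user_detail: bool    # agent asserted a user-specific detail


def classify(trace: Trace, max_accesses: int = 3) -> Optional[ErrorType]:
    """Map a trace to an error type, or None if no error applies."""
    if trace.history_accesses == 0:
        # A user-specific claim with no history access is a fabrication.
        if trace.used_user_detail:
            return ErrorType.PERSONALIZATION_HALLUCINATION
        return None
    if not trace.retrieved_relevant:
        # Many unsuccessful attempts suggest underspecified queries.
        if trace.history_accesses > max_accesses:
            return ErrorType.REDUNDANT_HISTORY_ACCESS
        return ErrorType.HISTORY_RETRIEVAL_FAILURE
    if not trace.applied_retrieved:
        return ErrorType.HISTORY_UTILIZATION_FAILURE
    return None
```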
Conclusion
- Persona2Web is the first benchmark for evaluating personalized web agents on the real open web
- Findings reveal fundamental gaps in current personalization capabilities, highlighting the need for methods that can effectively leverage user context
- Provides a strong foundation for advancing personalization in web agents
Impact Statement
- This work uses synthetically generated user data; personalized agents deployed on real user histories would raise privacy considerations
- Encourag
