LogoLexBench
  • Leaderboard
  • Data
  • Agents
  • Features
  • Contact
Data

Data

Explore the evaluation data supported by LexBench platform and choose the right one for your needs

AvailableBrowser Use Data

LexBench-Browser

v2.1

A data designed for evaluating AI Agents on Chinese websites, covering 50+ mainstream Chinese websites including JD, Taobao, Xiaohongshu, Bilibili, etc.

386
Tasks
50+
Websites
ZH
Language
GPT-4o
Evaluator

Key Features

T1 Information Retrieval
T2 Website Operations
L3 Security Testing
API Intensive Tasks

Dataset Splits

All
386
All tasks
L1
183
No login required
L2
156
Login required
L3-api
22
API intensive
L3-security
25
Security tests

Online-Mind2Web

v1.0

A real-world web task data covering diverse task scenarios across 100+ English websites

300
Tasks
100+
Websites
EN
Language
GPT-4o
Evaluator

Key Features

Real-world web tasks
Diverse website coverage
Standard evaluation metrics
Detailed step annotations

Dataset Splits

All
300
All tasks
Hard
77
Difficult tasks

BrowseComp

v1.0

A comprehensive browser agent performance data with multi-dimensional evaluation metrics

1266
Tasks
30+
Websites
EN
Language
GPT-4o
Evaluator

Key Features

Comprehensive evaluation
Multi-dimensional metrics
Cross-platform compatible
Real-time updates

Dataset Splits

All
1266
All tasks
Coming SoonMore evaluation scenarios coming soon
Coming Soon

Computer Use Data

Desktop/System Agent: Evaluate agents on OS-level automation, cross-application workflows, GUI interactions, and file system operations.

Coming Soon

Phone Use Data

Mobile Agent: Evaluate agents on mobile platforms (Android/iOS), including touch interactions, app switching, and multi-app workflows.

Coming Soon

Coding Agent Data

Code Generation Agent: Evaluate agents on code writing, debugging, refactoring, and software engineering capabilities.

How to Use

Start your evaluation journey in three steps

1
Select Data

Choose the right data and dataset split for your evaluation needs

2
Configure Agent

Select the agent to evaluate and configure run parameters

3
View Results

View detailed evaluation reports and visual analysis after completion

Security Testing Note

LexBench-Browser includes a dark industry security test set to evaluate AI Agent's security awareness and legal compliance. Security tests use reverse scoring (100 = completely refused to execute malicious requests, 0 = executed malicious tasks) to help identify potential security risks.

LogoLexBench

Professional AI Agent Evaluation Platform

GitHubGitHubTwitterX (Twitter)BlueskyBlueskyMastodonDiscordYouTubeYouTubeLinkedInEmail
Evaluation
  • Leaderboard
  • Data
  • Agents
Resources
  • Blog
  • Documentation
  • Changelog
  • Roadmap
Company
  • About
  • Contact
  • Waitlist
Legal
  • Cookie Policy
  • Privacy Policy
  • Terms of Service
© 2026 LexBench All Rights Reserved.