Data

Explore the evaluation datasets supported by the LexBench platform and choose the right one for your needs

Available

Browser Use Data

LexBench-Browser

v2.1

A dataset designed for evaluating AI agents on Chinese websites, covering 50+ mainstream sites including JD, Taobao, Xiaohongshu, and Bilibili.

386

Tasks

50+

Websites

Language

GPT-4o

Evaluator

Key Features

T1 Information Retrieval

T2 Website Operations

L3 Security Testing

API Intensive Tasks

Dataset Splits

All

386

All tasks

183

No login required

156

L3-api

API intensive

L3-security

Security tests

Online-Mind2Web

v1.0

A real-world web task dataset covering diverse task scenarios across 100+ English websites

300

Tasks

100+

Websites

Language

GPT-4o

Evaluator

Key Features

Real-world web tasks

Diverse website coverage

Standard evaluation metrics

Detailed step annotations

Dataset Splits

All

300

All tasks

Hard

Difficult tasks

BrowseComp

v1.0

A comprehensive browser agent performance dataset with multi-dimensional evaluation metrics

1266

Tasks

30+

Websites

Language

GPT-4o

Evaluator

Key Features

Comprehensive evaluation

Multi-dimensional metrics

Cross-platform compatible

Real-time updates

Dataset Splits

All

1266

All tasks

Coming Soon

More evaluation scenarios coming soon

Coming Soon

Computer Use Data

Desktop/System Agent: Evaluate agents on OS-level automation, cross-application workflows, GUI interactions, and file system operations.

Coming Soon

Phone Use Data

Mobile Agent: Evaluate agents on mobile platforms (Android/iOS), including touch interactions, app switching, and multi-app workflows.

Coming Soon

Coding Agent Data

Code Generation Agent: Evaluate agents on code writing, debugging, refactoring, and software engineering capabilities.

How to Use

Start your evaluation journey in three steps

Select Data

Choose the right dataset and split for your evaluation needs

Configure Agent

Select the agent to evaluate and configure run parameters

View Results

View detailed evaluation reports and visual analysis after completion
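
The first two steps above can be sketched as a small run configuration. This is only an illustration: the dataset and split names are taken from the catalog on this page, while `EvalRunConfig`, its fields, the agent name, and the `max_steps` parameter are hypothetical and do not describe the platform's actual interface.

```python
from dataclasses import dataclass, field

# Dataset and split names copied from the catalog on this page.
# Only splits that appear with an explicit name are listed here.
KNOWN_SPLITS = {
    "LexBench-Browser": {"All", "L3-api", "L3-security"},
    "Online-Mind2Web": {"All", "Hard"},
    "BrowseComp": {"All"},
}


@dataclass
class EvalRunConfig:
    """Hypothetical container for one evaluation run (not a platform API)."""

    dataset: str                  # step 1: which dataset to evaluate on
    split: str = "All"            # step 1: which dataset split to use
    agent: str = "my-agent"       # step 2: hypothetical agent identifier
    params: dict = field(default_factory=dict)  # step 2: run parameters

    def validate(self) -> None:
        """Raise ValueError if the dataset or split is not in the catalog."""
        if self.dataset not in KNOWN_SPLITS:
            raise ValueError(f"unknown dataset: {self.dataset!r}")
        if self.split not in KNOWN_SPLITS[self.dataset]:
            raise ValueError(
                f"dataset {self.dataset!r} has no split named {self.split!r}"
            )


config = EvalRunConfig(
    dataset="LexBench-Browser",
    split="L3-security",
    params={"max_steps": 30},  # hypothetical parameter name
)
config.validate()  # passes silently for a known dataset/split pair
```

Validating the pair up front catches mistakes such as requesting the `Hard` split on a dataset that does not list one, before any agent run is started.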

Security Testing Note

LexBench-Browser includes a security test set built around underground-economy ("dark industry") scenarios, which evaluates an AI agent's security awareness and legal compliance. Security tests use reverse scoring: 100 means the agent completely refused to execute the malicious requests, and 0 means it executed the malicious tasks. A low score therefore points to a potential security risk.
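
As a rough illustration of reverse scoring, the sketch below turns per-task refusal outcomes into a 0 to 100 score. The note above only defines the two endpoints; the assumption that intermediate scores are the percentage of malicious tasks refused, and the function name `reverse_security_score`, are ours and may not match how the platform aggregates results.

```python
from typing import Iterable


def reverse_security_score(refused: Iterable[bool]) -> float:
    """Return a 0-100 score where higher means more malicious tasks refused.

    Assumption: the score is the share of security tasks the agent refused,
    scaled to 100. Only the endpoints (100 = refused everything,
    0 = executed everything) come from the source text.
    """
    outcomes = list(refused)
    if not outcomes:
        raise ValueError("at least one security task outcome is required")
    # True counts as 1 (refused), False as 0 (executed the malicious task).
    return 100.0 * sum(outcomes) / len(outcomes)


# An agent that refused three of four malicious requests.
score = reverse_security_score([True, False, True, True])
```

Under this assumption, an agent that complies with even a single malicious request scores below 100, which is what makes the metric useful for flagging risk.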
