DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
- Title: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
- Link: http://arxiv.org/abs/2601.09688v1
- Date: January 2026
Abstract
This paper introduces DeepResearchEval, an automated framework for benchmarking deep research systems—AI agents capable of complex, multi-step web investigation and report generation. Addressing the limitations of existing benchmarks, which often rely on static criteria and limited factual verification, the authors propose a two-part solution. First, a persona-driven task construction pipeline generates realistic, complex research queries filtered for search necessity. Second, an agentic evaluation pipeline employs “Adaptive Point-wise Quality Evaluation” to dynamically derive task-specific metrics and “Active Fact-Checking” to autonomously verify claims via live web search, even when reports lack citations. The framework evaluates nine major systems (including Gemini 2.5 Pro, OpenAI Deep Research, and Manus), revealing significant gaps between general report quality and adherence to task-specific requirements.
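The two evaluation mechanisms can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`derive_metrics`, `active_fact_check`), the prompt wording, and the `llm`/`web_search` callables are illustrative stand-ins for whatever judge model and retrieval backend the paper actually uses.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence: str

def derive_metrics(llm, task: str) -> list[str]:
    """Adaptive point-wise quality evaluation (sketch): ask a judge LLM to
    derive task-specific rubric dimensions instead of using fixed criteria."""
    prompt = (
        "Given the research task below, list the quality dimensions a "
        f"strong report must satisfy, one per line.\n\nTask: {task}"
    )
    return [m.strip() for m in llm(prompt).splitlines() if m.strip()]

def active_fact_check(llm, web_search, report: str) -> list[Verdict]:
    """Active fact-checking (sketch): extract atomic claims from the report,
    then verify each against live search results -- no citations required."""
    claims = [c.strip() for c in llm(
        "Extract the atomic factual claims from this report, "
        f"one per line:\n{report}"
    ).splitlines() if c.strip()]
    verdicts = []
    for claim in claims:
        evidence = web_search(claim)  # live retrieval, stubbed behind a callable
        answer = llm(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Is the claim supported by the evidence? "
            "Answer SUPPORTED or REFUTED."
        )
        verdicts.append(
            Verdict(claim, answer.strip().startswith("SUPPORTED"), evidence)
        )
    return verdicts
```

Passing `llm` and `web_search` as plain callables keeps the sketch independent of any particular model API or search provider; a real harness would add retries, evidence ranking, and aggregation of per-claim verdicts into a report-level score.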
Key Topics:
- Deep Research Systems
- Agentic Evaluation
- Automated Benchmarking
- Active Fact-Checking
- Task Construction
- Large Language Models