20 January 2026


DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

  • Title: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (Jan 2026)
  • Link: http://arxiv.org/abs/2601.09688v1
  • Date: January 2026

Abstract

This paper introduces DeepResearchEval, an automated framework for benchmarking deep research systems: AI agents capable of complex, multi-step web investigation and report generation. Addressing the limitations of existing benchmarks, which often rely on static criteria and limited factual verification, the authors propose a two-part solution. First, a persona-driven task construction pipeline generates realistic, complex research queries filtered for search necessity. Second, an agentic evaluation pipeline employs “Adaptive Point-wise Quality Evaluation” to dynamically derive task-specific metrics and “Active Fact-Checking” to autonomously verify claims via live web search, even when the report provides no citations. The framework evaluates nine major systems (including Gemini 2.5 Pro, OpenAI Deep Research, and Manus), revealing significant gaps between general report quality and task-specific requirements.
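The abstract names the two evaluation stages but does not spell out an API, so the Python sketch below only illustrates how such a pipeline could be wired together. The function `evaluate_report` and the injected `llm` and `search` callables are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Dict

def evaluate_report(
    task: str,
    report: str,
    llm: Callable[[str], str],     # any text-in/text-out LLM call (assumed)
    search: Callable[[str], str],  # any live web-search function (assumed)
) -> Dict[str, dict]:
    """Sketch of a two-stage agentic evaluation like the one described."""
    # Stage 1: Adaptive Point-wise Quality Evaluation -- derive criteria
    # from the task itself instead of applying a fixed, static rubric.
    criteria = [
        c for c in llm(
            "List the criteria a report answering this research task "
            f"must satisfy, one per line:\n{task}"
        ).splitlines() if c.strip()
    ]
    quality = {
        c: llm(f"Rate 1-5 how well the report satisfies '{c}':\n{report}")
        for c in criteria
    }

    # Stage 2: Active Fact-Checking -- extract claims and verify each one
    # against live search results, even when the report cites no source.
    claims = [
        s for s in llm(
            "Extract the checkable factual claims from this report, "
            f"one per line:\n{report}"
        ).splitlines() if s.strip()
    ]
    facts = {}
    for claim in claims:
        evidence = search(claim)  # autonomous query; no citation required
        facts[claim] = llm(
            f"Evidence:\n{evidence}\n\nIs the claim '{claim}' supported? "
            "Answer yes, no, or unverifiable."
        )
    return {"quality": quality, "facts": facts}
```

In the actual framework both stages are agentic, meaning the evaluator plans its own follow-up queries; the injected callables here merely stand in for whatever model and search backend a reimplementation would choose.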

Key Topics:

  • Deep Research Systems
  • Agentic Evaluation
  • Automated Benchmarking
  • Active Fact-Checking
  • Task Construction
  • Large Language Models