20 January 2026


DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

  • Title: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (Jan 2026)
  • Link: http://arxiv.org/abs/2601.09688v1
  • Date: January 2026

Abstract

This paper introduces DeepResearchEval, an automated framework for benchmarking deep research systems: AI agents capable of complex, multi-step web investigation and report generation. Addressing the limitations of existing benchmarks, which often rely on static criteria and limited factual verification, the authors propose a two-part solution. First, a persona-driven task construction pipeline generates realistic, complex research queries filtered for search necessity. Second, an agentic evaluation pipeline employs “Adaptive Point-wise Quality Evaluation” to dynamically derive task-specific metrics and “Active Fact-Checking” to autonomously verify claims via live web search, even when the report provides no citations. The framework evaluates nine major systems (including Gemini 2.5 Pro, OpenAI Deep Research, and Manus), revealing significant gaps between general report quality and task-specific requirements.
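The abstract names the two evaluation stages but does not spell out an API, so the Python sketch below only illustrates how such a pipeline could be wired together. The function `evaluate_report` and the injected `llm` and `search` callables are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Dict

def evaluate_report(
    task: str,
    report: str,
    llm: Callable[[str], str],     # any text-in/text-out LLM call (assumed)
    search: Callable[[str], str],  # any live web-search function (assumed)
) -> Dict[str, dict]:
    """Sketch of a two-stage agentic evaluation like the one described."""
    # Stage 1: Adaptive Point-wise Quality Evaluation -- derive criteria
    # from the task itself instead of applying a fixed, static rubric.
    criteria = [
        c for c in llm(
            "List the criteria a report answering this research task "
            f"must satisfy, one per line:\n{task}"
        ).splitlines() if c.strip()
    ]
    quality = {
        c: llm(f"Rate 1-5 how well the report satisfies '{c}':\n{report}")
        for c in criteria
    }

    # Stage 2: Active Fact-Checking -- extract claims and verify each one
    # against live search results, even when the report cites no source.
    claims = [
        s for s in llm(
            "Extract the checkable factual claims from this report, "
            f"one per line:\n{report}"
        ).splitlines() if s.strip()
    ]
    facts = {}
    for claim in claims:
        evidence = search(claim)  # autonomous query; no citation required
        facts[claim] = llm(
            f"Evidence:\n{evidence}\n\nIs the claim '{claim}' supported? "
            "Answer yes, no, or unverifiable."
        )
    return {"quality": quality, "facts": facts}
```

In the actual framework both stages are agentic, meaning the evaluator plans its own follow-up queries; the injected callables here merely stand in for whatever model and search backend a reimplementation would choose.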

Key Topics:

  • Deep Research Systems
  • Agentic Evaluation
  • Automated Benchmarking
  • Active Fact-Checking
  • Task Construction
  • Large Language Models