Arena has launched a new leaderboard that ranks AI models on their performance in real-world tasks rather than on traditional benchmark questions. It evaluates agents from top labs, including OpenAI, Anthropic, and Google DeepMind, using data from live user interactions involving complex workflows such as coding, research, and document creation, totalling over 300,000 tasks. The leaderboard assesses performance across five key metrics, including task success and user feedback, and emphasizes how well agents handle messy, iterative work and ongoing user corrections. Currently, GPT-5.5 High tops the leaderboard with a net improvement of 10.7%, followed by Claude Opus 4.7 Thinking and GPT-5.4 High.
Zai: Zai is the organization behind the GLM-5.1 model, which appears on the Agent Arena leaderboard. It provides models deployed via platforms like SiliconFlow for agent tasks.
Arena: Arena is the platform behind Agent Arena, focused on real-world evaluations of AI agents through live user sessions on arena.ai. It collects data from millions of in-the-wild interactions where users complete actual tasks with agentic tools. The company released its first Agent Arena leaderboard to rank models based on performance in complex workflows.
OpenAI: OpenAI develops advanced AI models including the GPT series. Its GPT-5.5 High variant leads the Agent Arena leaderboard for agentic performance. OpenAI is one of the top labs represented in the new real-world agent evaluation.
GLM-5.1: GLM-5.1 is a model from Zai that ranks on the Agent Arena leaderboard, demonstrating competitive agentic capabilities in live sessions.
Anthropic: Anthropic builds Claude models designed for safe and capable AI assistance. Its Claude Opus 4.7 Thinking ranks second on the Agent Arena leaderboard. The company contributes multiple high-performing models to the agentic evaluation.
Kimi-K2.6: Kimi-K2.6 is a model from Moonshot AI (Kimi_Moonshot) that appears on the Agent Arena leaderboard among the evaluated agent systems.
GPT-5.4 High: GPT-5.4 High is an OpenAI model placing third on the Agent Arena leaderboard for its effectiveness in handling complex, multi-step agent workflows.
GPT-5.5 High: GPT-5.5 High is an OpenAI model that tops the Agent Arena leaderboard with strong performance across task success, steerability, and error recovery signals.
Gemini-3.1 Pro: Gemini-3.1 Pro is a Google DeepMind model featured on the Agent Arena leaderboard for its performance with tools like web search and terminal commands.
Google DeepMind: Google DeepMind develops the Gemini family of AI models. Its Gemini-3.1 Pro ranks among the top entries on the Agent Arena leaderboard for orchestrating tools in real workflows.
Claude Opus 4.7 Thinking: Claude Opus 4.7 Thinking is an Anthropic model ranking second on the Agent Arena leaderboard, excelling in real-world agentic tasks involving user corrections and tool use.
Top Labs: Leading AI organizations including OpenAI, Anthropic, and Google DeepMind contribute models to the Agent Arena rankings for agentic performance.
Agent Evaluation: Arena’s leaderboard evaluates AI agents using causal tracing on live user sessions rather than isolated benchmarks.
Real-world Tasks: The system measures agents on messy, iterative work such as coding, research, document creation, and terminal operations with ongoing user feedback.
