AA-AgentPerf benchmark releases initial results for DeepSeek V4 Pro

Today, a new agentic inference benchmark, AA-AgentPerf, was introduced with initial results for DeepSeek V4 Pro across several hardware platforms, including NVIDIA’s Blackwell and AMD’s MI355X. The benchmark evaluates power efficiency on real production workloads with optimizations such as KV cache reuse, and reports performance in Agents per Megawatt. Early findings indicate that NVIDIA’s GB300 system is approximately three times more power-efficient than single-node B300 configurations. The results also point to significant gains in inference efficiency for Blackwell over the previous Hopper generation, which matters as AI infrastructure increasingly prioritizes power efficiency when deploying agents.
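
To make the Agents per Megawatt metric concrete, here is a minimal sketch of how such a figure could be computed. It assumes the metric is the number of concurrently served agents divided by system power draw in megawatts; the benchmark's exact definition is not given here. The function name and all input numbers are hypothetical placeholders, not benchmark results.

```python
def agents_per_megawatt(concurrent_agents: float, power_watts: float) -> float:
    """Assumed definition: concurrent agents sustained per megawatt drawn."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    # Multiply first so the watts-to-megawatts conversion stays a single division.
    return concurrent_agents * 1_000_000 / power_watts


# Hypothetical systems for illustration only (not measured values).
system_a = agents_per_megawatt(concurrent_agents=300, power_watts=120_000)
system_b = agents_per_megawatt(concurrent_agents=40, power_watts=48_000)

# Under this assumed definition, "N times more power-efficient" means
# the ratio of the two agents-per-megawatt figures is about N.
ratio = system_a / system_b
print(f"A: {system_a:.0f} agents/MW, B: {system_b:.0f} agents/MW, ratio: {ratio:.2f}")
```

Normalizing by power rather than by GPU count lets a rack-scale system and a single node be compared on the same footing, which is the kind of comparison reported here.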

AMD: AMD develops high-performance GPUs for AI and computing workloads, including the MI355X accelerator, and competes directly in the inference market. The benchmark results include AMD hardware and identify configuration opportunities for improving its performance on DeepSeek V4 Pro.
NVIDIA: NVIDIA designs and manufactures GPUs and AI accelerators, including the Blackwell architecture used in systems like GB300 and B300. Its hardware is positioned as a leader in power-efficient inference. In the AA-AgentPerf evaluation, NVIDIA platforms achieve the top results against competing hardware.
AA-AgentPerf: AA-AgentPerf is a benchmark built to evaluate agentic inference using real long-context coding trajectories and production optimizations such as KV cache reuse and speculative decoding, reflecting the workloads agents encounter in practice. This news announces its initial public results, positioning it as a new standard for measuring inference efficiency on models like DeepSeek V4 Pro.
DeepSeek V4 Pro: DeepSeek V4 Pro is an advanced AI model optimized for complex, multi-turn agentic tasks with extended context lengths. It supports production inference techniques including disaggregation and speculative decoding. The model is central to the news as the first one tested under AA-AgentPerf across multiple hardware platforms.

AI Infrastructure: Power efficiency is a primary consideration for AI hardware providers when scaling agent deployments.
Benchmarking Trends: Agentic AI evaluations are shifting toward real production workloads and optimizations instead of synthetic queries.
Hardware Competition: NVIDIA Blackwell systems show generational gains in inference efficiency over prior architectures.