A recent paper from Meta reports that AI coding agents, such as Claude 4.5 Opus, perform better when they draw on concise summaries of prior attempts than when they rely on raw execution logs. The study argues that test-time scaling depends less on the number of attempts and more on how effectively an agent stores and recalls its experience. On two difficult benchmarks, the coding agent’s score rose from 70.9% to 77.6% on SWE-Bench Verified (a gain of 6.7 percentage points) and from 46.9% to 59.1% on Terminal-Bench v2.0 (a gain of 12.2 percentage points), which points to the value of structured memory in complex tasks.

Meta: Meta Platforms is a technology company with substantial investments in artificial intelligence research and development across its labs. Researchers at Meta authored the paper, which explores improvements to agentic coding systems. The work focuses on using compact summaries of past attempts to improve test-time scaling on long-horizon coding tasks, rather than relying solely on additional generations.

Agent Memory: AI coding agents benefit from structured representations of prior attempts instead of raw execution logs when tackling complex, multi-step problems.
Test-Time Scaling: The primary bottleneck in scaling test-time compute for coding agents has shifted from generating more attempts to effectively storing and reusing experience from previous attempts.
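
To make the idea of structured memory concrete, here is a minimal Python sketch of replacing a raw execution log with a compact summary before the next attempt. It is an illustration under stated assumptions, not the paper's method: the `AttemptSummary` schema, the `summarize_attempt` and `build_context` helpers, and the last-line heuristic are all hypothetical, and a real agent would more likely ask a model to write the summary.

```python
from dataclasses import dataclass


@dataclass
class AttemptSummary:
    """Compact record of one prior attempt (hypothetical schema)."""
    approach: str  # what the agent tried
    outcome: str   # "success" or "failure"
    lesson: str    # the one detail worth carrying forward


def summarize_attempt(approach: str, raw_log: str) -> AttemptSummary:
    """Reduce a raw execution log to a short structured summary.

    Placeholder heuristic (an assumption for this sketch): keep only the
    last non-empty log line and classify the outcome from its wording.
    """
    lines = [line.strip() for line in raw_log.splitlines() if line.strip()]
    last_line = lines[-1] if lines else "no output"
    lowered = last_line.lower()
    passed = "passed" in lowered and "failed" not in lowered
    return AttemptSummary(
        approach=approach,
        outcome="success" if passed else "failure",
        lesson=last_line,
    )


def build_context(task: str, memory: list[AttemptSummary]) -> str:
    """Render the task plus summarized attempts as context for the next try."""
    parts = [f"Task: {task}", "Prior attempts:"]
    for i, s in enumerate(memory, start=1):
        parts.append(f"{i}. [{s.outcome}] {s.approach} -> {s.lesson}")
    return "\n".join(parts)


# A long, mostly uninformative log standing in for a real test run.
raw_log = "\n".join(
    ["collecting tests ..."]
    + [f"test_case_{i} ... ok" for i in range(200)]
    + ["FAILED test_parse_empty - KeyError: 'name'", "1 failed, 200 passed"]
)

memory = [summarize_attempt("patch parser to skip blank rows", raw_log)]
context = build_context("fix the CSV parser crash", memory)
print(context)
```

The design point is that the next attempt sees a few lines of distilled experience instead of hundreds of lines of log output, so the context budget goes toward what was tried and what was learned.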