Cursor has developed a multi-agent system that autonomously builds and maintains complex software, and recently collaborated with NVIDIA to put it to work optimizing CUDA kernels. Over a three-week run, the system achieved a 38% geometric-mean speedup across 235 kernel problems, far outpacing traditional engineering efforts that typically take months or years. The work matters because CUDA kernels underlie the software stack for AI model training and inference: faster kernels mean better GPU utilization, lower latency, and lower operating costs.
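A geometric mean is the standard way to aggregate per-kernel speedups like the 38% figure above, because it weights relative improvements evenly rather than letting one large outlier dominate. A minimal sketch (with made-up illustrative speedup values, not the actual benchmark results):

```python
import math

# Hypothetical per-kernel speedups (baseline time / optimized time);
# illustrative values only, not the real 235-problem results.
speedups = [1.10, 1.52, 1.31, 1.05, 1.95, 1.27]

# Geometric mean: the n-th root of the product of per-kernel speedups,
# computed in log space for numerical stability.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# A geomean of 1.38x would correspond to the reported 38% speedup.
print(f"geomean speedup: {geomean:.2f}x")
```

Computing in log space avoids overflow or underflow when multiplying hundreds of ratios together, which matters at the scale of a 235-problem benchmark.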

Cursor: Cursor is an AI-native code editor forked from Visual Studio Code, designed to enhance developer productivity through deep AI integration for tasks like code generation, refactoring, and debugging. The company has developed a multi-agent system capable of autonomously building, maintaining, and optimizing complex software, which they tested by partnering with NVIDIA to optimize CUDA kernels.[news] This system achieved significant speedups on GPU kernel problems, validating multi-agent architectures for novel challenges and informing future enhancements to Cursor’s core product.[news]
NVIDIA: NVIDIA is a leading technology company specializing in graphics processing units (GPUs) and AI hardware solutions, powering model training and inference workloads.[news] NVIDIA collaborated with Cursor on applying a multi-agent system to optimize CUDA kernels for Blackwell 200 GPUs using their SOL-ExecBench benchmark across 235 real-world problems from production models.[news] The partnership demonstrated the system’s ability to deliver substantial performance improvements typically requiring extensive human expertise.[news]
SOL-ExecBench: SOL-ExecBench is a benchmarking tool developed by NVIDIA for generating and evaluating GPU kernel optimization problems drawn from production open-source models.[news] In the Cursor-NVIDIA collaboration, it generated 235 diverse problems spanning LLMs, diffusion, and multimodal models, and benchmarked solutions on 27 Blackwell 200 GPUs against baselines and theoretical hardware limits.[news] It guards against cheating tactics such as caching by invalidating any result that exceeds those theoretical limits.[news]
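The validity check described above can be sketched as follows. This is assumed logic in the spirit of the benchmark's safeguards, not NVIDIA's actual implementation: "SOL" (speed of light) denotes the theoretical minimum runtime a kernel could achieve on the hardware, so any measurement faster than that bound implies cheating (e.g. returning cached results) and is rejected.

```python
def speed_of_light_seconds(bytes_moved: int, mem_bw_bytes_per_s: float) -> float:
    """Theoretical minimum runtime for a memory-bound kernel:
    the time to move its data at full memory bandwidth."""
    return bytes_moved / mem_bw_bytes_per_s

def validate_run(measured_s: float, bytes_moved: int,
                 mem_bw_bytes_per_s: float) -> bool:
    """Reject any measurement that beats the theoretical limit."""
    return measured_s >= speed_of_light_seconds(bytes_moved, mem_bw_bytes_per_s)

# Example with illustrative numbers: moving 1 GiB at 8 TB/s
# can take no less than ~0.134 ms.
print(validate_run(2e-4, 2**30, 8e12))  # plausible time -> True
print(validate_run(1e-5, 2**30, 8e12))  # "impossible" time -> False
```

Real benchmarks would also account for compute-bound roofline limits and launch overheads; the point here is only the shape of the check, not the exact bound.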
Blackwell 200 GPUs: Blackwell 200 GPUs are NVIDIA’s advanced AI GPUs targeted at high-performance computing for model training and inference.[news] Cursor’s multi-agent system optimized CUDA kernels for these GPUs from scratch, achieving speedups across a long tail of problems in a three-week autonomous run.[news] The optimizations involved low-level assembly and novel strategies, outperforming baselines on most benchmarks.[news]

{"Impact": "Faster CUDA kernels improve GPU utilization, reduce latency, and lower costs for AI model serving.", "Benchmark": "SOL-ExecBench was used to generate and benchmark problems from production models like Deepseek, Qwen, and Stable Diffusion.", "Partnership": "Cursor collaborated with NVIDIA to evaluate multi-agent kernel optimization on authentic AI workloads."}