NVIDIA, Microsoft, and OpenAI have unveiled Multipath Reliable Connection (MRC), an RDMA transport protocol designed to speed up multi-GPU AI training by spreading a connection's traffic across multiple network paths. Proven first on NVIDIA Spectrum-X Ethernet hardware, the protocol routes around congestion and network failures without impacting overall performance. MRC is already deployed by OpenAI and Microsoft on their largest AI supercomputers, builds on the existing RoCEv2 RDMA technology, and is available through the Open Compute Project to promote industry-wide adoption for AI networking.
AMD: AMD manufactures CPUs, GPUs, and accelerators for high-performance computing and AI applications. It contributed to the collaborative specification of MRC through the Open Compute Project. The protocol enhances Ethernet-based RDMA for resilient AI networking at scale.
Intel: Intel produces processors, accelerators, and Ethernet components for AI and data center environments. It participated in developing the MRC transport protocol alongside key partners. MRC addresses congestion and failures in large RDMA-based GPU clusters.
NVIDIA: NVIDIA provides GPUs and AI-native Ethernet networking platforms like Spectrum-X designed for massive AI training clusters. It collaborated with OpenAI and Microsoft to prove and optimize MRC first on Spectrum-X hardware. MRC enables resilient multi-path data transfer critical for gigascale AI supercomputers.
OpenAI: OpenAI is an AI research organization developing frontier models requiring enormous GPU clusters. It led the release of MRC, an open RDMA transport protocol, in partnership with industry leaders. MRC is already deployed across OpenAI’s largest supercomputers to handle network failures without interrupting training.
Broadcom: Broadcom develops semiconductors and ASICs for data center networking and connectivity solutions. It joined NVIDIA, OpenAI, and others to create the MRC protocol. The protocol advances multi-path reliable connections over Ethernet for AI supercomputer fabrics.
Microsoft: Microsoft delivers Azure cloud infrastructure supporting large-scale AI workloads and supercomputing. It partnered with OpenAI, NVIDIA, and others to develop and deploy MRC in data centers like Fairwater. The protocol improves network reliability to reduce wasted GPU time during AI training.
Deployment: MRC is deployed in production by OpenAI and Microsoft across their largest AI supercomputers, including sites powered by NVIDIA hardware.
Technology: MRC enhances RoCEv2 RDMA by allowing a single connection to spread traffic across multiple network paths, bypassing congestion and failures in microseconds.
Open Specification: The MRC specification is released through the Open Compute Project, enabling broad industry adoption for AI networking.
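
To make the multipath idea in the Technology note concrete, here is a minimal toy sketch in Python. It is not the MRC specification or any vendor's implementation: the `Path` and `MultipathConnection` classes, the round-robin choice, and the `healthy` and `congested` flags are all assumptions made for illustration. It only shows how one logical connection might spread packets over several paths and steer away from a failed or congested path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Path:
    """One network path between two endpoints (hypothetical model)."""
    path_id: int
    healthy: bool = True      # set to False once a failure is detected
    congested: bool = False   # set to True while congestion is signalled
    packets_sent: int = 0


@dataclass
class MultipathConnection:
    """A single logical connection that spreads packets over several paths."""
    paths: List[Path]
    _next: int = field(default=0, repr=False)

    def usable_paths(self) -> List[Path]:
        # Prefer paths that are healthy and uncongested; fall back to
        # healthy-but-congested paths rather than stalling the connection.
        clear = [p for p in self.paths if p.healthy and not p.congested]
        if clear:
            return clear
        return [p for p in self.paths if p.healthy]

    def send(self) -> int:
        """Choose a path for one packet and return the chosen path_id."""
        candidates = self.usable_paths()
        if not candidates:
            raise RuntimeError("no healthy path available")
        # Simple round-robin over the currently usable paths (an assumption;
        # a real transport would use richer load and congestion feedback).
        path = candidates[self._next % len(candidates)]
        self._next += 1
        path.packets_sent += 1
        return path.path_id


def demo() -> List[int]:
    """Send 8 packets, degrade two paths, send 8 more; return per-path counts."""
    conn = MultipathConnection([Path(i) for i in range(4)])
    for _ in range(8):
        conn.send()
    conn.paths[2].healthy = False   # simulate a link failure
    conn.paths[0].congested = True  # simulate a congestion signal
    for _ in range(8):
        conn.send()
    return [p.packets_sent for p in conn.paths]


if __name__ == "__main__":
    print(demo())
```

In this toy run, traffic is spread evenly over all four paths until path 2 fails and path 0 reports congestion, after which the remaining packets go only to paths 1 and 3. The connection itself never has to be torn down, which is the property the summary attributes to MRC.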
