OpenAI has announced the Multipath Reliable Connection (MRC) networking protocol, developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The protocol aims to make large AI training clusters more efficient and reliable by addressing the synchronization challenges of AI supercomputers, which must move data across record numbers of chips. MRC uses multipath RDMA connections to distribute traffic and recover quickly from failures, improving performance and reducing wasted GPU time. MRC has been released through the Open Compute Project and is available for the broader industry to adopt.
AMD: AMD designs semiconductors including processors and accelerators tailored for AI workloads. It collaborated with OpenAI and other tech leaders to develop the MRC protocol. This partnership advances resilient networking for large-scale AI training environments.
Intel: Intel produces processors and networking technologies for data center and AI applications. It contributed to the MRC protocol’s creation with OpenAI and partners. MRC is intended to work across diverse hardware in large AI training setups.
NVIDIA: NVIDIA builds GPUs and Ethernet networking solutions like Spectrum-X for AI data centers. It partnered on MRC, which was first validated on its platforms. MRC improves multipath traffic distribution for reliable large-cluster AI training.
OpenAI: OpenAI is an artificial intelligence research organization focused on developing advanced models and the infrastructure to train them at massive scale. It partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to release the Multipath Reliable Connection (MRC) networking protocol as an open standard. MRC has been deployed across OpenAI’s largest supercomputers to enable faster and more reliable data synchronization in AI training clusters.
Broadcom: Broadcom develops networking semiconductors and software for mission-critical AI infrastructure. It co-developed MRC to enable rapid failure recovery and consistent performance in massive GPU clusters. Its Thor NICs and Tomahawk switches integrate with MRC for enhanced AI scalability.
Microsoft: Microsoft provides cloud computing and AI platforms through Azure, supporting large-scale model training. It co-developed MRC and implements it in its Fairwater supercomputers used by OpenAI. This effort promotes open standards for efficient AI networking.
Andrew Mayne: Andrew Mayne hosts the OpenAI Podcast, featuring in-depth conversations with OpenAI researchers and builders on AI advancements. A former OpenAI technical staff member, he now invests in AI as a general partner at Zero Shot Fund. He hosts the episode on AI supercomputer networking, interviewing experts on the new MRC protocol.
Greg Poynter: Greg Poynter focuses on maximizing GPU utilization efficiency in compute-intensive environments. He collaborates with OpenAI on advanced supercomputer networking solutions. He joins the OpenAI Podcast to discuss reliable data transfer across vast chip arrays and the industry-wide benefits of the MRC protocol.
Mark Handley: Mark Handley is Professor of Networked Systems at University College London and Tech Lead for Networking at OpenAI, specializing in high-performance networks for AI clusters. He leads efforts on supercomputer networking designs used to train OpenAI’s latest models. In the OpenAI Podcast, he explains the technical challenges of data movement and the MRC protocol’s role in addressing them.
Open Standard: Released via the Open Compute Project, MRC enables the broader industry to adopt improved networking for AI clusters.
Protocol Purpose: MRC addresses synchronization challenges in AI supercomputers by using multipath RDMA connections to distribute traffic and recover from failures quickly.
Deployment Status: MRC powers OpenAI’s frontier model training on supercomputers built on Oracle and Microsoft infrastructure.
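
The idea of distributing traffic across multiple paths and steering around failures can be illustrated with a toy model. This is a minimal sketch, not the MRC specification: the class name `MultipathSender`, the round-robin path selection, and the per-path health flags are all assumptions made for illustration, and a real RDMA transport would also handle acknowledgements, retransmission, and congestion control.

```python
class MultipathSender:
    """Toy sender that spreads packets over several paths (hypothetical model)."""

    def __init__(self, num_paths):
        self.num_paths = num_paths
        self.healthy = [True] * num_paths  # assumed per-path health flag
        self.next_path = 0                 # round-robin cursor

    def mark_failed(self, path):
        # Record that a path is down so later packets avoid it.
        self.healthy[path] = False

    def pick_path(self):
        # Advance round-robin, skipping any path marked as failed.
        for _ in range(self.num_paths):
            path = self.next_path
            self.next_path = (self.next_path + 1) % self.num_paths
            if self.healthy[path]:
                return path
        raise RuntimeError("no healthy paths remain")

    def send(self, packet_ids):
        # Return a mapping of packet id -> chosen path.
        return {pid: self.pick_path() for pid in packet_ids}


sender = MultipathSender(num_paths=4)
before = sender.send(range(8))       # traffic is spread over all four paths
sender.mark_failed(2)                # simulate a link failure on path 2
after = sender.send(range(8, 16))    # traffic continues on the remaining paths
```

In this sketch a failure only removes one path from rotation, so the sender keeps making progress on the others instead of stalling the whole connection.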
