PSEEDR

OpenAI and Hardware Consortium Open-Source MRC Protocol to Overcome 100,000-GPU Cluster Bottlenecks

The Multipath Reliable Connection (MRC) protocol replaces traditional BGP routing with SRv6 to eliminate tail latency in massive AI supercomputers.

· 3 min read · PSEEDR Editorial

On May 6, 2026, OpenAI, in collaboration with a consortium of major hardware providers including AMD, Broadcom, Intel, Microsoft, and NVIDIA, officially open-sourced the Multipath Reliable Connection (MRC) protocol through the Open Compute Project, fundamentally restructuring how massive AI supercomputers handle network traffic.

As artificial intelligence training clusters scale beyond the 100,000-GPU threshold, industry analysis from Converge Digest suggests traditional networking architectures have emerged as the primary bottleneck for training stability and cost efficiency. The MRC release, announced in a joint press release from OpenAI and Broadcom, targets that bottleneck directly. Published through the Open Compute Project (OCP) under an open license, the protocol represents, in Broadcom's characterization, a structural departure from conventional dynamic routing protocols, specifically aimed at the tail latency issues that plague massive compute environments.

According to technical documentation reviewed by Glitchwire, MRC rests on two major shifts in network traffic management: multipath data splitting and SRv6 source routing. Rather than sending data sequentially over a single determined path, MRC sprays packets across hundreds of paths simultaneously; Glitchwire reports these packets are then reassembled at the destination based on memory addresses. To make this possible, the protocol replaces the Border Gateway Protocol (BGP) with SRv6 (Segment Routing over IPv6): the sender encodes the exact network path directly into each packet, so switches need only static forwarding tables. Converge Digest notes this architectural choice reduces network switches to simple forwarding devices, removing the need for them to compute routing tables dynamically.
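
To make the mechanism concrete, here is a minimal Python sketch of the spray-and-reassemble idea: each packet carries a sender-chosen, SRv6-style segment list plus a destination memory offset, so arrival order does not matter. The Packet fields, the round-robin path choice, and all names are illustrative assumptions, not the published MRC wire format.

```python
import os
import random
from dataclasses import dataclass

@dataclass
class Packet:
    # Hypothetical fields for illustration; not the actual MRC packet layout.
    segment_list: list[str]   # SRv6-style source route chosen entirely by the sender
    mem_offset: int           # destination memory offset used for reassembly
    payload: bytes

def spray(message: bytes, paths: list[list[str]], chunk: int = 4096) -> list[Packet]:
    """Split one message across many precomputed paths, round-robin."""
    packets = []
    for i in range(0, len(message), chunk):
        path = paths[(i // chunk) % len(paths)]  # deterministic pick; no per-hop routing decisions
        packets.append(Packet(segment_list=path, mem_offset=i, payload=message[i:i + chunk]))
    return packets

def reassemble(packets: list[Packet], total_len: int) -> bytes:
    """The destination writes each payload at its memory offset, order-independent."""
    buf = bytearray(total_len)
    for p in packets:
        buf[p.mem_offset:p.mem_offset + len(p.payload)] = p.payload
    return bytes(buf)

# Usage: a 256 KiB message sprayed over eight illustrative two-hop paths.
paths = [[f"leaf{i}", f"spine{i}"] for i in range(8)]
msg = os.urandom(256 * 1024)
pkts = spray(msg, paths)
random.shuffle(pkts)                      # simulate wildly out-of-order arrival
assert reassemble(pkts, len(msg)) == msg
```

Because each packet carries both its own route and its own placement address, no two packets need to arrive in order or on the same path, which is what allows the switches to remain stateless forwarding devices.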

This transition enables a far simpler physical infrastructure. Broadcom states that traditional networking requires three to four layers of switches to connect clusters of this magnitude; MRC's multi-plane design lets operators connect over 100,000 GPUs with only two layers. The protocol also provides microsecond-level failover resilience. OpenAI has demonstrated that MRC maintains active training tasks through core switch reboots and frequent link jitter without measurable impact on the workload: according to OpenAI, if a link fails, the system bypasses it in microseconds, with no switch ever recalculating a routing table.
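
A rough sketch of why failover needs no route recomputation, under the assumption (consistent with source routing generally, though not drawn from the MRC specification) that the sender holds a set of precomputed paths and simply stops using any path that traverses a failed element:

```python
class PathSet:
    """Hypothetical sender-side view of all usable source routes to one destination."""

    def __init__(self, paths: list[list[str]]):
        self.paths = paths
        self.healthy = set(range(len(paths)))

    def mark_failed(self, link: str) -> None:
        # Drop every path traversing the failed element: a local, O(paths) update,
        # with no routing-table recomputation anywhere in the fabric.
        self.healthy -= {i for i in self.healthy if link in self.paths[i]}

    def pick(self, seq: int) -> list[str]:
        alive = sorted(self.healthy)
        return self.paths[alive[seq % len(alive)]]

ps = PathSet([["leaf0", "spine0"], ["leaf0", "spine1"], ["leaf1", "spine2"]])
ps.mark_failed("spine1")                    # e.g. a core switch reboot
assert ps.pick(0) == ["leaf0", "spine0"]    # traffic continues on surviving paths
```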

In production, MRC is already operating at an unprecedented scale. As of May 2026, OpenAI confirmed the protocol is actively deployed across its largest NVIDIA GB200 supercomputers. Glitchwire reports this includes the flagship Stargate I data center campus located in Abilene, Texas, which runs on Oracle Cloud Infrastructure (OCI), as well as Microsoft's Fairwater data center. The Stargate initiative represents a broader $500 billion multi-site project across the United States, with the Abilene site serving as the primary operational location built in partnership with Oracle, according to Converge Digest.

The operational impact of MRC is evident in OpenAI's recent product iterations. The organization confirmed in its release that MRC was used to train its latest frontier models, which currently power ChatGPT and the relaunched Codex autonomous coding agent, a tool reintroduced between 2025 and 2026. Converge Digest analysis indicates that as Stargate-class clusters scaled, BGP-based networking and tail latency became the primary bottlenecks for training stability, necessitating precisely this kind of networking intervention.

Despite its advantages, MRC introduces specific hardware dependencies and complexities. Glitchwire notes the protocol requires SRv6-capable network interface cards (NICs) and switches, a strict hardware prerequisite for deployment. Broadcom documentation also shows that splitting and reassembling packets across hundreds of paths increases the computational complexity at the destination memory interface. Converge Digest points out that the performance overhead of this reassembly logic on the GB200 Blackwell NICs remains an open question for broader industry adoption, and Glitchwire adds it is currently unclear whether MRC-compliant firmware will be made available for older-generation hardware such as the NVIDIA H100 or A100 GPUs.
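
As an illustration of the reassembly bookkeeping that Broadcom's documentation alludes to, the sketch below shows one plausible way a destination could track per-chunk arrivals before signaling message completion; the class and its fields are hypothetical, not MRC's actual NIC logic.

```python
class ReassemblyState:
    """Hypothetical per-message bookkeeping at the destination; not MRC's NIC logic."""

    def __init__(self, total_len: int, chunk: int):
        self.chunk = chunk
        self.pending = (total_len + chunk - 1) // chunk  # chunks still in flight
        self.seen = set()

    def on_arrival(self, mem_offset: int) -> bool:
        """Record one chunk; return True once the whole message has landed."""
        idx = mem_offset // self.chunk
        if idx not in self.seen:             # duplicates from retransmits are ignored
            self.seen.add(idx)
            self.pending -= 1
        return self.pending == 0

# Usage: a 16-chunk message; only the final arrival reports completion.
state = ReassemblyState(total_len=16 * 4096, chunk=4096)
done = [state.on_arrival(off) for off in range(0, 16 * 4096, 4096)]
assert done[-1] and not any(done[:-1])
```

Tracking of this kind grows with the number of in-flight chunks rather than the number of paths, but it still has to run at line rate at the memory interface, which is the overhead Converge Digest flags as unquantified.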

By releasing MRC through the OCP, Converge Digest suggests OpenAI and its partners are positioning the protocol against existing high-performance networking standards, including the Ultra Ethernet Consortium (UEC) specifications, NVIDIA's InfiniBand platforms (Quantum-2 and Quantum-3), and RoCE v2. As the industry continues to build infrastructure for next-generation AI models, Glitchwire concludes the open-source availability of MRC provides a tested, production-ready alternative for managing the extreme networking demands of 100,000-GPU clusters.

Key Takeaways

  • OpenAI and a consortium of hardware leaders open-sourced the MRC protocol via the Open Compute Project to address tail latency in 100,000-plus GPU clusters.
  • MRC replaces dynamic BGP routing with SRv6 source routing, allowing packets to be split across hundreds of paths and reassembled at the destination based on memory addresses.
  • The protocol enables a simplified two-layer network architecture and provides microsecond-level failover resilience during hardware reboots.
  • MRC is currently deployed on OpenAI's largest NVIDIA GB200 supercomputers, including the Oracle-partnered Stargate I site in Abilene, Texas, and Microsoft's Fairwater data center.
  • The technology was used to train the latest frontier models powering ChatGPT and the relaunched Codex autonomous coding agent.
