Google has unveiled Decoupled DiLoCo, a new architecture for resilient, flexible AI training across multiple data centers. The approach builds on two earlier innovations: Pathways, which connects accelerator chips asynchronously for flexible data sharing, and DiLoCo, which sharply reduces the bandwidth needed for distributed training. Decoupled DiLoCo keeps training running through hardware failures by isolating disruptions and seamlessly reintegrating offline units. Notably, Google used it to train a 12-billion-parameter model across four U.S. regions over low-bandwidth networking, demonstrating that the approach can combine diverse hardware generations and make full use of available computing resources.
DiLoCo: DiLoCo (Distributed Low-Communication) is Google DeepMind's algorithm for training language models on loosely connected devices: each worker takes many local optimization steps and synchronizes only occasionally, drastically reducing the bandwidth needed between data centers. It is foundational to Decoupled DiLoCo, and the combined system achieves practical scale across distant locations.
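The bandwidth saving comes from that inner/outer split, which is easy to illustrate. Below is a minimal single-process sketch of the DiLoCo-style pattern, not Google's implementation; every name and hyperparameter (H, outer_lr, momentum, the quadratic toy objective) is an illustrative assumption.

```python
import numpy as np

# Toy DiLoCo-style loop: each simulated worker takes H cheap local steps,
# and only the averaged "outer" update would cross data-center links,
# i.e. one exchange per round instead of one per gradient step.
rng = np.random.default_rng(0)
dim, workers, H, rounds = 4, 3, 100, 20
outer_lr, momentum = 0.7, 0.9            # outer Nesterov-style SGD
global_params = rng.normal(size=dim)
velocity = np.zeros(dim)

def local_grad(params, worker):
    # Stand-in for a worker's minibatch gradient on its own data shard:
    # gradient of ||params - target_w||^2 toward a per-worker target.
    target = np.full(dim, float(worker))
    return 2.0 * (params - target)

for r in range(rounds):
    deltas = []
    for w in range(workers):
        params = global_params.copy()          # replicate the global model
        for _ in range(H):                     # H inner steps, no communication
            params -= 0.01 * local_grad(params, w)
        deltas.append(global_params - params)  # the "outer gradient"
    outer_grad = np.mean(deltas, axis=0)       # the only cross-site exchange
    velocity = momentum * velocity + outer_grad
    global_params -= outer_lr * (outer_grad + momentum * velocity)

print("final params:", np.round(global_params, 3))
```

With H inner steps per round, cross-site traffic shrinks by roughly a factor of H compared with step-level data parallelism, which is what makes ordinary inter-datacenter links viable.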
Google: Google is a leading technology company that develops advanced AI infrastructure, including custom TPUs and models like Gemma, through its DeepMind and Research teams. It announced Decoupled DiLoCo, a new architecture for resilient distributed training of frontier AI models across multiple data centers. This innovation builds on prior systems to address synchronization challenges in large-scale training.
Pathways: Pathways is Google’s AI system architecture that connects chips asynchronously for flexible data sharing and multi-task learning. It provides the basis for the decoupled compute islands in Decoupled DiLoCo, and its recent integration enables resilient global training without tight synchronization.
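To make the contrast with lockstep execution concrete, here is a toy sketch of asynchronous dispatch; it is in no way the Pathways API, and names like island_step are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

# A coordinator hands work to independent "compute islands" and consumes
# results as each finishes; no island waits on a global barrier.
def island_step(island_id, step):
    time.sleep(random.uniform(0.01, 0.05))   # islands run at different speeds
    return island_id, step

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(island_step, i, 0) for i in range(4)]
    for fut in as_completed(futures):        # handle results in arrival order
        island_id, step = fut.result()
        print(f"island {island_id} finished step {step}")
```

The design point is simply that slow or distant hardware delays only its own results, not everyone else's, which is the property Decoupled DiLoCo inherits.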
Ionel Gog: Ionel Gog is a researcher at Google DeepMind working on systems for AI training. He contributed to Decoupled DiLoCo, enabling mixed hardware generations. His expertise supports efficient compute utilization in distributed environments.
Jeff Dean: Jeff Dean is Chief Scientist at Google DeepMind and Google Research, leading AI efforts including Gemini. He advised on Decoupled DiLoCo, providing strategic guidance for distributed infrastructure. His foundational work in systems underpins modern AI scaling.
Josef Dean: Josef Dean is a core contributor at Google DeepMind to AI infrastructure projects. He helped build Decoupled DiLoCo for scalable, resilient model training. His work addresses challenges in synchronizing diverse compute resources.
Keith Rush: Keith Rush is a Staff Research Scientist at Google DeepMind working on distributed AI training architectures. He is a core contributor to Decoupled DiLoCo, which supports self-healing during hardware failures. His contributions help scale AI model training over low-bandwidth networks.
Ayush Dubey: Ayush Dubey is a core contributor at Google DeepMind to distributed AI systems. He helped develop Decoupled DiLoCo for resilient training across data centers. His recent focus is on minimizing synchronization issues in large-scale AI infrastructure.
Henry Prior: Henry Prior is a researcher at Google DeepMind working on AI systems. He advised on Decoupled DiLoCo, focusing on infrastructure resilience. His efforts support global compute flexibility.
Nova Fallen: Nova Fallen is a researcher at Google DeepMind focused on distributed systems. She is a core developer of Decoupled DiLoCo, improving training resiliency. Her contributions enable continuous operation despite disruptions.
Paul Barham: Paul Barham is a Google DeepMind researcher specializing in programming systems for AI. He served in an advisory role on Decoupled DiLoCo. His work improves efficiency in distributed training.
Arthur Szlam: Arthur Szlam is a researcher at Google DeepMind focused on AI architectures. He supported the Decoupled DiLoCo project in an advisory capacity. His contributions enhance asynchronous training paradigms.
Chase Hensel: Chase Hensel is a member of Google DeepMind’s team supporting AI research. He contributed to Decoupled DiLoCo development. His role enhances training infrastructure.
Jenny Bishop: Jenny Bishop offers operational support for Google DeepMind’s AI initiatives. She contributed to the implementation and testing of Decoupled DiLoCo. Her work ensures reliable execution of large-scale training experiments.
Nate Keating: Nate Keating provides operational support at Google for AI research projects. He assisted in testing Decoupled DiLoCo, ensuring practical deployment. His role helps validate resilient distributed training systems.
Raia Hadsell: Raia Hadsell is a senior leader at Google DeepMind overseeing research directions. She advised on Decoupled DiLoCo, contributing to fault-tolerant AI infrastructure. Her work spans robust systems for advanced models.
Yani Donchev: Yani Donchev is a researcher at Google DeepMind involved in distributed optimization. He is a lead contributor to Decoupled DiLoCo, advancing asynchronous training techniques. His work supports flexible compute across varied locations and hardware.
Zoltan Egyed: Zoltan Egyed is a researcher providing support at Google DeepMind. He assisted with Decoupled DiLoCo, focusing on distributed setups. His efforts promote hardware-agnostic training.
Daniel Ramage: Daniel Ramage is a researcher at Google DeepMind involved in AI infrastructure. He provided support for Decoupled DiLoCo. His contributions focus on practical distributed training.
Michael Isard: Michael Isard is a researcher at Google DeepMind with a background in distributed computing. He supported Decoupled DiLoCo in an advisory role. His expertise informs low-communication architectures.
Edouard Yvinec: Edouard Yvinec is a member of Google DeepMind’s research team. He provided additional support for Decoupled DiLoCo development. His involvement aids in scalable AI training solutions.
Blake Woodworth: Blake Woodworth is a researcher at Google DeepMind specializing in optimization for distributed learning. He is part of the core team for Decoupled DiLoCo. His contributions enhance fault tolerance in global AI training setups.
Brendan McMahan: Brendan McMahan is a Google DeepMind researcher known for federated learning. He advised on Decoupled DiLoCo for resilient systems. His work extends to scalable optimization.
Zachary Charles: Zachary Charles is a researcher at Google DeepMind with expertise in communication-efficient training methods. He contributed to core aspects of Decoupled DiLoCo and prior DiLoCo scaling research. His efforts enable reliable distributed pre-training of language models.
Zachary Garrett: Zachary Garrett is a core contributor at Google DeepMind to advanced training architectures. He worked on Decoupled DiLoCo for heterogeneous and global compute. His recent efforts promote self-healing in AI pre-training.
Arthur Douillard: Arthur Douillard is a Staff Research Scientist at Google DeepMind specializing in distributed learning systems, including DiLoCo and DiPaCo. He leads the development of Decoupled DiLoCo, enabling resilient AI pre-training across global datacenters with heterogeneous hardware. His recent work focuses on fault-tolerant asynchronous training methods.
Marc’Aurelio Ranzato: Marc’Aurelio Ranzato is a researcher at Google DeepMind specializing in machine learning optimization. He provided advisory support for Decoupled DiLoCo. His expertise in training algorithms informs resilient distributed methods.
Hardware Flexibility: The architecture allows mixing different TPU generations like v6e and v5p without compromising training performance.
Network Requirements: It operates effectively over low-bandwidth wide-area networks using existing internet connectivity between data centers.
Resilience Mechanism: Decoupled DiLoCo self-heals by isolating failed hardware units and seamlessly reintegrating them into training once they recover, as sketched below.
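The resilience behavior described above maps to a simple control loop. The following is a hedged, illustrative sketch, assuming a coordinator that averages updates from responsive islands, drops failed ones from the round, and lets repaired islands rejoin by adopting the current global parameters; none of these names or probabilities come from Google's system.

```python
import random

class Island:
    """One simulated data-center 'island' of accelerators."""
    def __init__(self, name):
        self.name, self.healthy = name, True

    def train_round(self, global_params):
        if random.random() < 0.2:            # simulated hardware fault
            self.healthy = False
            raise RuntimeError(f"{self.name} failed")
        return global_params + 0.1           # stand-in for a local update

def run_round(islands, global_params):
    updates = []
    for isl in (i for i in islands if i.healthy):
        try:
            updates.append(isl.train_round(global_params))
        except RuntimeError:
            pass                             # isolate: skip it this round
    for isl in islands:                      # reintegrate repaired islands
        if not isl.healthy and random.random() < 0.5:
            isl.healthy = True               # rejoins by resyncing to global
    return sum(updates) / len(updates) if updates else global_params

islands = [Island(f"dc{i}") for i in range(4)]
gp = 0.0
for step in range(5):
    gp = run_round(islands, gp)
    print(f"step {step}: global={gp:.2f}, healthy={sum(i.healthy for i in islands)}")
```

Training never blocks on a failed island: the averaged update simply uses whoever is healthy, and a recovered island needs only the latest global parameters to rejoin.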
