BLOG

State-of-the-art in Decentralized Training

This post explores various novel decentralized training approaches and how they can enable effective AI model training across globally distributed GPUs.

Open-source AI development still faces significant challenges in keeping pace with its closed-source competitors. The latter are rapidly expanding their capabilities by deploying co-located H100 clusters, each comprising up to hundreds of thousands of interconnected GPUs, to train their state-of-the-art models. To keep pace with Big Tech's rapid expansion of GPU clusters, the open-source community must overcome the limitations of traditional computing infrastructures and find ways to train on thousands of smaller, distributed GPU clusters across the globe.

At Prime Intellect, we're committed to addressing this gap by building infrastructure for decentralized AI development at scale. Our platform aggregates global compute resources and enable researchers to collaboratively train state-of-the-art models through distributed training across clusters.

This post explores various novel decentralized training approaches and how they can enable effective AI model training across globally distributed GPUs. We will discuss specific techniques, their trade-offs, and our development plans over the course of this year to democratize access to large-scale AI computing resources.

Distributed Training Basics

In a typical setting of distributed AI model training, there are three main methods for training models across multiple GPUs:

  1. Data parallelism: Each device independently performs forward and backward passes on a different shard of data, then aggregates their respective gradients through an all-reduce operation before updating the weights.
  2. Tensor parallelism: This method splits a model across multiple GPUs horizontally. Each weight matrix of the model is divided among the devices, necessitating communication at every layer.
  3. Pipeline parallelism: This technique splits a model across multiple GPUs in a "vertical" way. Each GPU holds a part of a block of layers, and each GPU communicates activations/gradients to the next one in the pipeline.

Each method has its disadvantages: data parallelism alone cannot train large models; tensor parallelism requires a lot of inter-GPU communication; and pipeline parallelism, though effective, is complex to implement and requires advanced scheduling to avoid GPU idle time.

These methods are often combined (along with more advanced optimizations like optimizer sharding, activation recomputation, etc.) to achieve high model FLOPs utilization (MFU).

Paradigm Shift with Decentralized Training.

Decentralized training across globally distributed GPUs introduces a radical shift in how we need to approach distributed AI model training.

In the decentralized training paradigm, we have access to relatively cheap compute power, but communication between instances is costly. For instance, we could harness unused compute (e.g., cheap spot instances) from various sources, but with the drawback that these instances can be located all around the world.

This poses several technical challenges:

  • Slow interconnect: Computing islands in a decentralized training setup could be distributed globally. Inter-node bandwidth can range from as low as 200Mb/s to 100Gb/s. For comparison, cross-node communication in co-located clusters with InfiniBand can go up to 800Gb/s
  • Non-homogeneous hardware: Not all computing islands use the same hardware. Variations may include different brands or generations, different specs (flops/s, limited amount of VRAM on consumer GPUs) and some islands may have more computing power available than others.
  • On/Off ramping of compute: The amount of available compute might not be constant, with new devices and clusters coming and going.
  • Fault tolerance: Some clusters may be less reliable than others. Decentralized training must be fault-tolerant; a device from any computing island could become unavailable at any time without stopping the training process.
  • And much more…

Addressing these issues requires novel approaches that reduce communication overhead and enhance the flexibility and fault tolerance of training processes. The following sections outline some of these approaches:

Distributed Low-Communication Training (DiLoCo)

Recent work by Google DeepMind has introduced an approach that enables the training of language models on islands of devices that are poorly connected. This method allows for data parallel training on these different islands, requiring synchronization of gradients only every 500 steps.

In data parallelism, all workers typically synchronize their gradients after each step, just before updating the model's weights. The gradient matches the model weight in shape and size. For a 1B model, it means 2GB of data in bfloat16, that need to be sent by each worker at each step. With a typical bandwidth in a decentralized training setup, such as 1Gb/s (equivalent to 0.125GB/s), communicating the gradient will take at least 16 seconds, leading to significant idle time for the GPUs.

However, what if instead of syncing the gradient at every step you would sync only every 500 steps? This would spread the 16 seconds of communication over minutes of computation, maximizing GPU utilization and reducing training time.

Doing so is not easy, because you cannot trivially delay the gradient update. DiLoCo introduces an inner-outer optimization algorithm that allows both local and global updates. Each worker independently updates its weights multiple times using a local AdamW optimizer (inner optimization). Every ~500 updates, the algorithm performs an outer optimization using the Nesterov momentum optimizer, which synchronizes all workers' pseudo gradients (the sum of all local gradients).

Additional to the reduced communication volume, Douillard et al. show that DiLoCo is very robust to scaling individual workers up or down and having an adaptive amount of total compute. This flexibility could enable a decentralized training solution to dynamically adjust the total compute based on availability and competitive pricing, by adding or removing nodes during the training.For example, if a training job were running across 8 globally distributed H100 nodes and a more cost-effective node became available, an orchestration solution could replace an existing, more expensive node, or decide to increase the compute power for the training job to complete the job faster.

Strengths:

  • Requires minimal communication between instances, ideal for distributed training on nodes with low internet speeds.
  • Robust to changes in the number of workers and the total compute power available.

Limitations:

  • DiLoCo has only been tested up to a model size of 400M parameters.
  • DiLoCo requires each instance to have sufficient GPU memory to hold the model's parameters, gradients, and optimizer states (general data parallelism limitation, except for slower off-loading methods).
  • DiLoCo is limited to a synchronous setting  where the weight updates are done at the same time, making it harder to work in heterogeneous settings where certain workers might be slower than others. Although Google Deepmind has recently extended this work to the asynchronous case, enabling DiLoCo’s application on heterogeneous devices.

DiLoCo faces the same limitation as traditional data parallelism; it cannot train models that exceed the memory capacity of its co-located GPUs. To address this, a follow-up work by the same authors at DeepMind has extended DiLoCo to support the training of sparse models (aka MoEs), on poorly connected islands of compute:

DiPaCo: Distributed Path Composition

Sparse models (MoEs) allow to train gigantic models while maintaining manageable training costs, as they do not use all weights during each forward and backward pass. In these models, different inputs will take different paths in the model based on a routing mechanism. Traditional open sparse models (like Mixtral or Databrick’s DRBX model) implement a token-based routing at each transformer block level. This method requires frequent communication for every token in the sequence at each block, which won’t work efficiently across poorly connected devices.

To avoid that much communication, DiPaCo use a coarse routing mechanism. This routing is made at the sequence level (to contrast with token level), greatly reducing the amount of communication needed at each step. Additionally, the routing decisions are made offline before training, allowing data to be pre-sharded. Each worker then processes data specific to one path only.

In their experiments, they trained a sparse model with 256 possibles path, each having 150m active parameters which outperforms a 1B dense parameter model baseline. Each of these paths are not totally independent, otherwise the approach would be equivalent to train 256 different models. The paths share common blocks that are kept in sync using DiLoCo.

Google Gemini Ultra Training

Google has successfully trained their flagship model, Gemini Ultra, using a distributed training infrastructure across multiple data centers. This training infrastructure underscores the challenges faced by even the largest organizations in maintaining all hardware co-located within a single location, highlighting the need for globally distributed training across clusters.

Specifically, Google leveraged data parallelism across multiple TPUv4 Superpods—each consisting of 4096 TPUs—located in various data centers.  The best-in-class network latencies and bandwidths of Google’s infrastructure allowed them to exploit model parallelism within superpods and data-parallelism across superpods.

They do not disclose any further details on the training infrastructure setup. However, a simple back-of-the-envelop calculation based on the rumoured 1e26 hardware FLOPs that were used in the final Gemini Ultra training run, along with a training time of ~100 days and a reasonable hardware FLOPs utilization of 60%, would imply the use of about 18 superpods across different data centers.

(math: 1e26 / (TPU v4 with 275 TFLOPs in bfloat16 * 4000 * 60% HFU * 100 days

→ 1e26 / (275e12 * 4000 * 0.6 * 100 * 24 * 60 * 60) 17.5)

Assuming that Gemini Ultra is a 2T parameter MoE model, similar to GPT-4, and trained using mixed-precision with bfloat16, synchronizing gradients with a simple ring all-reduce method after each training step would require each superpod to transmit and receive 4TB of data. Without more advanced strategies to interleave communication and computation, this extensive data transfer volume would significantly slow down the training speed, even with Google’s great cross data center interconnect speeds.

Next, we explore promising decentralized training paradigms that do not rely on Google’s world-class proprietary infrastructure. These methods, such as SWARM parallelism, enable globally distributed training on poorly connected, heterogeneous and unreliable devices:

SWARM Parallelism

SWARM parallelism is a model-parallel training algorithm designed for devices with poor connectivity and varying reliability. Ryabinin et al. show the feasibility of training billion parameter scale LLMs on preemptible instances and a network bandwidth of less than 200Mb/s while achieving high training throughput.

This method presents a more flexible form of pipeline parallelism. The path for completing a full forward and backward pass is not fixed and may vary over time. Each worker may send its output to any other worker in the subsequent stage. Faster devices receive more tasks to prevent GPU idle time, and also enabling the use on non-homogeneous hardware. If a worker dies, the tasks assigned to it are redirected to others, making the algorithm fault tolerant.

Paths are determined stochastically and on the fly. Each worker in a stage is placed in a priority queue based on its recorded performance in the pipeline over time. Workers that consistently perform tasks faster—due to better hardware or are co-located with preceding workers—are more likely to be picked. If a worker fails to respond to a request, it is temporarily excluded from the queue until it re-announces itself. This dynamic allows the system to operate on preemptible instances or adaptively rebalance nodes within the swarms.

Additionally, they have experimented with gradient and activation quantization (to 8-bit) to further minimize the communication overhead.

In their experiments, the researchers simulate low-bandwidth and high-latency environments and show that it is feasible to train billion-scale transformer models with less than 200Mb/s bandwidth.

Strengths:

  • Enables fault-tolerant training on much cheaper spot instances.
  • Supports training on heterogeneous devices.
  • Theoretically, SWARM parallelism would allow for decentralized training that is computationally equivalent to centralized approaches, thereby reducing the technical risks associated with adopting this methodology.

Limitations:

  • The scalability of the training efficiency to 10B+ parameter model sizes has not been evaluated. However, SWARM parallelism could potentially enable efficient decentralized training of very large models when applying the concept from their setup (single legacy / consumer GPUs + low bandwidth at 200Mb/s) to cloud compute setups with access to large amounts of on-demand and spot instances of up to 8 high-capacity (A100/H100) chips in one server (connected with NVLink) + comparatively high network bandwidth of 500Mb/s - 100Gb/s.
  • As a close reference for this claim, a paper called Varuna released by Microsoft in 2022 demonstrates how to scale distributed training across 100+ spot instances for models with up to 200B parameters (without InfiniBand connections):

Unlike Google's proprietary efforts, SWARM parallelism is open-source, allowing anyone to begin decentralized training across clusters. Although it is primarily a research codebase not intended for production use, its contributions to the field are significant. You can explore and contribute to the project here: https://github.com/yandex-research/swarm. SWARM is built upon the hivemind project:

Hivemind

Hivemind is a framework to perform various decentralized training algorithms. It has been used to train a text to image model across a pool of volunteer compute. At its core, Hivemind utilizes a distributed hash table spread across each worker to communicate metadata and synchronize them. This distributed hash table is implemented using the libp2p open-source project, with an alternative version using IPFS. The GitHub repo: https://github.com/learning-at-home/hivemind.

Hivemind, as a framework, supports several decentralized training algorithms. The most recent one, SWARM, was discussed earlier. Originally, the project was designed with specific research contributions in mind, including Moshpit SGD, Decentralized Mixture-of-Experts, and Training Transformers Together.

If you are looking to deploy a decentralized Hivemind training run today, the simplest method is through our decentralized compute platform at app.primeintellect.ai by selecting the "Prime Intellect Hivemind" base image for deploying your workloads.

Another notable framework is Petals which is built on top of Hivemind. It implements a variant of swarm parallelism with inference and light fine-tuning in mind for several recent LLMs like Mixtral or Llama.

Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

Similar to SWARM parallelism, Varuna uses a combination of pipeline and data parallelism to train models using relatively slow interconnect speeds, compared to InfiniBand connections, and also works on spot instances.. However, it differs by using multi-GPU instances instead of the single V100/T4s used in SWARM parallelism, and it significantly scales the number of pipeline stages, facilitating the training of large models.

Varuna is not trained in a decentralized manner across multiple clusters but rather within a single cluster with low interconnect speeds. Additionally, it lacks some of the recent features of SWARM parallelism, such as the "temporary randomized" pipeline approach that supports training on heterogeneous devices. Despite these differences, it is, in our view, one of the most notable developments in this field, demonstrating how to efficiently scale model training across separate spot instances to hundreds of GPUs and model sizes up to 200B parameters.

Strengths:

  • Demonstrated ability to efficiently scale to training runs involving over 100 spot instances and model sizes up to 200B parameters.

Limitations:

  • Requires large batch sizes for most efficient training. Training the largest models distributed with Varuna necessitates a large number of pipeline stages.
  • Initially designed for homogeneous environments and spot instances within a single location. However, integrating insights from SWARM parallelism with Varuna's impressive scaling capabilities could enable the concept's expansion to large-scale training runs across clusters.

Other Decentralized Training Techniques

Other complementary decentralized training techniques try to lower the communication requirements between nodes by for example:

  • More efficient gradient communication: compressing/sparsifying the gradients in data parallel communication to sync gradients after the backward pass. These include works like:
  • Overlapping communication and computation: hiding the synchronization latency, can be achieved by combining parallelism techniques
    • SWARM parallelism itself supports Delayed Parameter Updates (DPUs) to further improve the training efficiency by performing the optimizer step in parallel with processing the next batch.
  • Reducing memory usage of model pre-training:  Approaches like (ReLoRA and GaLore) allow to fit the pre-training of large models on consumer GPUs by updating fewer parameters at each step. By fitting larger models in memory, these methods reduce the need for inter-node communication, thus helping decentralized training.
  • Scheduling algorithms for geo-distributed settings: Decentralized Training of Foundation Models in Heterogeneous Environments introduces an evolutionary algorithm to find the optimal arrangement to hide communication and increase compute utilization for a given compute island topology.

Scaling Decentralized Training

To date, no one has successfully scaled this research to actually train state-of-the-art models.

Models like GPT-4 or Gemini are said to be up to 2T parameter sparse models with up to million-token context windows. Leading open source LLMs like Mixtral 8x22b, Cohere’s Command R or the upcoming large Llama 3 are at the 100B+ scale. Even state-of-the-art models in other fields, like protein language models, are being scaled to 100B parameter model sizes. This is still one to two orders of magnitude more than what current decentralized training efforts have been able to achieve.

At Prime Intellect we will scale this research to train SOTA models through distributed training across multiple clusters.

Prime Intellect Decentralized Training

At Prime Intellect, we are making use of distributed, low-communication training approaches to enable the training of large AI models using distributed resources, reducing costs and democratizing AI development.

We are focused on developing decentralized training frameworks that exhibit several advantageous properties compared to current research. These include:

  • Enabling on-demand multi-node training at scale across hundreds/thousands of GPUs and small clusters,
  • Seamless orchestration across multiple clusters,
  • Integrating fault tolerance to handle node failures and enabling training on much cheaper spot instances, thereby enabling the use of all idle compute in the world,
  • Providing flexibility to ramp up or down compute resources during training, allowing for dynamic adjustments based on availability and competitive pricing,
  • Scaling to larger model sizes (10-100B+) and context lengths,
  • Requiring low network bandwidth,
  • Incorporating optimizations to automatically minimize network latency,
  • Avoiding slower convergence or instabilities during training.

We believe there should be an open-source stack that offers solutions for orchestration across multiple clusters, efficiency optimizations, handling node failures, infrastructure, monitoring, and much more. We will have more to share about our research efforts in this direction soon.

Join us in creating a free market for training and inference compute, breaking free of long-contracts and interconnect limitations.

We have recently raised a $5.5 million seed round from an awesome group of investors and angels, such as Clem from HuggingFace and Dylan Patel from SemiAnalysis, and are actively hiring stellar AI research engineers to work on this mission.

Apply here


Thanks to Marco Bellagente, Andres Felipe Cruz Salinas, Piotr Mazurek, Andreas Köpf, Felix and many more for feedback on drafts of this post!