We are excited to release the Prime Collective Communications Library, a low-level communication library built for decentralized training across the globe.
Communication libraries like Nvidia’s NCCL assume fast, stable connections—ideal for supercomputers, not the public internet.
PCCL is built with fault tolerance as a first-class citizen. There is no bad time to kill a PCCL peer, even with multiple concurrent all-reduces in flight or during peer reconfiguration. Every possible state of the system is designed to be recoverable, as validated by extensive stress tests.
In our testing, PCCL achieves up to 45 Gbit/s of bandwidth across datacenters in Western Europe and 25 Gbit/s when training intercontinentally across North America and Western Europe.
We are releasing PCCL in the hope of accelerating research on low-communication distributed optimization algorithms and further closing the gap to centralized training.
Detailed Technical Report: https://arxiv.org/abs/2505.14065
Github: https://github.com/PrimeIntellect-ai/pccl
Docs: http://pccl.primeintellect.ai/
Install:
pip install pypccl
Traditional MPI libraries were designed primarily for CPU-node supercomputers.
Today we use MPI-like libraries such as NCCL to run deep learning programs, utilizing the fast interconnects available on modern enterprise GPUs. However, along with inheriting the MPI API surface, we also inherit its limitations. Each process runs the same program and communicates with other processes by sending and receiving messages. Because the program is deterministic in terms of control flow, all computers run the same collective communication operations in the same order, and thus the same messages are sent and received by all processes. If any process fails, the entire program fails.
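To make this SPMD assumption concrete, here is a minimal mpi4py sketch (illustrative only, unrelated to PCCL's API): every rank must reach the same allreduce in the same order, and a single rank dying stalls or aborts the whole job.

```python
# Every rank runs this exact program; the collectives only complete because
# all ranks call them in the same order. There is no recovery path if a rank dies.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

grad = np.full(4, float(rank))          # each rank's local contribution
for step in range(3):
    total = np.empty_like(grad)
    # Every rank must reach this call, or the job deadlocks/aborts.
    comm.Allreduce(grad, total, op=MPI.SUM)
    if rank == 0:
        print(f"step {step}: {total}")
```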
MPI programs are designed to run on a single supercomputer in a single datacenter within a single network. Thus, many MPI implementations assume local reachability. For example, to use Meta’s Gloo across the public internet, a common (but theoretically unnecessary and practically slower) solution is to use a VPN. This was a necessary workaround during the training of INTELLECT-1 and reduced throughput.
Peers in an MPI program are fixed at the start of the program. Naively, if one process fails, the entire job fails.
"Fault tolerance" in MPI usually means restarting the entire program from scratch, or at most being able to tolerate a number of peers failing, or relying on application level "hacks" that leave subtle failure scenarios unexplored that can manifest as a crash or stall given bad enough timing conditions.
Joining a new peer to an ongoing MPI job is not possible. For modern ML workloads, we would like to be able to
a) tolerate peers failing ungracefully
b) join or rejoin peers dynamically.
There are good reasons for why traditional MPI does not attempt to solve these problems.
Specifically, for any general program with arbitrarily nested control flow, it is essentially impossible to design a robust scheme for handling newly joining peers that arrive with fresh program state.
In the ML world, we are not interested in generalized scientific computing programs with arbitrarily complex control flow. Instead, we are interested in iterative optimization algorithms, which necessitate the repetition of fundamentally the same operations for every “training step”. In such a setting, robust dynamic membership is indeed possible. Peers either contribute to the training step, or they do not.
PCCL is a library that provides fault-tolerant collective communication primitives designed for the public internet.
The PCCL model is simple: peers repeat fundamentally the same collective operations every training step, and at each step a peer either contributes or it does not.
Recent advancements in the distributed learning literature have shown that synchronizing worker gradients at every step is not necessary for convergence. Optimization strategies like DiLoCo, which synchronize worker-local weight deltas only every N inner steps, are competitive with naive DDP. Crucially, the gap between DiLoCo and DDP shrinks as model size grows. PCCL was developed to take advantage of this new opportunity for scaling language model training in a distributed setting.
PCCL makes it easy to implement schemes such as async DiLoCo, which uses one-step-delayed parameter updates to completely hide communication behind compute: the next set of inner steps is computed concurrently with the reduce of the previous one. In the best case, the number of inner steps is tuned such that compute time matches communication time precisely, giving the best balance of parallelism and communication frequency. Examples of recommended usage patterns and of how to implement common distributed optimization strategies are available in the examples folder of the PCCL repository.
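To make the overlap concrete, here is a minimal, self-contained sketch of a one-step-delayed outer update in the spirit of async DiLoCo. `fake_all_reduce` and `inner_steps` are local stand-ins rather than the PCCL or prime APIs, and the outer optimizer is omitted; treat it as a schematic, not a reference implementation.

```python
# One-step-delayed outer update, schematically: the all-reduce of round t's
# weight delta runs while round t+1's inner steps are computed, and its result
# is applied one outer step late. All names here are local stand-ins.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fake_all_reduce(delta: np.ndarray) -> np.ndarray:
    return delta  # stand-in for averaging the delta across all peers

def inner_steps(theta: np.ndarray, n: int) -> np.ndarray:
    return theta - 0.01 * np.random.randn(*theta.shape)  # stand-in for n local steps

theta = np.zeros(1024)   # synchronized parameters
pending = None           # all-reduce launched in the previous outer step

with ThreadPoolExecutor(max_workers=1) as pool:
    for outer in range(10):
        theta_local = inner_steps(theta, n=50)   # compute overlaps with `pending`
        delta = theta_local - theta              # this round's local weight delta
        if pending is not None:
            theta = theta + pending.result()     # apply last round's averaged delta
        pending = pool.submit(fake_all_reduce, delta)  # hidden behind next round's compute
    if pending is not None:
        theta = theta + pending.result()         # drain the final reduce
```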
TLDR: PCCL is very fault tolerant.
PCCL passes extensive long-running stress tests on all major socket implementations (Linux, macOS, Windows WSA) with a high-frequency training loop in which peers are rapidly spawned and killed with completely random timing to provoke every possible race condition or crash.
As long as application code follows best practices for error recovery / retry logic, there is no bad time to kill a PCCL peer, whether multiple concurrent collective communication operations are in flight (and need to be partially aborted, awaited, or retried), shared-state synchronization is underway, or any other phase is in progress. The shared state is not lost: it continues to be advanced by the training loop through the remaining peers, even under heavy peer churn.
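As a concrete picture of what "error recovery / retry logic" means here, the following self-contained sketch retries a collective after a simulated peer failure. `Communicator`, `update_topology`, `all_reduce`, and `PeerFailureError` are local stubs with made-up names, not the actual pypccl API.

```python
# A step survives a peer dying mid-collective by accepting the membership
# change and re-running the operation with the surviving peers.
import random

class PeerFailureError(Exception):
    """Raised by the stub when a peer drops out mid-collective."""

class Communicator:
    def update_topology(self) -> None:
        pass  # stand-in: admit newly joined peers, drop dead ones

    def all_reduce(self, x: list[float]) -> list[float]:
        if random.random() < 0.2:                # simulate a peer dying mid-operation
            raise PeerFailureError("peer dropped during all-reduce")
        return x                                 # stand-in for the averaged result

comm = Communicator()
local_delta = [0.0] * 8

for step in range(100):
    while True:                                  # retry until the collective succeeds
        try:
            comm.update_topology()               # let membership changes take effect
            averaged = comm.all_reduce(local_delta)
            break                                # success: continue the training step
        except PeerFailureError:
            continue                             # survivors simply retry; shared state is kept
```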
PCCL can perform automatic topology optimization: it triggers bandwidth testing between peers and then constructs the optimal tour given the measured edge costs.
If, for example, computers are colocated in the same datacenter, packets can often be delivered locally without ever bouncing off the gateway. In this case, bandwidth is often around ~50 Gbit/s in most clouds. In such a scenario, when utilizing topology optimization, the cost of “leaving” the datacenter is only paid twice; suboptimal tours would pay this cost more frequently. Given this property, PCCL is not strictly restricted to transport over the public internet and allows for seamless mixing of Ethernet-confined and public-internet communication.
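To see why an optimal tour pays the inter-datacenter cost exactly twice, here is a toy cost calculation; the peers, costs, and cost model (summing per-hop edge costs over the ring) are made up purely for illustration.

```python
# Peers a0..a2 sit in datacenter A, b0..b2 in datacenter B; crossing the
# boundary is far more expensive than staying inside. Numbers are made up.
INTRA, INTER = 1.0, 50.0

def hop_cost(p: str, q: str) -> float:
    return INTRA if p[0] == q[0] else INTER      # same datacenter iff same letter

def ring_cost(order: list[str]) -> float:
    return sum(hop_cost(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))

grouped     = ["a0", "a1", "a2", "b0", "b1", "b2"]   # crosses the boundary twice
interleaved = ["a0", "b0", "a1", "b1", "a2", "b2"]   # crosses it on every hop

print(ring_cost(grouped))      # 4 * 1 + 2 * 50 = 104
print(ring_cost(interleaved))  # 6 * 50         = 300
```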
PCCL can effectively utilize cross-continental long-fat-pipe links by running multiple concurrent pipelined all-reduces that distribute packets over a large connection pool. Because routers enforcing per-flow fair queuing give each connection roughly an equal share of a contended link, striping the reduce across many connections aggregates proportionally more of the available bandwidth.
In our testing, we observe PCCL achieving a throughput of 25 Gbit/s in a setup of 18 peers spread across North America and Western Europe.
Without the involvement of undersea links, bandwidth can be increased beyond this. In a less globally distributed setting, across datacenters in Western Europe, we achieve even higher speeds of up to 45 Gbit/s.
Although PCCL does not directly optimize for high-performance local networks, it remains competitive with PyTorch’s Gloo over Ethernet.
If you want to see how we use PCCL, refer to the open-source prime repository, our production-ready implementation of DiLoCo and async DiLoCo.