We are excited to release the Prime Collective Communications Library, a low-level communication library built for decentralized training across the globe.
Communication libraries like Nvidia’s NCCL assume fast, stable connections—ideal for supercomputers, not the public internet.
PCCL is built with fault tolerance as a first-class citizen. There is no bad time to kill a PCCL peer, even with multiple concurrent all-reduces in flight or during peer reconfiguration. Every possible state of the system is designed to be recoverable, as validated by extensive stress tests.
In our testing, PCCL achieves up to 45 Gbit/s of bandwidth across datacenters in Western Europe and 25 Gbit/s when training intercontinentally across North America and Western Europe.
We are releasing PCCL in the hope of accelerating research on low-communication distributed optimization algorithms and further closing the gap to centralized training.
Detailed Technical Report: https://arxiv.org/abs/2505.14065
Github: https://github.com/PrimeIntellect-ai/pccl
Docs: http://pccl.primeintellect.ai/
Install:
pip install pypccl
Traditional MPI libraries were designed primarily for CPU-node supercomputers.
Today we use MPI-like libraries such as NCCL to run deep learning programs, utilizing the fast interconnects available on modern enterprise GPUs. However, along with inheriting the MPI API surface, we also inherit its limitations. Each process runs the same program and communicates with other processes by sending and receiving messages. Because the program is deterministic in terms of control flow, all computers run the same collective communication operations in the same order, and thus the same messages are sent and received by all processes. If any process fails, the entire program fails.
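To make this SPMD assumption concrete, here is a minimal mpi4py sketch (illustrative only, unrelated to PCCL's API): every rank must reach the same allreduce in the same order, and a single rank dying stalls or aborts the whole job.

```python
# Every rank runs this exact program; the collectives only complete because
# all ranks call them in the same order. There is no recovery path if a rank dies.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

grad = np.full(4, float(rank))          # each rank's local contribution
for step in range(3):
    total = np.empty_like(grad)
    # Every rank must reach this call, or the job deadlocks/aborts.
    comm.Allreduce(grad, total, op=MPI.SUM)
    if rank == 0:
        print(f"step {step}: {total}")
```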
MPI programs are designed to run on a single supercomputer in a single datacenter within a single network. Thus, many MPI implementations assume local reachability. For example, to use Meta’s Gloo across the public internet, a common (but theoretically unnecessary and practically slower) solution is to use a VPN. This was a necessary workaround during the training of INTELLECT-1 and reduced throughput.
Peers in an MPI program are fixed at the start of the program. Naively, if one process fails, the entire job fails.
"Fault tolerance" in MPI usually means restarting the entire program from scratch, or at most being able to tolerate a number of peers failing, or relying on application level "hacks" that leave subtle failure scenarios unexplored that can manifest as a crash or stall given bad enough timing conditions.
Joining a new peer to an ongoing MPI job is not possible. For modern ML workloads, we would like to be able to
a) tolerate peers failing ungracefully
b) join or rejoin peers dynamically.
There are good reasons for why traditional MPI does not attempt to solve these problems.
Specifically, for any general program with arbitrarily nested control flow, it is essentially impossible to design a robust scheme for handling newly joining peers that arrive with fresh program state.
In the ML world, we are not interested in generalized scientific computing programs with arbitrarily complex control flow. Instead, we are interested in iterative optimization algorithms, which necessitate the repetition of fundamentally the same operations for every “training step”. In such a setting, robust dynamic membership is indeed possible. Peers either contribute to the training step, or they do not.
PCCL is a library that provides fault-tolerant collective communication primitives designed for the public internet.
The PCCL model is simple: peers repeat fundamentally the same collective operations every training step, and at each step a peer either contributes or it does not.
Recent advancements in the distributed learning literature have shown that synchronizing worker gradients at every step is not necessary for convergence. Optimization strategies like DiLoCo, which synchronize worker-local weight deltas only every N inner steps, are competitive with naive DDP. Crucially, the gap between DiLoCo and DDP shrinks as model size grows. PCCL was developed to take advantage of this new opportunity for scaling language model training in a distributed setting.
PCCL makes it easy to implement schemes such as async DiLoCo, which uses one-step-delayed parameter updates to completely hide communication behind compute: the next set of inner steps is computed concurrently with the reduce of the previous one. In the best case, the number of inner steps is tuned such that compute time matches communication time precisely, giving the best balance of parallelism and communication frequency. Examples of recommended usage patterns and of how to implement common distributed optimization strategies are available in the examples folder of the PCCL repository.
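To make the overlap concrete, here is a minimal, self-contained sketch of a one-step-delayed outer update in the spirit of async DiLoCo. `fake_all_reduce` and `inner_steps` are local stand-ins rather than the PCCL or prime APIs, and the outer optimizer is omitted; treat it as a schematic, not a reference implementation.

```python
# One-step-delayed outer update, schematically: the all-reduce of round t's
# weight delta runs while round t+1's inner steps are computed, and its result
# is applied one outer step late. All names here are local stand-ins.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fake_all_reduce(delta: np.ndarray) -> np.ndarray:
    return delta  # stand-in for averaging the delta across all peers

def inner_steps(theta: np.ndarray, n: int) -> np.ndarray:
    return theta - 0.01 * np.random.randn(*theta.shape)  # stand-in for n local steps

theta = np.zeros(1024)   # synchronized parameters
pending = None           # all-reduce launched in the previous outer step

with ThreadPoolExecutor(max_workers=1) as pool:
    for outer in range(10):
        theta_local = inner_steps(theta, n=50)   # compute overlaps with `pending`
        delta = theta_local - theta              # this round's local weight delta
        if pending is not None:
            theta = theta + pending.result()     # apply last round's averaged delta
        pending = pool.submit(fake_all_reduce, delta)  # hidden behind next round's compute
    if pending is not None:
        theta = theta + pending.result()         # drain the final reduce
```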
TLDR: PCCL is very fault tolerant.
PCCL passes extensive long-running stress tests on all major socket implementations (Linux, macOS, Windows WSA) with a high-frequency training loop in which peers are rapidly spawned and killed with completely random timing to provoke every possible race condition or crash.
As long as application code follows best practices for error recovery / retry logic, there is no bad time to kill a PCCL peer, whether multiple concurrent collective communication operations are in flight (and need to be partially aborted, awaited, or retried), shared-state synchronization is underway, or any other phase is in progress. The shared state is not lost: it continues to be advanced by the training loop through the remaining peers, even under heavy peer churn.
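As a concrete picture of what "error recovery / retry logic" means here, the following self-contained sketch retries a collective after a simulated peer failure. `Communicator`, `update_topology`, `all_reduce`, and `PeerFailureError` are local stubs with made-up names, not the actual pypccl API.

```python
# A step survives a peer dying mid-collective by accepting the membership
# change and re-running the operation with the surviving peers.
import random

class PeerFailureError(Exception):
    """Raised by the stub when a peer drops out mid-collective."""

class Communicator:
    def update_topology(self) -> None:
        pass  # stand-in: admit newly joined peers, drop dead ones

    def all_reduce(self, x: list[float]) -> list[float]:
        if random.random() < 0.2:                # simulate a peer dying mid-operation
            raise PeerFailureError("peer dropped during all-reduce")
        return x                                 # stand-in for the averaged result

comm = Communicator()
local_delta = [0.0] * 8

for step in range(100):
    while True:                                  # retry until the collective succeeds
        try:
            comm.update_topology()               # let membership changes take effect
            averaged = comm.all_reduce(local_delta)
            break                                # success: continue the training step
        except PeerFailureError:
            continue                             # survivors simply retry; shared state is kept
```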
PCCL can perform automatic topology optimization: it triggers bandwidth testing between peers and then constructs the optimal tour given the measured edge costs.
If, for example, computers are colocated in the same datacenter, packets can often be delivered locally without ever bouncing off the gateway. In this case, bandwidth is often around ~50 Gbit/s in most clouds. In such a scenario, when utilizing topology optimization, the cost of “leaving” the datacenter is only paid twice; suboptimal tours would pay this cost more frequently. Given this property, PCCL is not strictly restricted to transport over the public internet and allows for seamless mixing of Ethernet-confined and public-internet communication.
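To see why an optimal tour pays the inter-datacenter cost exactly twice, here is a toy cost calculation; the peers, costs, and cost model (summing per-hop edge costs over the ring) are made up purely for illustration.

```python
# Peers a0..a2 sit in datacenter A, b0..b2 in datacenter B; crossing the
# boundary is far more expensive than staying inside. Numbers are made up.
INTRA, INTER = 1.0, 50.0

def hop_cost(p: str, q: str) -> float:
    return INTRA if p[0] == q[0] else INTER      # same datacenter iff same letter

def ring_cost(order: list[str]) -> float:
    return sum(hop_cost(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))

grouped     = ["a0", "a1", "a2", "b0", "b1", "b2"]   # crosses the boundary twice
interleaved = ["a0", "b0", "a1", "b1", "a2", "b2"]   # crosses it on every hop

print(ring_cost(grouped))      # 4 * 1 + 2 * 50 = 104
print(ring_cost(interleaved))  # 6 * 50         = 300
```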
PCCL can effectively utilize cross-continental long-fat-pipe links by running multiple concurrent pipelined all-reduces that distribute packets over a large connection pool. Because routers enforcing per-flow fair queuing give each connection roughly an equal share of a contended link, striping the reduce across many connections aggregates proportionally more of the available bandwidth.
In our testing, we observe PCCL achieving a throughput of 25 Gbit/s in a setup of 18 peers spread across North America and Western Europe.
Without the involvement of undersea links, bandwidth can be increased beyond this. In a less globally distributed setting, across datacenters in Western Europe, we achieve even higher speeds of up to 45 Gbit/s.
Although PCCL does not directly optimize for high-performance local networks, it remains competitive with PyTorch’s Gloo over Ethernet.
If you want to see how we use PCCL, refer to the open-source prime repository, our production-ready implementation of DiLoCo and async DiLoCo.