PRIME Intellect
INTELLECT-2 Release: The First 32B Parameter Model Trained Through Globally Distributed Reinforcement Learning
Announcements
May 11, 2025


We're excited to release INTELLECT-2, the first 32B parameter model trained via globally distributed reinforcement learning. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning language model using fully asynchronous RL across a dynamic, heterogeneous swarm of compute contributors.

To enable a training run on this unique infrastructure, we built several components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, which builds on novel components such as TOPLOC, which verifies rollouts from inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers.

Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial for achieving training stability and ensuring that our model successfully learned its training objective, improving upon QwQ-32B.

We open-source INTELLECT-2 along with our code and data, hoping to enable more open research in the field of globally distributed training.


Paradigm Shift for Distributed Training

Test-time compute scaling with reinforcement learning has emerged as a new scaling axis for large language models (LLMs), enabling improvements by allowing models to spend more time reasoning.

However, reinforcement learning training is typically centralized, requiring large clusters of co-located GPUs and fast interconnect speeds. With INTELLECT-2, we showcase a paradigm shift: reinforcement learning is inherently more asynchronous and well suited for globally distributed compute.

Training Infrastructure

We introduce the following key open-source infrastructure components for training INTELLECT-2:

  • PRIME-RL:
    • A fully asynchronous reinforcement learning framework designed for multi-datacenter training. It decouples rollout generation, model training, and weight broadcasting, enabling training across heterogeneous, unreliable networks.
    • The trainer uses PyTorch FSDP2, inference runs on vLLM, and the verifiers use the GENESYS schema introduced in SYNTHETIC-1.
  • SHARDCAST: A library for distributing large files via an HTTP-based tree-topology network that efficiently propagates updated model weights to the inference workers.
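To illustrate why a tree topology helps here, the following is a minimal sketch of the idea only; the real SHARDCAST shards checkpoints and streams them over HTTP, whereas this toy (with a hypothetical `build_tree_layers` helper and fanout value) just models how a fanout-ary tree reaches N workers in O(log N) hops instead of making the training node serve every worker directly.

```python
def build_tree_layers(num_nodes: int, fanout: int = 2) -> list[list[int]]:
    """Group worker indices into layers of a fanout-ary broadcast tree.

    Layer 0 is the training node holding fresh weights; each node in
    layer k serves up to `fanout` nodes in layer k+1, so the number of
    propagation rounds grows logarithmically in the swarm size.
    """
    layers = [[0]]  # root: the training node
    next_id = 1
    while next_id < num_nodes:
        capacity = len(layers[-1]) * fanout  # slots served by previous layer
        layer = list(range(next_id, min(next_id + capacity, num_nodes)))
        layers.append(layer)
        next_id += len(layer)
    return layers

# 15 nodes with fanout 2 are reached in 4 hops: [0], [1,2], [3..6], [7..14]
layers = build_tree_layers(num_nodes=15, fanout=2)
```

Each layer can start relaying shards as soon as it has received them, so propagation pipelines naturally across the tree.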


  • TOPLOC:
    • A locality-sensitive hashing scheme for efficient verifiable inference. It detects tampering or precision changes in model inference and works reliably across nondeterministic GPU hardware.
    • Inference workers generate rollouts and upload the rollout files via signed URLs; an on-chain event then triggers TOPLOC validators to check them. Accepted files feed the trainer, while invalid ones remove the submitting node from the pool.
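TOPLOC's actual commitment scheme differs in its details, but the core property of locality-sensitive hashing can be sketched as follows, assuming a simple random-hyperplane hash: benign numeric noise from nondeterministic GPU kernels barely perturbs the signature, while tampered outputs produce a large Hamming distance. All names and the tolerance value here are illustrative.

```python
import numpy as np

def lsh_signature(activations: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Sign pattern of projections onto random hyperplanes (one bit each)."""
    return (activations @ planes.T >= 0).astype(np.uint8)

def matches(sig_a: np.ndarray, sig_b: np.ndarray, tolerance: float = 0.05) -> bool:
    """Accept if the fraction of differing bits is small, so benign
    nondeterminism (different kernels, precisions) still verifies."""
    return float(np.mean(sig_a != sig_b)) <= tolerance

rng = np.random.default_rng(0)
planes = rng.normal(size=(256, 1024))               # 256 random hyperplanes
acts = rng.normal(size=1024)                        # stand-in activations
noisy = acts + rng.normal(scale=1e-4, size=1024)    # GPU nondeterminism
tampered = -acts                                    # adversarially altered

sig = lsh_signature(acts, planes)
assert matches(sig, lsh_signature(noisy, planes))         # benign noise passes
assert not matches(sig, lsh_signature(tampered, planes))  # tampering rejected
```

The sign bits only flip when a projection sits very close to zero, which is why small floating-point discrepancies survive verification while substantive changes do not.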


Training Recipe

  • Training Data & Rewards:
    • 285k verifiable tasks (math & coding) from NuminaMath-1.5, Deepscaler, and SYNTHETIC-1.
    • A binary task reward combined with a length reward lets users set a budget of thinking tokens at inference time.
  • Two-step asynchronous RL: The broadcast of new policy weights is fully overlapped with ongoing inference and training, eliminating the communication bottleneck.
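The reward structure above can be sketched as a binary correctness term minus a length penalty; the exact penalty shape and the coefficient `alpha` below are our own illustrative choices, not the values used in training.

```python
def reward(correct: bool, num_tokens: int, target_len: int,
           alpha: float = 0.001) -> float:
    """Binary task reward minus a linear penalty for exceeding target_len.

    Rollouts within the thinking-token budget keep the full task reward;
    longer ones are penalized, which teaches the model to respect the
    budget requested at inference time.
    """
    task_reward = 1.0 if correct else 0.0
    length_penalty = alpha * max(0, num_tokens - target_len)
    return task_reward - length_penalty

assert reward(correct=True, num_tokens=1000, target_len=2000) == 1.0
```

Varying `target_len` per prompt is what makes the budget controllable at inference time rather than fixed by training.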


  • Two-Sided GRPO Clipping: Stabilizes training by mitigating gradient spikes with two-sided token probability ratio clipping.
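A sketch of what two-sided clipping can look like in a GRPO-style token loss, assuming the common formulation where negative-advantage tokens get an additional bound on the probability ratio; the function name and the `eps`/`delta` values are ours for illustration.

```python
import numpy as np

def grpo_token_loss(logp, logp_old, adv, eps=0.2, delta=4.0):
    """Two-sided clipped surrogate loss over token log-probabilities.

    The standard PPO/GRPO clip bounds the ratio in [1-eps, 1+eps]. That
    alone leaves the loss unbounded when the advantage is negative and
    the ratio is huge, so a second clip (delta * adv) caps the loss
    magnitude on those tokens, mitigating gradient spikes.
    """
    ratio = np.exp(logp - logp_old)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # second side: bound the loss from below when advantage is negative
    surrogate = np.where(adv < 0, np.maximum(surrogate, delta * adv), surrogate)
    return -surrogate.mean()
```

For example, a token with ratio 10 and advantage -1 would contribute a loss of 10 under one-sided clipping, but at most `delta` (here 4) with the second clip in place.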


  • Advanced Data Filtering: Combines offline and online filtering to select challenging tasks, significantly enhancing model learning efficiency.
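The filtering idea can be sketched in two stages, with thresholds that are purely illustrative: offline, drop tasks the model solves almost always or almost never; online, drop rollout groups whose rewards are all identical, since their group-relative (GRPO) advantages are zero and carry no learning signal.

```python
def offline_filter(tasks, lo=0.1, hi=0.9):
    """Keep tasks whose measured solve rate lies in the informative band."""
    return [t for t in tasks if lo <= t["solve_rate"] <= hi]

def online_filter(groups):
    """Drop rollout groups where every sample got the same reward:
    group-relative advantages are then all zero."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

tasks = [{"id": 1, "solve_rate": 0.0},   # unsolvable: no gradient signal
         {"id": 2, "solve_rate": 0.5},   # informative
         {"id": 3, "solve_rate": 1.0}]   # trivial: no gradient signal
groups = [{"rewards": [1, 1, 1]},        # filtered out online
          {"rewards": [0, 1, 0]}]        # kept
```

Spending rollout compute only on tasks in the informative band is what drives the efficiency gain.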


  • Aggressive Gradient Clipping: Addresses escalating gradient norms at scale, providing improved training stability.
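Global-norm gradient clipping itself is standard; "aggressive" simply means choosing a small maximum norm (the exact value used in training is not stated here). A minimal dependency-free sketch:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale all gradient vectors so their combined L2 norm is <= max_norm.

    Mirrors the behavior of utilities like PyTorch's clip_grad_norm_:
    the global norm is computed over every parameter's gradient, and all
    gradients are scaled by the same factor when it exceeds max_norm.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / total
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total

clipped, norm = clip_grad_norm([[3.0], [4.0]], max_norm=1.0)
```

Uniform scaling preserves the gradient direction while bounding the update magnitude, which is what tames the escalating norms observed at scale.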

Experiments

We report results from two main experiments: TARGET-SHORT, an experimental run with short target lengths to train an efficient reasoning model, and TARGET-LONG, our main run with longer target lengths.

  • Compute Utilization: During the two main experiments, we successfully overlapped communication with computation through two-step asynchronous reinforcement learning.
  • Reward Trajectories:
    • Throughout training, we saw significant improvements in our task rewards, indicating that the model improved its performance on our mathematics and coding problems. We also saw a reduction in length penalties, though a much slower one than in our ablation experiments.

  • Benchmark Performance: We were able to improve upon QwQ-32B's performance on mathematics and coding benchmarks.


  • Overall, as QwQ-32B was already extensively trained with RL, it was difficult to obtain large generalized improvements on benchmarks beyond our gains on the training dataset. Stronger improvements will likely require better base models, such as the now-available Qwen3, or higher-quality datasets and RL environments.

Future Work

INTELLECT-2 is a first step towards open frontier reasoning models trained in a globally-distributed fashion. Over the coming months, we'll work on:

  • Increasing the Ratio of Inference to Training Compute
    • Inference is embarrassingly parallel and communication-free, so more complex RL environments that devote more FLOPs to inference are a natural fit for low-communication training.
  • Tool Calls & Multi-Turn RL
    • To fully leverage inference-time compute for scientific and research progress, we need to give models built-in tools within their reasoning chains, such as web search and Python interpreters.
  • Crowdsourcing RL Tasks & Environments
    • We believe open-source has a unique advantage here. Distributed RL is still in its early days, and with the right community and contributions, open-source AI can outpace the closed labs.
  • Model Merging & DiLoCo
    • Fuse independently trained RL models, either after training or continuously via DiLoCo, to create a single unified model and scale distributed RL to substantially more compute.
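The simplest form of model merging is uniform parameter averaging of independently trained checkpoints; DiLoCo-style training would instead average pseudo-gradients periodically during training. A hedged sketch of the former, with hypothetical dict-based checkpoints standing in for real state dicts:

```python
def merge_state_dicts(dicts, weights=None):
    """Weighted average of parameter dicts with identical keys and shapes.

    With weights=None this is uniform averaging; non-uniform weights let
    stronger checkpoints contribute more to the merged model.
    """
    if weights is None:
        weights = [1.0 / len(dicts)] * len(dicts)
    keys = dicts[0].keys()
    return {k: sum(w * d[k] for w, d in zip(weights, dicts)) for k in keys}

a = {"layer.w": 1.0, "layer.b": 0.0}   # checkpoint from one RL run
b = {"layer.w": 3.0, "layer.b": 2.0}   # checkpoint from another run
merged = merge_state_dicts([a, b])
```

Real merging operates on tensors rather than scalars, but the averaging structure is the same.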

Check out the technical reports for more details: primeintellect.ai/intellect-2

Join Us

INTELLECT-2 demonstrates that globally distributed RL works. Now, we're focusing on tool-assisted reasoning, crowdsourcing higher-quality data, and optimizing our infrastructure and training recipe to build frontier open models.

Join us if you want to help build open-source, democratized AGI.