If you read our previous post on the landscape of async RL training, you already know the punchline. Every async RL library, regardless of how it spells "actor model" or which color its NCCL backend is painted, eventually trips over the same root: weight synchronization.
The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy. This sits on the critical path whether you are running sync or async: during a blocking transfer, GPUs sit idle instead of generating tokens. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready: the moment its optimizer step finishes, it uploads the weights to the shared bucket and publishes "weights ready", while the inference engine fetches on its own time.
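To make the "publish and move on" idea concrete, here is a minimal sketch. It assumes a local temporary directory as a stand-in for the shared bucket; the `READY` marker file and the `publish` and `fetch_if_newer` helpers are hypothetical names chosen for illustration, not an API from any library mentioned here. The trainer writes the payload first and flips the marker last, so a reader that sees the marker can assume the payload is complete.

```python
import json
import os
import tempfile


def publish(bucket_dir, step, payload):
    """Trainer side: upload the payload, then flip the 'weights ready' marker."""
    with open(os.path.join(bucket_dir, f"weights_{step}.bin"), "wb") as f:
        f.write(payload)
    # Write the marker to a temp name and rename it, so readers never see
    # a half-written marker (the rename is atomic on a local filesystem).
    tmp = os.path.join(bucket_dir, "READY.tmp")
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, os.path.join(bucket_dir, "READY"))


def fetch_if_newer(bucket_dir, current_step):
    """Inference side: return (step, payload) if newer weights are ready, else None."""
    marker = os.path.join(bucket_dir, "READY")
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        step = json.load(f)["step"]
    if step <= current_step:
        return None
    with open(os.path.join(bucket_dir, f"weights_{step}.bin"), "rb") as f:
        return step, f.read()


bucket_dir = tempfile.mkdtemp()  # hypothetical stand-in for the shared bucket

first_poll = fetch_if_newer(bucket_dir, current_step=0)   # nothing published yet
publish(bucket_dir, 1, b"step-1 weights")                  # trainer never waits on a reader
second_poll = fetch_if_newer(bucket_dir, current_step=0)  # inference picks it up later
third_poll = fetch_if_newer(bucket_dir, current_step=1)   # already up to date
```

Neither function calls the other or blocks on it: the only shared state is what sits in the directory, which is the property the paragraph above relies on.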
Fireworks put a very memorable number on this in their post Frontier RL Is Cheaper Than You Think: for a frontier 1T-parameter checkpoint at fp8 (their setting), a full snapshot is 1024 GiB, and that is what conventional wisdom says you have to ship every time you update your rollout fleet. That is the kind of number that gets people to start drawing diagrams with mega-clusters, RDMA fabrics, and dedicated cross-region links. Their measured average delta between adjacent checkpoints lands at 20.3 GiB, or 1.98% of the full model, and "more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints".
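The arithmetic behind those figures is easy to check. The sketch below uses only the two sizes quoted above; the 10 Gbit/s link speed is a hypothetical number picked for illustration, and the transfer-time model is idealized (no protocol overhead, no compression).

```python
GIB = 2**30  # bytes per GiB

full_snapshot_gib = 1024.0  # quoted above: 1T parameters at fp8
avg_delta_gib = 20.3        # quoted above: average delta between adjacent checkpoints

fraction = avg_delta_gib / full_snapshot_gib   # share of the model that must ship
reduction = full_snapshot_gib / avg_delta_gib  # how many times smaller the delta is


def transfer_seconds(size_gib, link_gbit_per_s):
    """Idealized wire time for size_gib over a link of link_gbit_per_s."""
    bits = size_gib * GIB * 8
    return bits / (link_gbit_per_s * 1e9)


LINK_GBIT = 10.0  # assumption: not a figure from either report

print(f"delta fraction: {fraction:.2%}")
print(f"size reduction: {reduction:.1f}x")
print(f"full snapshot:  {transfer_seconds(full_snapshot_gib, LINK_GBIT):.0f} s")
print(f"average delta:  {transfer_seconds(avg_delta_gib, LINK_GBIT):.0f} s")
```

On that assumed link the full snapshot takes roughly a quarter of an hour while the delta takes well under a minute, which is the gap between "build a dedicated fabric" and "use a bucket".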
Cursor's Composer 2 report tells a parallel story. They run training and inference in different regions and stitch them together with a shared S3 bucket (their exact words), into which the trainer uploads compressed weight diffs every training step. Each cluster independently downloads and reconstructs from the shared delta chain, "requiring no direct connectivity to the training cluster". The two sides never speak to each other about parameters directly. The bucket is the wire.
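A delta chain of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the format either team uses: weights are modeled as an `array` of 16-bit patterns (as if they were raw bf16 bits), the bucket is an in-memory dict, and the JSON-plus-zlib encoding and the `encode_delta`, `apply_delta`, `publish`, and `catch_up` names are all hypothetical.

```python
import json
import zlib
from array import array

bucket = {}  # hypothetical stand-in for the shared object store: key -> bytes


def encode_delta(old, new):
    """Compressed sparse diff: positions and new bit patterns of changed elements."""
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    payload = {"idx": changed, "val": [new[i] for i in changed]}
    return zlib.compress(json.dumps(payload).encode())


def apply_delta(weights, blob):
    """Return a copy of weights with one delta applied."""
    payload = json.loads(zlib.decompress(blob))
    out = array(weights.typecode, weights)
    for i, v in zip(payload["idx"], payload["val"]):
        out[i] = v
    return out


def publish(step, old, new):
    """Trainer side: upload the diff for this step under a sortable key."""
    bucket[f"deltas/{step:08d}"] = encode_delta(old, new)


def catch_up(weights, applied_step):
    """Inference side: apply every delta newer than applied_step, in order."""
    for key in sorted(bucket):
        step = int(key.rsplit("/", 1)[1])
        if step > applied_step:
            weights = apply_delta(weights, bucket[key])
            applied_step = step
    return weights, applied_step


# Trainer: two steps, each touching only a few of 1000 elements.
base = array("H", [0] * 1000)
step1 = array("H", base)
step1[3], step1[500] = 7, 42
step2 = array("H", step1)
step2[3] = 9
publish(1, base, step1)
publish(2, step1, step2)

# Inference cluster: starts from the base snapshot and reads only the bucket.
replica, applied = catch_up(array("H", base), applied_step=0)
```

The replica never contacts the trainer: it reconstructs step 2 from the base snapshot plus whatever it finds in the bucket, and calling `catch_up` again later picks up only the deltas it has not yet applied.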
Both papers agree on three things, and we want to repeat them slowly, because the rest of this post is essentially a faithful open source translation:
The only thing missing was a version of this story that you can pip install. So we wrote one.