Original article excerpt
Server-side extracted preview paragraphs from the original source.
A Blog post by NVIDIA on Hugging Face
NVIDIA Cosmos Predict 2.5 is a large-scale world model capable of generating physically plausible videos conditioned on text, images, or video clips. To adapt it to a specific domain, such as robot manipulation or a particular camera viewpoint, teams still need targeted fine-tuning.
Training robot policies requires demonstration data, but collecting real-robot trajectories is slow and expensive. Generating synthetic trajectories with a fine-tuned video world model offers a scalable alternative. However, full fine-tuning of a 2B-parameter model is expensive and risks catastrophic forgetting of general knowledge. LoRA and DoRA inject small trainable adapter modules into the frozen base model, reducing memory requirements while keeping the adapter files small and portable. This makes it practical to fine-tune on a single GPU and flexibly swap adapters for different domains at inference.
This guide walks through parameter-efficient fine-tuning of Cosmos Predict 2.5 with LoRA and DoRA, using the diffusers and accelerate libraries with support for both single- and multi-GPU training. We then show how to use the fine-tuned model to generate synthetic robot trajectories for downstream robot learning tasks.
After installing diffusers, navigate to examples/cosmos to explore the example code. We use the same datasets as the GR00T Dreams post-training recipe:
Download and preprocess the training and test datasets using download_and_preprocess_datasets.sh:
The eval dataset is a flat directory of paired .txt and .png files for the (prompt, image) pairs: