Cosmos3-Nano · GR-1 · Diffusion-Forcing — iter 4000

A mid-training research checkpoint (iteration 4000 / 20000) of NVIDIA Cosmos3-Nano finetuned on the GR-1 humanoid manipulation dataset in the native long-horizon "diffusion forcing" regime (temporal-causal, three-way attention).

⚠️ This is an interim checkpoint from an in-progress run, published for evaluation / reproducibility — not a final or converged model.

What this is

  • Base: nvidia/Cosmos3-Nano (two-tower Omni-MoT World Foundation Model; Qwen3-VL-8B language tower + Wan2.2 VAE).
  • Task: GR-1 forward-dynamics world modeling — predict future video latents conditioned on the first frame + the 44-DoF joint-action sequence.
  • Regime: diffusion forcing. Each latent video frame is noised at an independent σ, and temporal-causal attention over generation supertokens lets the clean past condition the noisy future, which is the basis for stable autoregressive rollout. Config: causal_training_strategy=diffusion_forcing, video_temporal_causal=True, joint_attn_implementation=three_way.
  • Dataset: periphanes/gr1_mg_gr00t_300_new (GR-1 LeRobot v2.0).
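
The regime described above can be illustrated with a toy sketch. This is not the Cosmos training code: the uniform σ distribution, the linear flow-matching interpolation x_σ = (1 − σ)·x + σ·ε, and all function names are assumptions chosen for illustration, and the three-way attention layout is not modeled. It shows only the two ideas named in the bullet: one independent σ per latent frame, and a frame-level causal mask in which a token sees its own frame and all earlier frames.

```python
import random

def sample_frame_sigmas(num_frames, rng):
    """One independent noise level per latent frame (assumed uniform in [0, 1))."""
    return [rng.random() for _ in range(num_frames)]

def noise_frames(latents, sigmas, rng):
    """Noise each frame at its own sigma (assumed linear flow-matching interpolation)."""
    noised = []
    for frame, sigma in zip(latents, sigmas):
        noised.append([(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0) for x in frame])
    return noised

def temporal_causal_mask(num_frames, tokens_per_frame):
    """mask[q][k] is True when query token q may attend to key token k.

    A token attends to every token in its own frame and in all earlier frames.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]

rng = random.Random(0)
# Toy sizes: 5 latent frames with 4 values each (the real latents are much larger).
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
sigmas = sample_frame_sigmas(5, rng)
noised = noise_frames(latents, sigmas, rng)
mask = temporal_causal_mask(num_frames=5, tokens_per_frame=2)
```

During an autoregressive rollout, frames that have already been generated would sit at σ = 0 while the next frame is denoised, which is the "clean past conditions noisy future" case that the per-frame σ sampling prepares the model for.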

Training summary

  • Iteration: 4000 / 20000
  • Hardware: 8× B200 (FSDP)
  • Packing: token-budget, 45056 tokens/seq
  • LR: 2e-4
  • Dataset mode: forward_dynamics (video loss active; actions are clean conditioning)
  • Latent geometry: 17 RGB frames → tcf=4 → T_latent = 5; 256px → ÷16 → ÷2 patch → 8×8 = 64 patches/frame
  • Loss at iter 4000: ~0.13 (video flow-matching)
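
The latent geometry above can be checked with a few lines of arithmetic. The temporal formula (T − 1) / tcf + 1 is an assumption (the usual causal video VAE convention, where the first frame is encoded on its own); it is consistent with the stated 17 → 5 mapping. The function names are illustrative only.

```python
def latent_frames(rgb_frames, tcf):
    """Latent frame count, assuming the first frame is kept and the rest are compressed tcf-fold."""
    if (rgb_frames - 1) % tcf != 0:
        raise ValueError("rgb_frames - 1 must be divisible by the temporal compression factor")
    return (rgb_frames - 1) // tcf + 1

def patches_per_frame(resolution, spatial_down, patch):
    """Patch count for a square frame after spatial downsampling and patchification."""
    side = resolution // spatial_down // patch
    return side * side

t_latent = latent_frames(rgb_frames=17, tcf=4)                  # 5
ppf = patches_per_frame(resolution=256, spatial_down=16, patch=2)  # 8 * 8 = 64
video_tokens = t_latent * ppf                                   # 5 * 64 = 320
```

Under these assumptions a single 17-frame clip contributes 320 video tokens, before any text or action tokens are counted.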

Format & contents

PyTorch Distributed Checkpoint (DCP), FSDP-sharded — model weights only (no optimizer state). The model/ folder contains 8 shards __{0..7}_0.distcp (~11.4 GB each, ~85 GB total) plus the required .metadata.

model/__0_0.distcp … __7_0.distcp
model/.metadata
training_config.yaml
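
Before loading, it can be worth confirming that the download is complete. The following is a minimal sketch that checks for the files listed above; the helper names are hypothetical and it checks only for presence, not for size or integrity.

```python
from pathlib import Path

def expected_files(num_shards=8):
    """Relative paths listed in this card: the DCP shards, the DCP metadata, and the config."""
    names = [f"model/__{i}_0.distcp" for i in range(num_shards)]
    return names + ["model/.metadata", "training_config.yaml"]

def missing_files(root, num_shards=8):
    """Return the expected files that are not present under root."""
    root = Path(root)
    return [name for name in expected_files(num_shards) if not (root / name).is_file()]
```

For example, `missing_files("path/to/download")` returns an empty list when all ten files are present.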

Loading

Requires the NVIDIA Cosmos Framework (cosmos_framework). Place the downloaded model/ folder as the model sub-directory of a checkpoint dir and load it with the framework's DCP loader (same path used for resume/eval).
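
The placement step can be sketched as follows. This covers only the directory layout, not the framework's loader, whose API is not described here. The function name and the example paths are hypothetical, and using a symlink (to avoid copying the shards) assumes the loader follows symlinks; copy the folder instead if it does not.

```python
from pathlib import Path

def stage_checkpoint(download_dir, checkpoint_dir):
    """Expose <download_dir>/model as <checkpoint_dir>/model via a symlink."""
    src = Path(download_dir) / "model"
    dst = Path(checkpoint_dir) / "model"
    if not (src / ".metadata").is_file():
        raise FileNotFoundError(f"{src} does not contain the DCP .metadata file")
    dst.parent.mkdir(parents=True, exist_ok=True)
    if not dst.exists():
        # Assumption: the framework's DCP loader follows symlinks.
        dst.symlink_to(src.resolve(), target_is_directory=True)
    return dst
```

A call such as `stage_checkpoint("downloads/cosmos3-nano-gr1", "checkpoints/iter_4000")` would then give a checkpoint dir to pass to the framework's resume/eval path.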

Note: the temporal-causal generation path requires NATTEN ≥ 0.21.9.dev0 (natten.varlen). If that wheel is unavailable, set COSMOS3_NATTEN_VARLEN_SHIM=1 to enable the opt-in flex_attention block-causal shim. This checkpoint was trained with that shim, which provides the same (full-window, temporal-causal) attention without NATTEN's varlen kernels.
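
One way to make this choice at startup is sketched below. It checks only whether natten.varlen is importable, not the installed NATTEN version, and it assumes the framework reads COSMOS3_NATTEN_VARLEN_SHIM when it is imported, so the variable should be set first. The helper names and the returned labels are illustrative.

```python
import importlib.util
import os

def natten_varlen_available():
    """True if the natten.varlen module can be located in this environment."""
    try:
        return importlib.util.find_spec("natten.varlen") is not None
    except (ImportError, ValueError):
        # natten itself is missing or failed to import.
        return False

def configure_attention_backend(env=os.environ):
    """Enable the flex_attention shim when natten.varlen is not available."""
    if natten_varlen_available():
        return "natten.varlen"
    env["COSMOS3_NATTEN_VARLEN_SHIM"] = "1"
    return "flex_attention shim"
```

Call `configure_attention_backend()` before importing cosmos_framework so the environment variable is already in place.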

License

Derived from nvidia/Cosmos3-Nano; usage is subject to the base model's NVIDIA Cosmos license terms. Refer to the base model card for the authoritative license.
