Cosmos3-Nano: G1 BrainCo Policy SFT (iter 1000)
Model: Cosmos3-Nano fine-tuned in policy mode on the Unitree G1 humanoid robot with BrainCo dexterous hands.
Given a single ego-camera image and a task description, the model simultaneously predicts:
- Future video: 16 frames at 15 fps showing the robot completing the task
- Actions: 26D joint-angle trajectory (7 left arm + 7 right arm + 6 left hand + 6 right hand)
No action input is required at inference time; the model generates actions from vision + language alone.
Model Details
| Property | Value |
|---|---|
| Base model | Cosmos3-Nano (Qwen3-VL-8B backbone, ~30B params) |
| Fine-tuning mode | Policy SFT (image + prompt → video + actions) |
| Training checkpoint | iter 1000 |
| Training dataset | G1 BrainCo: 8 manipulation tasks × ~300 episodes = ~1,598 episodes |
| Action space | 26 DOF: 7 left arm + 7 right arm + 6 left hand + 6 right hand (BrainCo) |
| Camera | Ego-view from robot head (cam_left_high), 640×480, 30 fps |
| Action FPS | 15 Hz |
| Chunk length | 16 frames (1.07 seconds per chunk) |
| Action normalization | Quantile (q01/q99 → [-1, 1]) |
| Training hardware | 8× NVIDIA A100/H100 80GB, FSDP |
| Training framework | Cosmos3 (NVIDIA) |
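The chunk timing in the table follows from the action rate: 16 frames at 15 Hz is about 1.07 seconds, and a multi-chunk rollout scales linearly. A minimal sketch of that arithmetic; the function name `rollout_duration` is illustrative and not part of Cosmos3.

```python
ACTION_FPS = 15    # action/video rate from the table above
CHUNK_FRAMES = 16  # frames (and action steps) per chunk


def rollout_duration(n_chunks: int) -> tuple[int, float]:
    """Return (total_frames, seconds rounded to 2 decimals) for n_chunks chunks."""
    total_frames = n_chunks * CHUNK_FRAMES
    seconds = round(total_frames / ACTION_FPS, 2)
    return total_frames, seconds


print(rollout_duration(1))  # (16, 1.07)
print(rollout_duration(5))  # (80, 5.33)
```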
Tasks Trained On
| Task | Description |
|---|---|
| GraspOreo | Grasp an Oreo cookie from the table |
| GraspRubiksCube | Grasp a Rubik's cube from the table |
| PickApple | Pick up an apple from the table |
| PickCharger | Pick up a phone charger from the table |
| PickDoll | Pick up a doll from the table |
| PickDrink | Pick up a drink bottle from the table |
| PickTissues | Pick up a tissue box from the table |
| PickToothpaste | Pick up a toothpaste tube from the table |
Repository Structure
Cosmos3-Nano-G1-BrainCo-PolicySFT/
├── config.json # Model architecture config (HF format)
├── checkpoint.json # Checkpoint metadata
├── model.safetensors.index.json # Shard index
├── model-00001-of-00007.safetensors # Weight shard 1/7 (~4.6 GB)
├── model-00002-of-00007.safetensors # Weight shard 2/7 (~5.0 GB)
├── model-00003-of-00007.safetensors # Weight shard 3/7 (~4.6 GB)
├── model-00004-of-00007.safetensors # Weight shard 4/7 (~5.0 GB)
├── model-00005-of-00007.safetensors # Weight shard 5/7 (~5.0 GB)
├── model-00006-of-00007.safetensors # Weight shard 6/7 (~4.0 GB)
├── model-00007-of-00007.safetensors # Weight shard 7/7 (~2.4 GB)
└── README.md
Total size: ~30 GB
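Because the download is about 30 GB across seven shards, it can be worth checking that every shard arrived before launching inference. The sketch below assumes `model.safetensors.index.json` follows the usual Hugging Face sharded layout, with a `weight_map` object mapping tensor names to shard file names. This has not been verified against this specific repository, and `missing_shards` is a hypothetical helper name.

```python
import json
from pathlib import Path


def missing_shards(checkpoint_dir: str) -> list[str]:
    """List shard files named in the index but absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # Assumption: "weight_map" maps each tensor name to the shard that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

An empty list from `missing_shards("examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k")` would indicate that all indexed shards are present on disk.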
How to Run: From Scratch
1. Clone and install Cosmos3
# Clone the Cosmos3 repo
git clone https://github.com/nvidia/cosmos3 cosmos3
cd cosmos3/packages/cosmos3
# Install with training extras (requires Python 3.11+, CUDA 13.0 or 12.8)
uv sync --all-extras --group=cu130-train
# Or for CUDA 12.8:
# uv sync --all-extras --group=cu128-train
# Required in NGC/PyTorch containers to avoid torch._C import errors:
export LD_LIBRARY_PATH=''
2. Download this model checkpoint
mkdir -p examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k
# Using huggingface_hub CLI:
pip install huggingface_hub
huggingface-cli download JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT \
--local-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k
# Or in Python:
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT',
local_dir='examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k'
)
"
3. Download the Wan2.2 VAE (required for video decoding)
pip install uv  # uvx is bundled with uv; skip if uv is already installed from step 1
uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \
--local-dir examples/checkpoints/wan22_vae
4. Prepare your initial observation image
Save your robot's ego-camera image as a JPEG or PNG file, e.g. my_obs.jpg.
- Resolution: 640×480 recommended (or any 4:3 ratio)
- The camera should be mounted on the robot's head, looking down at the workspace
5. Create an inference input JSON
Save the following as my_policy_input.json:
[{
"model_mode": "policy",
"name": "my_task",
"domain_name": "g1_brainco",
"fps": 15,
"image_size": 480,
"action_chunk_size": 16,
"raw_action_dim": 26,
"view_point": "ego_view",
"prompt": "{\"subjects\": [{\"description\": \"A Unitree G1 humanoid robot with articulated arms and dexterous hands\", \"action\": \"Pick up the red cup from the table\"}], \"background_setting\": \"An indoor workspace\", \"cinematography\": {\"camera_motion\": \"static\", \"framing\": \"ego-perspective view from the robot head camera looking at the workspace\", \"camera_angle\": \"ego\"}, \"actions\": [{\"time\": \"0:00-1s\", \"description\": \"Pick up the red cup from the table\"}], \"temporal_caption\": \"A Unitree G1 humanoid robot performs the task: Pick up the red cup from the table.\", \"resolution\": {\"H\": 480, \"W\": 640}, \"aspect_ratio\": \"4,3\", \"duration\": \"1s\", \"fps\": 15.0}",
"seed": 42,
"vision_path": "my_obs.jpg"
}]
Replace "Pick up the red cup from the table" with your actual task description.
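Editing the task text by hand is error-prone because the `prompt` field is itself a JSON document stored as an escaped string, and the task appears in three places inside it. The sketch below builds the same structure as the example above and lets `json.dumps` handle the escaping. `build_policy_input` is a hypothetical helper, not part of Cosmos3, and all field values are copied from the example rather than from a schema.

```python
import json


def build_policy_input(task: str, image_path: str, name: str = "my_task") -> list[dict]:
    """Build the inference input list with the task inserted everywhere it appears."""
    prompt = {
        "subjects": [{
            "description": "A Unitree G1 humanoid robot with articulated arms and dexterous hands",
            "action": task,
        }],
        "background_setting": "An indoor workspace",
        "cinematography": {
            "camera_motion": "static",
            "framing": "ego-perspective view from the robot head camera looking at the workspace",
            "camera_angle": "ego",
        },
        "actions": [{"time": "0:00-1s", "description": task}],
        "temporal_caption": f"A Unitree G1 humanoid robot performs the task: {task}.",
        "resolution": {"H": 480, "W": 640},
        "aspect_ratio": "4,3",
        "duration": "1s",
        "fps": 15.0,
    }
    return [{
        "model_mode": "policy",
        "name": name,
        "domain_name": "g1_brainco",
        "fps": 15,
        "image_size": 480,
        "action_chunk_size": 16,
        "raw_action_dim": 26,
        "view_point": "ego_view",
        "prompt": json.dumps(prompt),  # nested JSON is stored as a string
        "seed": 42,
        "vision_path": image_path,
    }]


if __name__ == "__main__":
    entries = build_policy_input("Pick up an apple from the table", "my_obs.jpg")
    with open("my_policy_input.json", "w") as f:
        json.dump(entries, f, indent=2)
```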
6. Run single-chunk policy inference (1.07 seconds = 16 frames + 26D actions)
export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50014 \
-m cosmos_framework.scripts.inference \
--checkpoint-path examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
--no-use-torch-compile \
--no-guardrails \
--output-dir outputs/my_policy_run \
-i my_policy_input.json
Outputs (in outputs/my_policy_run/my_task/):
- vision.mp4: 16-frame predicted video at 15 fps
- sample_outputs.json: predicted 26D joint-angle trajectory (normalized)
7. Chunked rollout: complete a full task (~5 seconds = 5 chunks × 16 frames)
For a full task completion, use the autoregressive rollout script:
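# Provide your initial frame as an MP4 (single frame or short clip).
# Copying a JPEG to a .mp4 file name does not produce a valid video, so convert it, e.g. with ffmpeg:
mkdir -p /tmp/cosmos_outputs/g1_inference
ffmpeg -i my_obs.jpg -frames:v 1 -pix_fmt yuv420p /tmp/cosmos_outputs/g1_inference/mycustom_init.mp4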
export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50015 \
examples/policy_rollout.py \
--checkpoint-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
--base-checkpoint-dir examples/checkpoints/Cosmos3-Nano \
--output-dir outputs/policy_rollout_2k \
--norm-stats-path examples/data/g1_brainco/action_stats.json \
--n-chunks 5 \
--tasks mycustom
This generates:
- rollout.mp4: full 80-frame video (~5.3 seconds at 15 fps)
- actions_raw.npy: 80×26 joint-angle array in radians
- actions_raw.json: same, human-readable JSON
- rollout_meta.json: metadata
Output Format
Predicted actions
Actions in the raw joint-angle space (radians) use the following column layout:
Columns 0-6: Left arm (7 DOF)
Columns 7-13: Right arm (7 DOF)
Columns 14-19: Left hand / BrainCo (6 DOF)
Columns 20-25: Right hand / BrainCo (6 DOF)
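The column layout above can be written as named slices, which avoids off-by-one mistakes when routing one action step to the arm and hand controllers. This is a sketch only: the group names and the `split_action` helper are illustrative and do not come from the Cosmos3 codebase.

```python
# Column ranges from the layout above (Python slices exclude the end index).
ACTION_GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "left_hand": slice(14, 20),
    "right_hand": slice(20, 26),
}


def split_action(step: list[float]) -> dict[str, list[float]]:
    """Split one 26D action step into its four joint groups."""
    if len(step) != 26:
        raise ValueError(f"expected 26 values, got {len(step)}")
    return {name: step[cols] for name, cols in ACTION_GROUPS.items()}


parts = split_action([float(i) for i in range(26)])
print({name: len(values) for name, values in parts.items()})
# {'left_arm': 7, 'right_arm': 7, 'left_hand': 6, 'right_hand': 6}
```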
In the sample_outputs.json from the inference CLI, the action field contains normalized values in [-1, 1]. To convert to radians, apply:
import json
import numpy as np
# Load the per-joint action statistics (26 values per field)
with open("examples/data/g1_brainco/action_stats.json") as f:
    stats = json.load(f)
q01 = np.array(stats["q01"])
q99 = np.array(stats["q99"])
# Load normalized actions from the inference output
with open("outputs/my_policy_run/my_task/sample_outputs.json") as f:
    result = json.load(f)
actions_norm = np.array(result["outputs"][0]["content"]["action"])  # [16, 26]
# Denormalize: map [-1, 1] back to [q01, q99] for each joint
actions_rad = q01 + (actions_norm + 1.0) / 2.0 * (q99 - q01)
print(f"Actions shape: {actions_rad.shape}")  # (16, 26)
print(f"Left arm (first step): {actions_rad[0, :7]}")
Action Normalization Stats
The action_stats.json file used to normalize/denormalize actions is included at examples/data/g1_brainco/action_stats.json in the Cosmos3 repo. Key fields:
- q01: 1st percentile per joint (26 values)
- q99: 99th percentile per joint (26 values)
- mean, std, min, max: also available
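For reference, the quantile mapping and its inverse can be sketched in plain Python for a single action step. The `normalize` direction is inferred here as the algebraic inverse of the denormalization formula given earlier. Whether the training pipeline also clips values outside [q01, q99] is not stated in this card, so no clipping is applied. Both function names are illustrative.

```python
def normalize(raw: list[float], q01: list[float], q99: list[float]) -> list[float]:
    """Map raw joint angles so that q01 -> -1 and q99 -> +1, per joint."""
    return [2.0 * (r - lo) / (hi - lo) - 1.0 for r, lo, hi in zip(raw, q01, q99)]


def denormalize(norm: list[float], q01: list[float], q99: list[float]) -> list[float]:
    """Inverse mapping: -1 -> q01 and +1 -> q99, per joint."""
    return [lo + (n + 1.0) / 2.0 * (hi - lo) for n, lo, hi in zip(norm, q01, q99)]


# Toy statistics for three joints (made-up numbers, not the real action_stats.json).
q01 = [-0.5, -1.0, 0.0]
q99 = [0.5, 1.0, 2.0]
print(normalize([-0.5, 0.0, 2.0], q01, q99))    # [-1.0, 0.0, 1.0]
print(denormalize([-1.0, 0.0, 1.0], q01, q99))  # [-0.5, 0.0, 2.0]
```

Raw values outside the [q01, q99] range map outside [-1, 1] under this formula, so predicted values slightly beyond that interval are not necessarily an error.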
Training Details
| Parameter | Value |
|---|---|
| Base model | Cosmos3-Nano |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-6) |
| Learning rate | 5e-6 (lower than in forward-dynamics (FD) mode, because the policy generates actions from scratch) |
| LR schedule | Cosine schedule with 100 warmup steps, 10k total iters |
| Batch size | 1 per GPU × 8 GPUs = 8 effective |
| Grad clip norm | 0.1 |
| Mixed precision | bfloat16 |
| Parallelism | FSDP (8 GPUs fully sharded) |
| EMA | rate=0.1 |
| Trained params | moe_gen, time_embedder, vae2llm, llm2vae (action-relevant heads) |
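To give a feel for where iter 1000 sits on the learning-rate curve, here is one common reading of the schedule in the table: linear warmup to the peak over 100 steps, then cosine decay to zero at 10k iterations. The warmup shape and the final value of zero are assumptions, not confirmed by the training config, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 5e-6        # learning rate from the table
WARMUP_STEPS = 100    # warmup length from the table
TOTAL_STEPS = 10_000  # total iterations from the table


def lr_at(step: int) -> float:
    """Learning rate at a 0-based step (assumed linear warmup, then cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 1_000, 10_000):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the rate at iter 1000 is still close to the peak, which is consistent with this being an early checkpoint.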
Limitations
- Trained for 1,000 iterations (an early checkpoint; loss was still decreasing). Better checkpoints at 5k/7k/10k will be released.
- Generalizes well to the 8 trained task categories; zero-shot on very different tasks is limited.
- Action predictions at iter 1000 may drift over long rollouts; shorter rollouts (≤3 chunks) give the best quality.
- Dataset uses 28-minute shared video files (not individual episode files), which limits training speed.
Citation
If you use this model, please cite:
@misc{cosmos3-g1-policy-sft-2k,
title = {Cosmos3-Nano G1 BrainCo Policy SFT (iter 1000)},
author = {JeffrinSam},
year = {2026},
url = {https://huggingface.co/JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT},
note = {Fine-tuned from NVIDIA Cosmos3-Nano on Unitree G1 humanoid manipulation tasks}
}
Related
- Cosmos3-Nano-G1-BrainCo-ActionSFT: Forward-dynamics mode (iter 800), requires action input
- NVIDIA Cosmos3: Base framework