Cosmos3-Nano — G1 BrainCo Policy SFT (iter 1000)

Model: Cosmos3-Nano fine-tuned in policy mode on the Unitree G1 humanoid robot with BrainCo dexterous hands.

Given a single ego-camera image and a task description, the model simultaneously predicts:

Future video — 16 frames at 15 fps showing the robot completing the task
Actions — 26D joint-angle trajectory (7 left arm + 7 right arm + 6 left hand + 6 right hand)

No action input is required at inference time — the model generates actions from vision + language alone.

Model Details

Property	Value
Base model	Cosmos3-Nano (Qwen3-VL-8B backbone, ~30B params)
Fine-tuning mode	Policy SFT (image + prompt → video + actions)
Training checkpoint	iter 1000
Training dataset	G1 BrainCo: 8 manipulation tasks, ~1,598 episodes in total (~200 per task)
Action space	26 DOF: 7 left arm + 7 right arm + 6 left hand + 6 right hand (BrainCo)
Camera	Ego-view from robot head (`cam_left_high`), 640×480, 30fps
Action FPS	15 Hz
Chunk length	16 frames (1.07 seconds per chunk)
Action normalization	Quantile (q01/q99 → [-1, 1])
Training hardware	8× NVIDIA A100/H100 80GB, FSDP
Training framework	Cosmos3 (NVIDIA)

Tasks Trained On

Task	Description
GraspOreo	Grasp an Oreo cookie from the table
GraspRubiksCube	Grasp a Rubik's cube from the table
PickApple	Pick up an apple from the table
PickCharger	Pick up a phone charger from the table
PickDoll	Pick up a doll from the table
PickDrink	Pick up a drink bottle from the table
PickTissues	Pick up a tissue box from the table
PickToothpaste	Pick up a toothpaste tube from the table

Repository Structure

Cosmos3-Nano-G1-BrainCo-PolicySFT/
├── config.json                        # Model architecture config (HF format)
├── checkpoint.json                    # Checkpoint metadata
├── model.safetensors.index.json       # Shard index
├── model-00001-of-00007.safetensors   # Weight shard 1/7 (~4.6 GB)
├── model-00002-of-00007.safetensors   # Weight shard 2/7 (~5.0 GB)
├── model-00003-of-00007.safetensors   # Weight shard 3/7 (~4.6 GB)
├── model-00004-of-00007.safetensors   # Weight shard 4/7 (~5.0 GB)
├── model-00005-of-00007.safetensors   # Weight shard 5/7 (~5.0 GB)
├── model-00006-of-00007.safetensors   # Weight shard 6/7 (~4.0 GB)
├── model-00007-of-00007.safetensors   # Weight shard 7/7 (~2.4 GB)
└── README.md

Total size: ~30 GB
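
With seven multi-gigabyte shards, an interrupted download is easy to miss. The following is a minimal sketch for checking that every shard named in the index is on disk. It assumes the index follows the usual Hugging Face sharded-safetensors layout, where a `weight_map` object maps tensor names to shard filenames; the helper name `missing_shards` is hypothetical and not part of Cosmos3.

```python
import json
from pathlib import Path


def missing_shards(checkpoint_dir):
    """Return shard filenames listed in the index but absent on disk."""
    ckpt = Path(checkpoint_dir)
    with open(ckpt / "model.safetensors.index.json") as f:
        index = json.load(f)
    # Assumption: "weight_map" maps each tensor name to the shard storing it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (ckpt / name).is_file()]
```

Calling `missing_shards("examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k")` after the download step should return an empty list when all shards arrived.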

How to Run — From Scratch

1. Clone and install Cosmos3

# Clone the Cosmos3 repo
git clone https://github.com/nvidia/cosmos3 cosmos3
cd cosmos3/packages/cosmos3

# Install with training extras (requires Python 3.11+, CUDA 13.0 or 12.8)
uv sync --all-extras --group=cu130-train
# Or for CUDA 12.8:
# uv sync --all-extras --group=cu128-train

# Required in NGC/PyTorch containers to avoid torch._C import errors:
export LD_LIBRARY_PATH=''

2. Download this model checkpoint

mkdir -p examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k

# Using huggingface_hub CLI:
pip install huggingface_hub
huggingface-cli download JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT \
    --local-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k

# Or in Python:
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT',
    local_dir='examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k'
)
"

3. Download the Wan2.2 VAE (required for video decoding)

# uvx ships with uv, which step 1 already requires. If it is missing:
# pip install uv
uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \
    --local-dir examples/checkpoints/wan22_vae

4. Prepare your initial observation image

Save your robot's ego-camera image as a JPEG or PNG file, e.g. my_obs.jpg.

Resolution: 640×480 recommended (or any 4:3 ratio)
The camera should be mounted on the robot's head, looking down at the workspace
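
If your camera does not produce a 4:3 image, one option is to center-crop before resizing to 640×480. The sketch below only computes the crop box; the function name `center_crop_box_4_3` is hypothetical, and the card does not say whether the inference pipeline crops or resizes on its own, so treat this as an optional preprocessing step.

```python
def center_crop_box_4_3(width, height):
    """Return (left, top, right, bottom) of the largest centered 4:3 crop."""
    if width * 3 >= height * 4:
        # Image is too wide (or already 4:3): keep full height, trim width.
        new_w, new_h = (height * 4) // 3, height
    else:
        # Image is too tall: keep full width, trim height.
        new_w, new_h = width, (width * 3) // 4
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)
```

The returned tuple uses the same box convention as Pillow's `Image.crop`, so it could be followed by `resize((640, 480))` before saving my_obs.jpg.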

5. Create an inference input JSON

Save the following as my_policy_input.json:

[{
    "model_mode": "policy",
    "name": "my_task",
    "domain_name": "g1_brainco",
    "fps": 15,
    "image_size": 480,
    "action_chunk_size": 16,
    "raw_action_dim": 26,
    "view_point": "ego_view",
    "prompt": "{\"subjects\": [{\"description\": \"A Unitree G1 humanoid robot with articulated arms and dexterous hands\", \"action\": \"Pick up the red cup from the table\"}], \"background_setting\": \"An indoor workspace\", \"cinematography\": {\"camera_motion\": \"static\", \"framing\": \"ego-perspective view from the robot head camera looking at the workspace\", \"camera_angle\": \"ego\"}, \"actions\": [{\"time\": \"0:00-1s\", \"description\": \"Pick up the red cup from the table\"}], \"temporal_caption\": \"A Unitree G1 humanoid robot performs the task: Pick up the red cup from the table.\", \"resolution\": {\"H\": 480, \"W\": 640}, \"aspect_ratio\": \"4,3\", \"duration\": \"1s\", \"fps\": 15.0}",
    "seed": 42,
    "vision_path": "my_obs.jpg"
}]

Replace "Pick up the red cup from the table" with your actual task description.
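
Because the prompt field is itself a JSON document stored as an escaped string, editing it by hand is error-prone. A minimal sketch that builds the same entry programmatically follows. The field names and values are copied from the example above; the helper `build_policy_input` is hypothetical, and which prompt fields the model actually relies on is not stated in this card.

```python
import json


def build_policy_input(task, image_path, name="my_task", seed=42):
    """Build one policy-mode input entry with the nested prompt serialized."""
    prompt = {
        "subjects": [{
            "description": "A Unitree G1 humanoid robot with articulated arms and dexterous hands",
            "action": task,
        }],
        "background_setting": "An indoor workspace",
        "cinematography": {
            "camera_motion": "static",
            "framing": "ego-perspective view from the robot head camera looking at the workspace",
            "camera_angle": "ego",
        },
        "actions": [{"time": "0:00-1s", "description": task}],
        "temporal_caption": f"A Unitree G1 humanoid robot performs the task: {task}.",
        "resolution": {"H": 480, "W": 640},
        "aspect_ratio": "4,3",
        "duration": "1s",
        "fps": 15.0,
    }
    return {
        "model_mode": "policy",
        "name": name,
        "domain_name": "g1_brainco",
        "fps": 15,
        "image_size": 480,
        "action_chunk_size": 16,
        "raw_action_dim": 26,
        "view_point": "ego_view",
        # json.dumps handles the quote escaping that the prompt string needs.
        "prompt": json.dumps(prompt),
        "seed": seed,
        "vision_path": image_path,
    }
```

To produce the input file, wrap the entry in a list and write it out, for example `json.dump([build_policy_input("Pick up an apple from the table", "my_obs.jpg")], open("my_policy_input.json", "w"), indent=4)`.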

6. Run single-chunk policy inference (1.07 seconds = 16 frames + 26D actions)

export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50014 \
    -m cosmos_framework.scripts.inference \
    --checkpoint-path examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
    --no-use-torch-compile \
    --no-guardrails \
    --output-dir outputs/my_policy_run \
    -i my_policy_input.json

Outputs (in outputs/my_policy_run/my_task/):

vision.mp4 — 16-frame predicted video at 15 fps
sample_outputs.json — predicted 26D joint-angle trajectory (normalized)

7. Chunked rollout — complete a full task (~5 seconds = 5 chunks × 16 frames)

For a full task completion, use the autoregressive rollout script:

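# Provide your initial frame as an MP4 (single frame or short clip).
# Renaming a JPEG to .mp4 does not produce a valid video, so encode it (requires ffmpeg):
mkdir -p /tmp/cosmos_outputs/g1_inference
ffmpeg -y -loop 1 -i my_obs.jpg -frames:v 1 -pix_fmt yuv420p \
    /tmp/cosmos_outputs/g1_inference/mycustom_init.mp4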

export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50015 \
    examples/policy_rollout.py \
    --checkpoint-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
    --base-checkpoint-dir examples/checkpoints/Cosmos3-Nano \
    --output-dir outputs/policy_rollout_2k \
    --norm-stats-path examples/data/g1_brainco/action_stats.json \
    --n-chunks 5 \
    --tasks mycustom

This generates:

rollout.mp4 — full 80-frame video (~5.3 seconds at 15 fps)
actions_raw.npy — 80×26 joint-angle array in radians
actions_raw.json — same, human-readable JSON
rollout_meta.json — metadata
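
The frame counts and durations above follow directly from the chunk length and frame rate. The sketch below reproduces that arithmetic and assigns a timestamp to each action row. It assumes chunks are concatenated without overlap and that the first row corresponds to t = 0; neither is stated explicitly in this card, and the function names are hypothetical.

```python
ACTION_FPS = 15   # action and video rate from the model card
CHUNK_LEN = 16    # frames (and action steps) per chunk


def rollout_length(n_chunks):
    """Return (total_frames, duration_seconds) for an n-chunk rollout."""
    total_frames = n_chunks * CHUNK_LEN
    return total_frames, total_frames / ACTION_FPS


def step_timestamps(n_chunks):
    """Timestamp in seconds of each action row, starting at t = 0."""
    total_frames, _ = rollout_length(n_chunks)
    return [i / ACTION_FPS for i in range(total_frames)]
```

For `--n-chunks 5` this gives 80 rows spanning about 5.33 seconds, matching the shape of actions_raw.npy.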

Output Format

Predicted actions

Actions written by the rollout script (actions_raw.npy, actions_raw.json) are in raw joint-angle space (radians). Each 26-value row is laid out as:

Columns 0–6:   Left arm (7 DOF)
Columns 7–13:  Right arm (7 DOF)
Columns 14–19: Left hand / BrainCo (6 DOF)
Columns 20–25: Right hand / BrainCo (6 DOF)
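
The column layout can be captured as named slices so downstream code does not hard-code indices. This is a minimal sketch: the names `ACTION_GROUPS` and `split_action` are hypothetical, and the order of individual joints within each group is not documented in this card.

```python
# Column ranges taken from the layout above (end index is exclusive).
ACTION_GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "left_hand": slice(14, 20),
    "right_hand": slice(20, 26),
}


def split_action(action):
    """Split one 26-D action row into named joint groups."""
    if len(action) != 26:
        raise ValueError(f"expected 26 values, got {len(action)}")
    return {name: action[cols] for name, cols in ACTION_GROUPS.items()}
```

The function accepts a Python list or a single row of a NumPy array, for example `split_action(actions[0])`.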

In the sample_outputs.json from the inference CLI, the action field contains normalized values in [-1, 1]. To convert to radians, apply:

import json

import numpy as np

# Load the per-joint quantile stats used for normalization
with open("examples/data/g1_brainco/action_stats.json") as f:
    stats = json.load(f)
q01 = np.array(stats["q01"])  # shape (26,)
q99 = np.array(stats["q99"])  # shape (26,)

# Load normalized actions from the inference output
with open("outputs/my_policy_run/my_task/sample_outputs.json") as f:
    result = json.load(f)
actions_norm = np.array(result["outputs"][0]["content"]["action"])  # shape (16, 26)

# Denormalize: -1 maps to q01, +1 maps to q99
actions_rad = q01 + (actions_norm + 1.0) / 2.0 * (q99 - q01)
print(f"Actions shape: {actions_rad.shape}")   # (16, 26)
print(f"Left arm (first step): {actions_rad[0, :7]}")

Action Normalization Stats

The action_stats.json file used to normalize/denormalize actions is included at examples/data/g1_brainco/action_stats.json in the Cosmos3 repo. Key fields:

q01 — 1st percentile per joint (26 values)
q99 — 99th percentile per joint (26 values)
mean, std, min, max — also available
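
For reference, the forward mapping implied by the q01/q99 → [-1, 1] convention is the algebraic inverse of the denormalization formula given under "Predicted actions". The sketch below shows both directions. It is not taken from the Cosmos3 code: whether the training pipeline clips values outside [-1, 1] is not stated here, and a joint with q01 equal to q99 would cause a division by zero.

```python
import numpy as np


def normalize(actions_rad, q01, q99):
    """Map radians to the normalized range: q01 -> -1, q99 -> +1."""
    q01 = np.asarray(q01, dtype=float)
    q99 = np.asarray(q99, dtype=float)
    return 2.0 * (np.asarray(actions_rad, dtype=float) - q01) / (q99 - q01) - 1.0


def denormalize(actions_norm, q01, q99):
    """Inverse of normalize: -1 -> q01, +1 -> q99."""
    q01 = np.asarray(q01, dtype=float)
    q99 = np.asarray(q99, dtype=float)
    return q01 + (np.asarray(actions_norm, dtype=float) + 1.0) / 2.0 * (q99 - q01)
```

Values beyond the 1st or 99th percentile map slightly outside [-1, 1], which is expected with quantile normalization.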

Training Details

Parameter	Value
Base model	Cosmos3-Nano
Optimizer	AdamW (β₁=0.9, β₂=0.95, ε=1e-6)
Learning rate	5e-6 (lower than in forward-dynamics (FD) mode, because the policy must generate actions from scratch)
LR schedule	Cosine schedule with 100 warmup steps, 10k total iterations
Batch size	1 per GPU × 8 GPUs = 8 effective
Grad clip norm	0.1
Mixed precision	bfloat16
Parallelism	FSDP (8 GPUs fully sharded)
EMA	rate=0.1
Trained params	`moe_gen`, `time_embedder`, `vae2llm`, `llm2vae` (action-relevant heads)
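
To get a feel for where iter 1000 sits on the learning-rate schedule, here is a minimal sketch of a warmup-plus-cosine schedule using the values from the table. The linear warmup shape and the decay to a minimum of zero are assumptions; the exact schedule implemented in Cosmos3 may differ, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 5e-6        # learning rate from the table
WARMUP_STEPS = 100    # warmup length from the table
TOTAL_STEPS = 10_000  # total iterations from the table


def lr_at(step, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate at step 1000 is still within a few percent of the peak, which is consistent with the loss still decreasing at this checkpoint.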

Limitations

Trained for only 1,000 iterations (an early checkpoint; the loss was still decreasing). Better checkpoints at 5k/7k/10k iterations will be released.
Generalizes well to the 8 trained task categories; zero-shot performance on tasks very different from these is limited.
Action predictions at iter 1000 may drift over long rollouts; short rollouts (≤3 chunks) give the best quality.
Dataset uses 28-minute shared video files (not individual episode files), which limits training speed.

Citation

If you use this model, please cite:

@misc{cosmos3-g1-policy-sft-2k,
  title  = {Cosmos3-Nano G1 BrainCo Policy SFT (iter 1000)},
  author = {JeffrinSam},
  year   = {2026},
  url    = {https://huggingface.co/JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT},
  note   = {Fine-tuned from NVIDIA Cosmos3-Nano on Unitree G1 humanoid manipulation tasks}
}

Related Models

Cosmos3-Nano-G1-BrainCo-ActionSFT — Forward-dynamics mode (iter 800), requires action input
NVIDIA Cosmos3 — Base framework
