Cosmos3-Nano: G1 BrainCo Policy SFT (iter 1000)

Model: Cosmos3-Nano fine-tuned in policy mode on the Unitree G1 humanoid robot with BrainCo dexterous hands.

Given a single ego-camera image and a task description, the model simultaneously predicts:

  • Future video: 16 frames at 15 fps showing the robot completing the task
  • Actions: a 26D joint-angle trajectory (7 left arm + 7 right arm + 6 left hand + 6 right hand)

No action input is required at inference time; the model generates actions from vision and language alone.


Model Details

| Property | Value |
|---|---|
| Base model | Cosmos3-Nano (Qwen3-VL-8B backbone, ~30B params) |
| Fine-tuning mode | Policy SFT (image + prompt → video + actions) |
| Training checkpoint | iter 1000 |
| Training dataset | G1 BrainCo: 8 manipulation tasks × ~300 episodes = ~1,598 episodes |
| Action space | 26 DOF: 7 left arm + 7 right arm + 6 left hand + 6 right hand (BrainCo) |
| Camera | Ego-view from robot head (cam_left_high), 640×480, 30 fps |
| Action FPS | 15 Hz |
| Chunk length | 16 frames (1.07 seconds per chunk) |
| Action normalization | Quantile (q01/q99 → [-1, 1]) |
| Training hardware | 8× NVIDIA A100/H100 80 GB, FSDP |
| Training framework | Cosmos3 (NVIDIA) |

Tasks Trained On

| Task | Description |
|---|---|
| GraspOreo | Grasp an Oreo cookie from the table |
| GraspRubiksCube | Grasp a Rubik's cube from the table |
| PickApple | Pick up an apple from the table |
| PickCharger | Pick up a phone charger from the table |
| PickDoll | Pick up a doll from the table |
| PickDrink | Pick up a drink bottle from the table |
| PickTissues | Pick up a tissue box from the table |
| PickToothpaste | Pick up a toothpaste tube from the table |

Repository Structure

Cosmos3-Nano-G1-BrainCo-PolicySFT/
├── config.json                        # Model architecture config (HF format)
├── checkpoint.json                    # Checkpoint metadata
├── model.safetensors.index.json       # Shard index
├── model-00001-of-00007.safetensors   # Weight shard 1/7 (~4.6 GB)
├── model-00002-of-00007.safetensors   # Weight shard 2/7 (~5.0 GB)
├── model-00003-of-00007.safetensors   # Weight shard 3/7 (~4.6 GB)
├── model-00004-of-00007.safetensors   # Weight shard 4/7 (~5.0 GB)
├── model-00005-of-00007.safetensors   # Weight shard 5/7 (~5.0 GB)
├── model-00006-of-00007.safetensors   # Weight shard 6/7 (~4.0 GB)
├── model-00007-of-00007.safetensors   # Weight shard 7/7 (~2.4 GB)
└── README.md

Total size: ~30 GB
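
Before loading the checkpoint, it can help to confirm that every shard named in the index was actually downloaded. The sketch below assumes model.safetensors.index.json follows the usual Hugging Face sharded-checkpoint layout (a "weight_map" from tensor name to shard file name). The helper name missing_shards is illustrative and not part of Cosmos3.

```python
import json
from pathlib import Path


def missing_shards(checkpoint_dir):
    """Return shard files listed in the index but absent from checkpoint_dir.

    Assumes the usual Hugging Face index layout: a "weight_map" dict that
    maps each tensor name to the shard file that stores it.
    """
    ckpt = Path(checkpoint_dir)
    with open(ckpt / "model.safetensors.index.json") as f:
        index = json.load(f)
    expected = sorted(set(index["weight_map"].values()))
    return [name for name in expected if not (ckpt / name).is_file()]


if __name__ == "__main__":
    ckpt_dir = "examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k"
    # Only run the check if the checkpoint has been downloaded to this path.
    if Path(ckpt_dir, "model.safetensors.index.json").is_file():
        print("Missing shards:", missing_shards(ckpt_dir))
```

An empty list means all shards referenced by the index are present; it does not verify file integrity.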


How to Run from Scratch

1. Clone and install Cosmos3

# Clone the Cosmos3 repo
git clone https://github.com/nvidia/cosmos3 cosmos3
cd cosmos3/packages/cosmos3

# Install with training extras (requires Python 3.11+, CUDA 13.0 or 12.8)
uv sync --all-extras --group=cu130-train
# Or for CUDA 12.8:
# uv sync --all-extras --group=cu128-train

# Required in NGC/PyTorch containers to avoid torch._C import errors:
export LD_LIBRARY_PATH=''

2. Download this model checkpoint

mkdir -p examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k

# Using huggingface_hub CLI:
pip install huggingface_hub
huggingface-cli download JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT \
    --local-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k

# Or in Python:
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT',
    local_dir='examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k'
)
"

3. Download the Wan2.2 VAE (required for video decoding)

pip install uvx
uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \
    --local-dir examples/checkpoints/wan22_vae

4. Prepare your initial observation image

Save your robot's ego-camera image as a JPEG or PNG file, e.g. my_obs.jpg.

  • Resolution: 640×480 recommended (or any 4:3 ratio)
  • The camera should be mounted on the robot's head, looking down at the workspace

5. Create an inference input JSON

Save the following as my_policy_input.json:

[{
    "model_mode": "policy",
    "name": "my_task",
    "domain_name": "g1_brainco",
    "fps": 15,
    "image_size": 480,
    "action_chunk_size": 16,
    "raw_action_dim": 26,
    "view_point": "ego_view",
    "prompt": "{\"subjects\": [{\"description\": \"A Unitree G1 humanoid robot with articulated arms and dexterous hands\", \"action\": \"Pick up the red cup from the table\"}], \"background_setting\": \"An indoor workspace\", \"cinematography\": {\"camera_motion\": \"static\", \"framing\": \"ego-perspective view from the robot head camera looking at the workspace\", \"camera_angle\": \"ego\"}, \"actions\": [{\"time\": \"0:00-1s\", \"description\": \"Pick up the red cup from the table\"}], \"temporal_caption\": \"A Unitree G1 humanoid robot performs the task: Pick up the red cup from the table.\", \"resolution\": {\"H\": 480, \"W\": 640}, \"aspect_ratio\": \"4,3\", \"duration\": \"1s\", \"fps\": 15.0}",
    "seed": 42,
    "vision_path": "my_obs.jpg"
}]

Replace "Pick up the red cup from the table" with your actual task description.
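
Because the prompt field is a JSON document embedded as an escaped string, editing it by hand is error-prone. The sketch below generates my_policy_input.json for a given task, assuming the field values from the example above are what the model expects. build_policy_input is a hypothetical helper, not a Cosmos3 API.

```python
import json


def build_policy_input(task, image_path, name="my_task", seed=42):
    """Build one inference entry mirroring the fields of the example above."""
    prompt = {
        "subjects": [{
            "description": "A Unitree G1 humanoid robot with articulated arms and dexterous hands",
            "action": task,
        }],
        "background_setting": "An indoor workspace",
        "cinematography": {
            "camera_motion": "static",
            "framing": "ego-perspective view from the robot head camera looking at the workspace",
            "camera_angle": "ego",
        },
        "actions": [{"time": "0:00-1s", "description": task}],
        "temporal_caption": f"A Unitree G1 humanoid robot performs the task: {task}.",
        "resolution": {"H": 480, "W": 640},
        "aspect_ratio": "4,3",
        "duration": "1s",
        "fps": 15.0,
    }
    return {
        "model_mode": "policy",
        "name": name,
        "domain_name": "g1_brainco",
        "fps": 15,
        "image_size": 480,
        "action_chunk_size": 16,
        "raw_action_dim": 26,
        "view_point": "ego_view",
        # The prompt field is itself a JSON document serialized to a string.
        "prompt": json.dumps(prompt),
        "seed": seed,
        "vision_path": image_path,
    }


if __name__ == "__main__":
    entries = [build_policy_input("Pick up an apple from the table", "my_obs.jpg")]
    with open("my_policy_input.json", "w") as f:
        json.dump(entries, f, indent=4)
```

Letting json.dumps handle the nested quoting keeps the task text consistent across the three places it appears in the prompt.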

6. Run single-chunk policy inference (1.07 seconds = 16 frames + 26D actions)

export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50014 \
    -m cosmos_framework.scripts.inference \
    --checkpoint-path examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
    --no-use-torch-compile \
    --no-guardrails \
    --output-dir outputs/my_policy_run \
    -i my_policy_input.json

Outputs (in outputs/my_policy_run/my_task/):

  • vision.mp4: 16-frame predicted video at 15 fps
  • sample_outputs.json: predicted 26D joint-angle trajectory (normalized)

7. Chunked rollout: complete a full task (~5 seconds = 5 chunks × 16 frames)

For a full task completion, use the autoregressive rollout script:

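# Provide your initial frame as an MP4 (single frame or short clip).
# Encode the image with ffmpeg; a plain cp would not produce a valid MP4 container.
mkdir -p /tmp/cosmos_outputs/g1_inference
ffmpeg -y -i my_obs.jpg -frames:v 1 -pix_fmt yuv420p \
    /tmp/cosmos_outputs/g1_inference/mycustom_init.mp4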

export LD_LIBRARY_PATH=''
CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=50015 \
    examples/policy_rollout.py \
    --checkpoint-dir examples/checkpoints/Cosmos3-Nano-G1-PolicySFT-2k \
    --base-checkpoint-dir examples/checkpoints/Cosmos3-Nano \
    --output-dir outputs/policy_rollout_2k \
    --norm-stats-path examples/data/g1_brainco/action_stats.json \
    --n-chunks 5 \
    --tasks mycustom

This generates:

  • rollout.mp4: full 80-frame video (~5.3 seconds at 15 fps)
  • actions_raw.npy: 80×26 joint-angle array in radians
  • actions_raw.json: same data as human-readable JSON
  • rollout_meta.json: metadata
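
The frame counts and durations quoted for the rollout follow from the chunk size and frame rate. The sketch below reproduces that arithmetic and attaches a timestamp to each action row. It assumes that row i of the action array corresponds to time i / fps starting at zero, which the model card does not state explicitly. rollout_timing is a hypothetical helper.

```python
def rollout_timing(n_chunks, chunk_size=16, fps=15):
    """Return (total_frames, duration_seconds, per-frame timestamps)."""
    total_frames = n_chunks * chunk_size
    # Assumed convention: frame i is at time i / fps, starting from 0.
    timestamps = [i / fps for i in range(total_frames)]
    return total_frames, total_frames / fps, timestamps


frames, seconds, t = rollout_timing(5)
print(frames, round(seconds, 2))   # 80 5.33
print(round(t[16], 3))             # 1.067, start of the second chunk
```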

Output Format

Predicted actions

Actions in raw joint-angle space (radians), such as the rows of actions_raw.npy from the rollout script, use this column layout:

Columns 0–6:   Left arm (7 DOF)
Columns 7–13:  Right arm (7 DOF)
Columns 14–19: Left hand / BrainCo (6 DOF)
Columns 20–25: Right hand / BrainCo (6 DOF)
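
The column layout can be expressed as named slices so that downstream code does not hard-code indices. This is a minimal sketch: the group names and the split_action helper are illustrative, and the order of individual joints within each group is not specified here.

```python
# Column ranges taken from the layout above (slice end index is exclusive).
JOINT_GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "left_hand": slice(14, 20),
    "right_hand": slice(20, 26),
}


def split_action(row):
    """Split one 26-value action row into named joint groups."""
    if len(row) != 26:
        raise ValueError(f"expected 26 values, got {len(row)}")
    return {name: list(row[cols]) for name, cols in JOINT_GROUPS.items()}


groups = split_action(list(range(26)))
print(groups["left_hand"])   # [14, 15, 16, 17, 18, 19]
```

The same slices work on a Python list or on a single row of a NumPy array.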

In the sample_outputs.json from the inference CLI, the action field contains normalized values in [-1, 1]. To convert to radians, apply:

import json

import numpy as np

# Load action stats
with open("examples/data/g1_brainco/action_stats.json") as f:
    stats = json.load(f)
q01 = np.array(stats["q01"])
q99 = np.array(stats["q99"])

# Load normalized actions from inference output
with open("outputs/my_policy_run/my_task/sample_outputs.json") as f:
    result = json.load(f)
actions_norm = np.array(result["outputs"][0]["content"]["action"])  # shape (16, 26)

# Denormalize
actions_rad = q01 + (actions_norm + 1.0) / 2.0 * (q99 - q01)
print(f"Actions shape: {actions_rad.shape}")   # (16, 26)
print(f"Left arm (first step): {actions_rad[0, :7]}")

Action Normalization Stats

The action_stats.json file used to normalize/denormalize actions is included at examples/data/g1_brainco/action_stats.json in the Cosmos3 repo. Key fields:

  • q01: 1st percentile per joint (26 values)
  • q99: 99th percentile per joint (26 values)
  • mean, std, min, max: also available
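
For reference, the forward (raw to normalized) mapping is the algebraic inverse of the denormalization formula in the Output Format section. The sketch below shows both directions on synthetic data. Clipping to [-1, 1] is an assumption, since the model card does not say how values outside the q01/q99 range are treated during training, and the random data stands in for real joint angles.

```python
import numpy as np


def normalize(actions_rad, q01, q99):
    """Map raw joint angles to [-1, 1] using per-joint q01/q99 quantiles."""
    scaled = 2.0 * (actions_rad - q01) / (q99 - q01) - 1.0
    # Clipping is an assumption: values outside [q01, q99] would otherwise
    # fall outside [-1, 1].
    return np.clip(scaled, -1.0, 1.0)


def denormalize(actions_norm, q01, q99):
    """Inverse mapping, matching the formula in the Output Format section."""
    return q01 + (actions_norm + 1.0) / 2.0 * (q99 - q01)


# Synthetic stand-in data; real statistics come from action_stats.json.
rng = np.random.default_rng(0)
raw = rng.uniform(-1.5, 1.5, size=(1000, 26))
q01 = np.quantile(raw, 0.01, axis=0)
q99 = np.quantile(raw, 0.99, axis=0)

norm = normalize(raw, q01, q99)
inside = (raw >= q01) & (raw <= q99)
# Values inside the quantile range survive the round trip unchanged.
print(np.allclose(denormalize(norm, q01, q99)[inside], raw[inside]))  # True
```

Values outside the quantile range are not recoverable after clipping, so denormalized predictions are bounded by q01 and q99 for each joint.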

Training Details

| Parameter | Value |
|---|---|
| Base model | Cosmos3-Nano |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-6) |
| Learning rate | 5e-6 (lower than FD mode; policy mode generates actions from scratch) |
| LR schedule | Cosine with warmup (100 steps) over 10k total iterations |
| Batch size | 1 per GPU × 8 GPUs = 8 effective |
| Grad clip norm | 0.1 |
| Mixed precision | bfloat16 |
| Parallelism | FSDP (8 GPUs fully sharded) |
| EMA | rate=0.1 |
| Trained params | moe_gen, time_embedder, vae2llm, llm2vae (action-relevant heads) |
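
To give a feel for where the iter-1000 checkpoint sits on the learning-rate schedule, the sketch below implements one common reading of "cosine with warmup": linear warmup to the peak rate, then cosine decay over the remaining steps. The exact shape used by Cosmos3 (warmup form, minimum rate) is not stated here, so lr_at and its min_lr default are assumptions.

```python
import math


def lr_at(step, base_lr=5e-6, warmup_steps=100, total_steps=10_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        # Warmup: ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 1000, 10_000):
    print(step, f"{lr_at(step):.3e}")
```

Under this assumed shape, the learning rate at iteration 1000 is still about 98% of its peak, which is consistent with this being an early checkpoint.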

Limitations

  • Trained for only 1,000 iterations (an early checkpoint; loss was still decreasing). Better checkpoints at 5k/7k/10k will be released.
  • Generalizes well to the 8 trained task categories; zero-shot performance on very different tasks is limited.
  • Action predictions at iter 1000 may drift over long rollouts; shorter rollouts (≤3 chunks) give the best quality.
  • Dataset uses 28-minute shared video files (not individual episode files), which limits training speed.

Citation

If you use this model, please cite:

@misc{cosmos3-g1-policy-sft-2k,
  title  = {Cosmos3-Nano G1 BrainCo Policy SFT (iter 1000)},
  author = {JeffrinSam},
  year   = {2026},
  url    = {https://huggingface.co/JeffrinSam/Cosmos3-Nano-G1-BrainCo-PolicySFT},
  note   = {Fine-tuned from NVIDIA Cosmos3-Nano on Unitree G1 humanoid manipulation tasks}
}
