Cosmos3-Nano-G1-BrainCo-ActionSFT

Cosmos3-Nano fine-tuned for Unitree G1 + BrainCo forward-dynamics prediction

This model is a supervised fine-tuned (SFT) version of NVIDIA Cosmos3-Nano trained on the G1 BrainCo manipulation dataset. Given an initial video clip, a task prompt, and an action sequence, it jointly predicts future video frames and the robot's 26-DOF joint-angle trajectory.

Model Details

Property	Value
Base model	nvidia/Cosmos3-Nano
Fine-tuning mode	`forward_dynamics`
Robot embodiment	Unitree G1 + BrainCo dexterous hands
Action space	26D joint angles (7L-arm + 7R-arm + 6L-hand + 6R-hand)
Domain ID	30 (`g1_brainco`)
Training iterations	800
Checkpoint saved every	200 iterations
Learning rate	1e-5
Optimizer	AdamW (β=0.9/0.95, ε=1e-6)
Precision	bfloat16
Action loss weight	10.0
EMA	enabled (rate=0.1)
LoRA	disabled (full fine-tune)
Hardware	8× NVIDIA L40 (49 GB)

Training Data

The model was trained on 8 manipulation tasks from the G1 BrainCo LeRobot dataset:

Task	Episodes	Frames
GraspOreo	201	~198K
GraspRubiksCube	197	~130K
PickApple	200	~120K
PickCharger	200	~150K
PickDoll	200	~276K
PickDrink	201	~143K
PickTissues	206	~198K
PickToothpaste	193	~312K

Dataset format: LeRobot v3.0 — 30 Hz, 4 camera views (left/right high + left/right wrist), 26D absolute joint angles.
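
The dataset is recorded at 30 Hz, while the inference example later in this card uses fps 5 with 16-step action chunks. This card does not say how the lower rate is obtained. The sketch below assumes plain stride subsampling (keeping every sixth frame), and the helper name `chunk_indices` is hypothetical.

```python
SOURCE_HZ = 30    # dataset recording rate stated in this card
TARGET_FPS = 5    # "fps" value used in the inference example
CHUNK_SIZE = 16   # "action_chunk_size" used in the inference example

# Assumption: the 5 fps stream is obtained by keeping every 6th source frame.
stride = SOURCE_HZ // TARGET_FPS

def chunk_indices(start, n_frames):
    """Return the source-frame indices covered by one 16-step chunk."""
    indices = [start + i * stride for i in range(CHUNK_SIZE)]
    if indices[-1] >= n_frames:
        raise ValueError("episode is too short for a full chunk from this start frame")
    return indices

indices = chunk_indices(start=0, n_frames=120)
chunk_seconds = CHUNK_SIZE / TARGET_FPS  # duration covered by one chunk
```

Under this assumption, one chunk covers 3.2 seconds of robot motion.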

Action normalization: quantile (q01/q99 → [-1, 1]) per joint dimension.

How to Use

Setup

git clone https://huggingface.co/jfgpt/Cosmos3-Nano-G1-BrainCo-ActionSFT
pip install cosmos-framework  # or: uv sync --all-extras --group=cu130
export LD_LIBRARY_PATH=''

Forward-Dynamics Inference

Given an initial video clip and an action sequence, the model predicts the future video and the next actions.

Input JSON (my_input.json):

{
  "model_mode": "forward_dynamics",
  "name": "pick_apple",
  "domain_name": "g1_brainco",
  "fps": 5,
  "image_size": 480,
  "action_chunk_size": 16,
  "raw_action_dim": 26,
  "view_point": "ego_view",
  "prompt": "{\"subjects\":[{\"description\":\"A Unitree G1 humanoid robot with articulated arms and dexterous hands\",\"action\":\"Pick up an apple from the table\"}],\"background_setting\":\"An indoor workspace\",\"cinematography\":{\"camera_motion\":\"static\",\"framing\":\"top-down wide-angle view\",\"camera_angle\":\"overhead\"}}",
  "seed": 42,
  "vision_path": "/path/to/initial_clip.mp4",
  "action_path": "/path/to/initial_actions.json"
}
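
The `prompt` field is itself a JSON document stored as an escaped string, which is easy to get wrong by hand. One way to produce the file is to build the prompt as a Python dict and serialize it twice, as sketched below. The field values are copied from the example above, and the paths are placeholders.

```python
import json

# Structured prompt; it is serialized to a string before being embedded.
prompt = {
    "subjects": [{
        "description": "A Unitree G1 humanoid robot with articulated arms and dexterous hands",
        "action": "Pick up an apple from the table",
    }],
    "background_setting": "An indoor workspace",
    "cinematography": {
        "camera_motion": "static",
        "framing": "top-down wide-angle view",
        "camera_angle": "overhead",
    },
}

config = {
    "model_mode": "forward_dynamics",
    "name": "pick_apple",
    "domain_name": "g1_brainco",
    "fps": 5,
    "image_size": 480,
    "action_chunk_size": 16,
    "raw_action_dim": 26,
    "view_point": "ego_view",
    # Inner serialization: the prompt becomes a compact JSON string.
    "prompt": json.dumps(prompt, separators=(",", ":")),
    "seed": 42,
    "vision_path": "/path/to/initial_clip.mp4",      # placeholder path
    "action_path": "/path/to/initial_actions.json",  # placeholder path
}

# Outer serialization: json.dump escapes the quotes inside the prompt string.
with open("my_input.json", "w") as f:
    json.dump(config, f, indent=2)
```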

initial_actions.json — list of 16 × 26D joint-angle vectors (raw, before normalization):

[
  [0.12, -0.05, -0.28, 0.19, 0.69, -0.02, 0.43, 0.17, 0.08, 0.01, 0.31, 0.01, -0.43, -0.16, 0.37, 0.45, 0.21, 0.27, 0.29, 0.33, 0.0, 0.66, 0.18, 0.29, 0.28, 0.25],
  ...
]
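
A small check before writing the file can catch shape mistakes early. The sketch below assumes the file is a plain nested list of 16 rows by 26 values, as described above. The helper name `save_action_chunk` is hypothetical and the zero-filled array is only a placeholder for real joint angles.

```python
import json

import numpy as np

CHUNK_SIZE = 16  # matches "action_chunk_size" in the input JSON
ACTION_DIM = 26  # matches "raw_action_dim" in the input JSON

def save_action_chunk(actions, path):
    """Validate a (16, 26) chunk of raw joint angles and write it as JSON."""
    arr = np.asarray(actions, dtype=np.float64)
    if arr.shape != (CHUNK_SIZE, ACTION_DIM):
        raise ValueError(f"expected shape ({CHUNK_SIZE}, {ACTION_DIM}), got {arr.shape}")
    if not np.isfinite(arr).all():
        raise ValueError("actions contain NaN or infinite values")
    with open(path, "w") as f:
        json.dump(arr.tolist(), f)
    return arr

# Placeholder data; replace with raw (unnormalized) joint angles in radians.
saved = save_action_chunk(np.zeros((CHUNK_SIZE, ACTION_DIM)), "initial_actions.json")
```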

Run:

torchrun --nproc_per_node=4 \
  -m cosmos_framework.scripts.inference \
  --checkpoint-path /path/to/Cosmos3-Nano-G1-BrainCo-ActionSFT \
  --parallelism-preset latency \
  --no-guardrails \
  --output-dir outputs/ \
  -i my_input.json

Outputs (in outputs/pick_apple/):

vision.mp4 — predicted future video frames
action.json — predicted 16-step joint-angle trajectory (normalized)

Joint Name Order

0:  kLeftShoulderPitch     7:  kRightShoulderPitch
1:  kLeftShoulderRoll      8:  kRightShoulderRoll
2:  kLeftShoulderYaw       9:  kRightShoulderYaw
3:  kLeftElbow             10: kRightElbow
4:  kLeftWristRoll         11: kRightWristRoll
5:  kLeftWristPitch        12: kRightWristPitch
6:  kLeftWristYaw          13: kRightWristYaw
14: kLeftHandThumb         20: kRightHandThumb
15: kLeftHandThumbAux      21: kRightHandThumbAux
16: kLeftHandIndex         22: kRightHandIndex
17: kLeftHandMiddle        23: kRightHandMiddle
18: kLeftHandRing          24: kRightHandRing
19: kLeftHandPinky         25: kRightHandPinky
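
For convenience, the ordering above can be written as a lookup table. The sketch below rebuilds the 26 names in index order and labels one action vector. The helper `label_action` is hypothetical, and the sample vector is the first row of the initial_actions.json example in this card.

```python
ARM_JOINTS = ["ShoulderPitch", "ShoulderRoll", "ShoulderYaw", "Elbow",
              "WristRoll", "WristPitch", "WristYaw"]
HAND_JOINTS = ["Thumb", "ThumbAux", "Index", "Middle", "Ring", "Pinky"]

# Index order: left arm (0-6), right arm (7-13), left hand (14-19), right hand (20-25).
JOINT_NAMES = (
    [f"kLeft{j}" for j in ARM_JOINTS]
    + [f"kRight{j}" for j in ARM_JOINTS]
    + [f"kLeftHand{j}" for j in HAND_JOINTS]
    + [f"kRightHand{j}" for j in HAND_JOINTS]
)

def label_action(vec):
    """Pair each value of a 26D action vector with its joint name."""
    if len(vec) != len(JOINT_NAMES):
        raise ValueError(f"expected {len(JOINT_NAMES)} values, got {len(vec)}")
    return dict(zip(JOINT_NAMES, vec))

sample = [0.12, -0.05, -0.28, 0.19, 0.69, -0.02, 0.43,
          0.17, 0.08, 0.01, 0.31, 0.01, -0.43, -0.16,
          0.37, 0.45, 0.21, 0.27, 0.29, 0.33,
          0.0, 0.66, 0.18, 0.29, 0.28, 0.25]
labeled = label_action(sample)
```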

Action Normalization Stats

Use examples/data/g1_brainco/action_stats.json from the training repo for denormalization:

import json

import numpy as np

# Per-joint quantile statistics from the training repo.
with open("action_stats.json") as f:
    stats = json.load(f)["global"]
q01 = np.array(stats["q01"])
q99 = np.array(stats["q99"])

def denormalize(normalized_action):
    """Convert model output [-1, 1] back to raw joint angles (radians)."""
    return (normalized_action + 1.0) / 2.0 * (q99 - q01) + q01

Inference Results (iter 800)

Evaluated on the held-out last episode of each task. Actions predicted in 16-step chunks at 5 fps.

Task	MAE (rad)	RMSE (rad)	Max Err (rad)
GraspOreo	0.3293	0.4673	1.2437
GraspRubiksCube	0.3529	0.4995	1.1093
PickApple	0.1932	0.2827	0.6504
PickCharger	0.3924	0.5281	1.3410
PickDoll	0.4292	0.5131	0.8699
PickDrink	0.3487	0.4535	1.0586
PickTissues	0.2121	0.3279	0.9474
PickToothpaste	0.3759	0.4559	0.8488
Average	0.3292	0.4410	—
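
The three error columns can be computed as sketched below. This card does not state how errors are aggregated, so the sketch assumes they are pooled over all timesteps and all 26 joints of the denormalized trajectories. The function name `action_errors` and the toy arrays are illustrative only.

```python
import numpy as np

def action_errors(pred, target):
    """Return MAE, RMSE and max absolute error between two trajectories (radians)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    abs_err = np.abs(pred - target)
    # Assumption: pooled over every timestep and joint, not averaged per joint first.
    return {
        "mae": float(abs_err.mean()),
        "rmse": float(np.sqrt((abs_err ** 2).mean())),
        "max_err": float(abs_err.max()),
    }

# Toy 2-step, 2-joint example.
metrics = action_errors(np.zeros((2, 2)), [[0.0, 0.1], [0.2, 0.3]])
```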

Note: This is an early checkpoint (800 iterations). Results are expected to improve with longer training; 2000–5000 iterations are recommended.

Switching to Policy Mode

To use this checkpoint as a closed-loop policy (image + prompt → video + action, no action input needed), change model_mode to "policy" and remove action_path. For policy SFT training from this checkpoint, see the Cosmos3 policy fine-tuning guide.
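
The change described above can be applied to an existing forward-dynamics input in a few lines. The sketch below only touches the two fields named in this section and leaves everything else unchanged. Whether other fields also need adjusting in policy mode is not covered by this card. The helper name `to_policy_config` is hypothetical.

```python
import copy

def to_policy_config(fd_config):
    """Derive a policy-mode input from a forward-dynamics input dict."""
    cfg = copy.deepcopy(fd_config)   # leave the caller's dict untouched
    cfg["model_mode"] = "policy"
    cfg.pop("action_path", None)     # policy mode takes no action input
    return cfg

fd_config = {
    "model_mode": "forward_dynamics",
    "name": "pick_apple",
    "vision_path": "/path/to/initial_clip.mp4",      # placeholder path
    "action_path": "/path/to/initial_actions.json",  # placeholder path
}
policy_config = to_policy_config(fd_config)
```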

Training Recipe

# examples/toml/sft_config/g1_action_sft_nano.toml
[job]
experiment = "g1_action_sft_nano"
project    = "cosmos3_g1"

[optimizer]
lr = 1.0e-5
keys_to_select = ["moe_gen", "time_embedder", "vae2llm", "llm2vae"]

[trainer]
max_iter = 1000

[checkpoint]
save_iter = 200
load_path = "${oc.env:BASE_CHECKPOINT_PATH}"

Launch:

export BASE_CHECKPOINT_PATH=examples/checkpoints/Cosmos3-Nano
export WAN_VAE_PATH=examples/checkpoints/wan22_vae/Wan2.2_VAE.pth
export G1_DATASETS_ROOT=/path/to/cosmos3g1dataset
export G1_NORM_STATS_PATH=examples/data/g1_brainco/action_stats.json

bash examples/launch_sft_g1_nano.sh

Citation

If you use this model, please cite the base Cosmos3 work:

@misc{cosmos3,
  title  = {Cosmos3: World Foundation Model for Physical AI},
  author = {NVIDIA},
  year   = {2026},
  url    = {https://github.com/nvidia-cosmos/cosmos3}
}

License

This model inherits the NVIDIA Open Model License (OpenMDW-1.1) from the base Cosmos3-Nano checkpoint.

Model size: 15B params (Safetensors, BF16)
