nanoG1 — ultra fast RL for robotics

nanoG1 is a walking policy for the Unitree G1 humanoid (29-DoF), trained from scratch with pure RL (no demonstrations, no reference gait, no motion capture) in 58.9 seconds on a single GPU. The first full walk in this project took 6.1 hours; over six days, specializing the simulator (24×) and the learning recipe (16×) cut that time by about 375×.

What it is

  • Task: velocity-command locomotion (track commanded forward/lateral velocity + yaw rate), legged-gym-style reward.
  • Net: PufferLib PufferNet — a MinGRU recurrent policy, hidden 128 × 3 layers, 163.9K parameters, continuous Gaussian head.
  • Physics: a G1-specialized CUDA engine, validated trajectory-by-trajectory against the MuJoCo C engine. Timestep 0.004 s × decimation 5 (0.02 s per control step, i.e. 50 Hz control), Newton solver (2 iterations, 3 line-search iterations).
  • Algorithm: PPO + V-trace + prioritized replay, Muon optimizer. Pure RL from scratch. The single biggest lever for getting under 60 s was a left↔right symmetry loss, which cut the samples needed by ~26% and smoothed the gait.
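
To illustrate what a left↔right symmetry loss can look like, here is a minimal sketch on a toy 4-joint robot. The text above does not give nanoG1's mirror tables, the loss weighting, or how the term is combined with the PPO objective, so `mirror`, `symmetry_loss`, the `PERM`/`SIGN` tables, and both toy policies are hypothetical. The sketch only shows the core idea: a mirrored observation should produce the mirrored action.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mirror(x: Sequence[float], perm: Sequence[int], sign: Sequence[float]) -> Vector:
    """Reflect a vector left<->right: out[i] = sign[i] * x[perm[i]]."""
    return [s * x[p] for p, s in zip(perm, sign)]


def symmetry_loss(
    policy: Callable[[Sequence[float]], Vector],
    obs_batch: Sequence[Sequence[float]],
    perm: Sequence[int],
    sign: Sequence[float],
) -> float:
    """Mean squared gap between policy(mirror(obs)) and mirror(policy(obs))."""
    total, count = 0.0, 0
    for obs in obs_batch:
        act_of_mirrored = policy(mirror(obs, perm, sign))
        mirrored_act = mirror(policy(obs), perm, sign)
        total += sum((a - b) ** 2 for a, b in zip(act_of_mirrored, mirrored_act))
        count += len(mirrored_act)
    return total / count


# Toy layout (assumption): [left_pitch, left_roll, right_pitch, right_roll].
# Mirroring swaps the two sides and flips the sign of the roll joints.
# In this toy, observation and action share the same layout and tables.
PERM = [2, 3, 0, 1]
SIGN = [1.0, -1.0, 1.0, -1.0]


def symmetric_policy(obs: Sequence[float]) -> Vector:
    # Treats every joint identically, so it commutes with mirroring.
    return [0.5 * v for v in obs]


def lopsided_policy(obs: Sequence[float]) -> Vector:
    # Only ever drives the left pitch joint: clearly not mirror-symmetric.
    return [obs[0], 0.0, 0.0, 0.0]


batch = [[1.0, 2.0, 3.0, 4.0], [0.1, -0.2, 0.3, 0.0]]
loss_symmetric = symmetry_loss(symmetric_policy, batch, PERM, SIGN)
loss_lopsided = symmetry_loss(lopsided_policy, batch, PERM, SIGN)
print(loss_symmetric, loss_lopsided)
```

The symmetric toy policy gives a loss of zero and the lopsided one a positive loss. In training, a term like this would typically be added to the policy loss with some weight, which is also an assumption here.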

Headline numbers (one RTX PRO 6000)

metric            value
time-to-walk      58.9 s
samples-to-walk   75M @ 1.28M SPS
cost              ~$0.17

Physics throughput (RTX PRO 6000, G1, identical settings): nanoG1 reaches 7.25M steps/s, versus 4.0M for mujoco_warp (1.8× faster), 2.3M for Genesis, and 1.1M for MJX. In its production config nanoG1 reaches 8.5M steps/s. Reproduce from a clean clone: modal run bench/bench_nanog1.py.
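
The quoted figures can be cross-checked with a few lines of arithmetic. This is only a sanity check on the rounded numbers stated in this card, so the results agree approximately rather than exactly; the variable names are illustrative.

```python
# Figures quoted in this card (all rounded).
first_walk_s = 6.1 * 3600      # first full walk: 6.1 hours, in seconds
time_to_walk_s = 58.9          # current time-to-walk
samples = 75e6                 # samples-to-walk
sps = 1.28e6                   # training samples per second

overall_speedup = first_walk_s / time_to_walk_s   # roughly 373, quoted as 375x
factor_product = 24 * 16                          # 384, from the rounded 24x and 16x
time_from_samples_s = samples / sps               # roughly 58.6 s, quoted as 58.9 s

# Physics throughput in steps per second on the same GPU.
throughput = {"nanoG1": 7.25e6, "mujoco_warp": 4.0e6, "Genesis": 2.3e6, "MJX": 1.1e6}
ratios = {
    name: throughput["nanoG1"] / value
    for name, value in throughput.items()
    if name != "nanoG1"
}

print(f"speedup ~{overall_speedup:.0f}x, factor product {factor_product}x")
print(f"75M samples at 1.28M SPS ~{time_from_samples_s:.1f} s")
for name, ratio in ratios.items():
    print(f"nanoG1 vs {name}: {ratio:.1f}x")
```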

Run it

  • In the browser: open the live demo above — drive the trained G1 with the arrow keys.
  • Train your own: run bash speedrun.sh in the repo. It goes env → engine → train → quality gate (~$0.17, one GPU).
  • On a real Unitree G1: deploy/ runs this policy at 50 Hz over Unitree's low-level DDS interface (unitree_sdk2py).
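
For orientation, a fixed-rate 50 Hz loop around a recurrent policy might be structured as below. This is a sketch, not the code in deploy/: `run_control_loop`, `read_obs`, `policy_step`, and `send_targets` are hypothetical callables standing in for the robot interface and the policy, and no unitree_sdk2py calls are shown.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def run_control_loop(
    read_obs: Callable[[], Sequence[float]],
    policy_step: Callable[[Sequence[float], Any], Tuple[Sequence[float], Any]],
    send_targets: Callable[[Sequence[float]], None],
    n_steps: int,
    rate_hz: float = 50.0,
) -> Any:
    """Run n_steps of sense -> act -> command at a fixed rate.

    The recurrent hidden state starts as None (episode start) and is then
    carried from one control step to the next.
    """
    period = 1.0 / rate_hz
    hidden = None
    next_tick = time.monotonic()
    for _ in range(n_steps):
        obs = read_obs()
        action, hidden = policy_step(obs, hidden)
        send_targets(action)
        # Sleep until the next tick; scheduling against an absolute deadline
        # keeps timing errors from accumulating across steps.
        next_tick += period
        remaining = next_tick - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
    return hidden
```

Scheduling against `time.monotonic()` rather than sleeping a fixed 20 ms keeps the average rate at 50 Hz even when a single step runs long. A real deployment would also need watchdogs and safety limits, which are out of scope here.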

I/O spec (to run inference)

  • Observation (98-d, float32): [0:3] base angular velocity ×0.25 · [3:6] projected gravity (base frame) · [6:9] command (vx, vy, yaw-rate) · [9:38] joint positions − keyframe (29) · [38:67] joint velocities ×0.05 (29) · [67:96] previous action (29) · [96:98] gait-phase clock sin/cos (period 40 control steps).
  • Action (29-d, float32, ∈ [-1,1]): joint-position targets key_qpos + 0.25 · action, fed to a Unitree-gain PD controller. The 12 leg DoF are actuated; waist + arms are held at the home pose.
  • Joint order = the menagerie Unitree G1 actuator order (left leg, right leg, waist, left arm, right arm).
  • The policy is recurrent (MinGRU): carry the hidden state across control steps; reset it only at episode start.
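
The layout above can be sketched in plain Python as follows. The slices and scale factors come from the spec; the function names, the gait clock starting at phase zero on step 0, clipping actions to [-1, 1], and treating the first 12 indices as the leg joints (left leg then right leg, 6 each) are assumptions made for illustration.

```python
import math
from array import array
from typing import Sequence

NUM_JOINTS = 29
NUM_LEG_JOINTS = 12        # assumption: legs occupy indices 0..11 in joint order
OBS_DIM = 98
GAIT_PERIOD_STEPS = 40     # control steps per gait-clock cycle
ACTION_SCALE = 0.25


def build_observation(
    ang_vel: Sequence[float],       # base angular velocity (3)
    proj_gravity: Sequence[float],  # gravity direction in the base frame (3)
    command: Sequence[float],       # (vx, vy, yaw-rate)
    qpos: Sequence[float],          # joint positions (29)
    qvel: Sequence[float],          # joint velocities (29)
    prev_action: Sequence[float],   # previous policy output (29)
    key_qpos: Sequence[float],      # keyframe joint positions (29)
    step: int,                      # control-step counter
) -> array:
    """Assemble the 98-d float32 observation in the documented order."""
    assert len(ang_vel) == len(proj_gravity) == len(command) == 3
    assert len(qpos) == len(qvel) == len(prev_action) == len(key_qpos) == NUM_JOINTS
    # Assumption: the clock phase is zero at step 0 and advances linearly.
    phase = 2.0 * math.pi * (step % GAIT_PERIOD_STEPS) / GAIT_PERIOD_STEPS
    obs = array("f")                                      # 'f' = 32-bit float
    obs.extend(0.25 * w for w in ang_vel)                 # [0:3]
    obs.extend(proj_gravity)                              # [3:6]
    obs.extend(command)                                   # [6:9]
    obs.extend(q - k for q, k in zip(qpos, key_qpos))     # [9:38]
    obs.extend(0.05 * v for v in qvel)                    # [38:67]
    obs.extend(prev_action)                               # [67:96]
    obs.extend((math.sin(phase), math.cos(phase)))        # [96:98]
    assert len(obs) == OBS_DIM
    return obs


def action_to_targets(action: Sequence[float], key_qpos: Sequence[float]) -> array:
    """Map a 29-d action to joint-position targets for the PD controller."""
    assert len(action) == len(key_qpos) == NUM_JOINTS
    targets = array("f", key_qpos)      # waist and arms stay at the home pose
    for i in range(NUM_LEG_JOINTS):
        clipped = max(-1.0, min(1.0, action[i]))   # clipping is an assumption
        targets[i] = key_qpos[i] + ACTION_SCALE * clipped
    return targets
```

With a 40-step period at 50 Hz control, one gait-clock cycle lasts 0.8 s. The values of `key_qpos` and the PD gains are not listed in this card and would have to come from the repo.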

Loading the weights

nanoG1.bin is a flat float32 PufferNet weight blob (not safetensors). Load it with the PufferLib inference path (vendor/PufferLib/src/puffernet.h: load_weights → make_puffernet). For a complete, self-contained CPU/WASM example (physics + policy, no MuJoCo/CUDA needed at inference) see web/g1_demo.c; for real-robot inference via a small ctypes shim see deploy/.
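
Before wiring up the C loader, it can be useful to inspect the blob from Python. The sketch below assumes the file is a headerless sequence of little-endian float32 values; the helper name `load_float32_blob` is hypothetical, and how the values map onto layers is defined by puffernet.h and not reproduced here.

```python
import sys
from array import array
from pathlib import Path


def load_float32_blob(path: str) -> array:
    """Read a flat file of float32 values into an array('f')."""
    data = Path(path).read_bytes()
    if len(data) % 4 != 0:
        raise ValueError(f"{path}: size {len(data)} is not a multiple of 4 bytes")
    weights = array("f")
    weights.frombytes(data)
    # Assumption: the file is little-endian; swap on big-endian hosts.
    if sys.byteorder != "little":
        weights.byteswap()
    return weights


if __name__ == "__main__":
    blob = load_float32_blob("nanoG1.bin")
    print(f"{len(blob)} float32 values")
```

If the blob holds only the network parameters, the count should come out near the 163.9K quoted above, which corresponds to a file of roughly 656 KB.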


Built on PufferLib — its compile-per-robot specialization is what makes this speed possible. MIT licensed.
