nanoG1 — ultra fast RL for robotics
nanoG1 is a walking policy for the Unitree G1 humanoid (29-DoF), trained
from scratch with pure RL — no demonstrations, no reference gait, no motion
capture — in 58.9 seconds on a single GPU. The first full walk in this project
took 6.1 hours; specializing the simulator (24×) and the learning recipe (16×) cut
that **375×** over six days.
- ▶ Live demo (physics + policy in your browser): https://nanog1.com
- 💻 Code (train your own in <60s, one command): https://github.com/kingjulio8238/nanoG1
- 📑 How it works (the full story, in plots): https://juliansaks.com/feed/nanog1
What it is
- Task: velocity-command locomotion (track commanded forward/lateral velocity + yaw rate), legged-gym-style reward.
- Net: PufferLib `PufferNet`, a MinGRU recurrent policy, hidden 128 × 3 layers, 163.9K parameters, continuous Gaussian head.
- Physics: a G1-specialized CUDA engine, validated trajectory-by-trajectory against the MuJoCo C engine. dt 0.004 × decimation 5 (50 Hz control), Newton solver (2 iters / 3 line-search).
- Algorithm: PPO + V-trace + prioritized replay, Muon optimizer. Pure RL from scratch. The single biggest lever to sub-60s was a left↔right symmetry loss (it cut samples ~26% and smoothed the gait).
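A left↔right symmetry loss can take several forms, and this section does not say which one nanoG1 uses. As a rough illustration only, the sketch below shows one common formulation: penalize the gap between the action for a mirrored observation and the mirrored action for the original observation. The `mirror` permutation/sign tables, the two-joint toy robot, and both toy policies are hypothetical and are not taken from the repo.

```python
def mirror(vec, perm, sign):
    """Left-right mirror of vec: out[i] = sign[i] * vec[perm[i]]."""
    return [s * vec[p] for p, s in zip(perm, sign)]


def symmetry_loss(policy, obs, obs_perm, obs_sign, act_perm, act_sign):
    """Mean squared gap between policy(mirror(obs)) and mirror(policy(obs))."""
    act_of_mirrored = policy(mirror(obs, obs_perm, obs_sign))
    mirrored_act = mirror(policy(obs), act_perm, act_sign)
    gaps = [(a - b) ** 2 for a, b in zip(act_of_mirrored, mirrored_act)]
    return sum(gaps) / len(gaps)


# Hypothetical toy robot with one left and one right joint:
# mirroring simply swaps the two entries and keeps their signs.
SWAP, KEEP_SIGN = [1, 0], [1.0, 1.0]


def symmetric_policy(obs):
    return [-0.5 * obs[0], -0.5 * obs[1]]  # treats both sides alike


def lopsided_policy(obs):
    return [-0.5 * obs[0], 0.0]  # ignores the right joint


obs = [0.2, -0.4]
loss_sym = symmetry_loss(symmetric_policy, obs, SWAP, KEEP_SIGN, SWAP, KEEP_SIGN)
loss_lop = symmetry_loss(lopsided_policy, obs, SWAP, KEEP_SIGN, SWAP, KEEP_SIGN)
print(loss_sym, loss_lop)
```

A policy that treats both sides alike scores zero, while the lopsided one is penalized. In training, a term like this would typically be added to the PPO objective with a small weight.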
Headline numbers (one RTX PRO 6000)
| metric | value |
|---|---|
| time-to-walk | 58.9 s |
| samples-to-walk | 75M @ 1.28M SPS |
| cost | ~$0.17 |
Physics throughput (RTX PRO 6000, G1, identical settings): nanoG1 7.25M steps/s vs
mujoco_warp 4.0M (1.8×) / Genesis 2.3M / MJX 1.1M — and 8.5M in its production config.
Reproduce from a clean clone: modal run bench/bench_nanog1.py.
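The headline figures above can be cross-checked against each other with a few divisions. The snippet below uses only numbers quoted in this README. The Genesis and MJX ratios are derived here and are not stated in the text.

```python
# Time-to-walk implied by the sample count and training throughput.
samples_to_walk = 75e6
train_sps = 1.28e6
implied_seconds = samples_to_walk / train_sps

# End-to-end speedup from the first 6.1-hour walk to the 58.9-second run.
first_walk_seconds = 6.1 * 3600
speedup = first_walk_seconds / 58.9

# Physics throughput relative to the other engines (steps per second).
nanog1_steps = 7.25e6
baselines = {"mujoco_warp": 4.0e6, "Genesis": 2.3e6, "MJX": 1.1e6}
ratios = {name: nanog1_steps / steps for name, steps in baselines.items()}

print(f"implied time-to-walk: {implied_seconds:.1f} s")
print(f"speedup vs first walk: {speedup:.0f}x")
for name, ratio in ratios.items():
    print(f"nanoG1 vs {name}: {ratio:.1f}x")
```

The implied time lands within a fraction of a second of the reported 58.9 s, and the end-to-end speedup comes out close to the rounded 375×.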
Run it
- In the browser: open the live demo above — drive the trained G1 with the arrow keys.
- Train your own: `bash speedrun.sh` in the repo (env → engine → train → quality gate; ~$0.17, one GPU).
- On a real Unitree G1: `deploy/` runs this policy at 50 Hz over Unitree's low-level DDS interface (`unitree_sdk2py`).
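For readers wiring up their own 50 Hz loop, here is a minimal fixed-rate scheduler sketch. It is not the code in `deploy/` and makes no `unitree_sdk2py` calls: `run_fixed_rate`, `step`, and the simulated clock are hypothetical names for illustration. On hardware, `step` would read sensors, run the policy, and send joint targets.

```python
import time

CONTROL_HZ = 50
PERIOD_S = 1.0 / CONTROL_HZ  # 0.02 s per control step


def run_fixed_rate(step, n_steps, period=PERIOD_S,
                   clock=time.monotonic, sleep=time.sleep):
    """Call step(i) once per period against absolute deadlines, so timing
    error does not accumulate. Returns the number of missed deadlines."""
    overruns = 0
    next_deadline = clock()
    for i in range(n_steps):
        step(i)
        next_deadline += period
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
        else:
            overruns += 1
            next_deadline = clock()  # fell behind: resynchronize, do not burst
    return overruns


class SimClock:
    """Simulated clock so the loop can be dry-run without waiting."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, dt):
        self.t += dt


sim = SimClock()
start_times = []


def fake_step(i):
    start_times.append(sim.now())
    sim.t += 0.005  # pretend each step costs 5 ms of compute


missed = run_fixed_rate(fake_step, 5, clock=sim.now, sleep=sim.sleep)
print(start_times, missed)
```

Scheduling against absolute deadlines (rather than sleeping a fixed 20 ms after each step) keeps the average rate at 50 Hz even when the step time varies.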
I/O spec (to run inference)
- Observation (98-d, float32): `[0:3]` base angular velocity ×0.25 · `[3:6]` projected gravity (base frame) · `[6:9]` command (vx, vy, yaw-rate) · `[9:38]` joint positions − keyframe (29) · `[38:67]` joint velocities ×0.05 (29) · `[67:96]` previous action (29) · `[96:98]` gait-phase clock sin/cos (period 40 control steps).
- Action (29-d, float32, ∈ [-1,1]): joint-position targets `key_qpos + 0.25 · action`, fed to a Unitree-gain PD controller. The 12 leg DoF are actuated; waist + arms are held at the home pose.
- Joint order = the menagerie Unitree G1 actuator order (left leg, right leg, waist, left arm, right arm).
- The policy is recurrent (MinGRU): carry the hidden state across control steps; reset it only at episode start.
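The I/O spec above can be sketched in plain Python as follows. This is an illustrative reading of the spec, not code from the repo, and it rests on several assumptions: the gait clock starts at phase zero on step 0, actions are clipped to [-1, 1] before scaling, the legs are the first 12 entries of the joint order, and the home pose equals `key_qpos`. The helper names `build_observation` and `action_to_targets` are hypothetical.

```python
import math

NUM_JOINTS = 29
NUM_LEG_DOF = 12          # assumption: legs are the first 12 joints in the order
GAIT_PERIOD_STEPS = 40    # 40 control steps = 0.8 s per cycle at 50 Hz
ACTION_SCALE = 0.25
OBS_DIM = 98


def build_observation(ang_vel, gravity, command, qpos, key_qpos,
                      qvel, prev_action, step):
    """Assemble the 98-d observation in the slice order listed above."""
    # Assumption: phase is zero at step 0, and sin precedes cos.
    phase = 2.0 * math.pi * (step % GAIT_PERIOD_STEPS) / GAIT_PERIOD_STEPS
    obs = (
        [0.25 * w for w in ang_vel]                  # [0:3]
        + list(gravity)                              # [3:6]
        + list(command)                              # [6:9]
        + [q - k for q, k in zip(qpos, key_qpos)]    # [9:38]
        + [0.05 * v for v in qvel]                   # [38:67]
        + list(prev_action)                          # [67:96]
        + [math.sin(phase), math.cos(phase)]         # [96:98]
    )
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM} values, got {len(obs)}")
    return obs


def action_to_targets(action, key_qpos):
    """Map a policy action to joint-position targets for the PD controller."""
    targets = list(key_qpos)  # waist + arms stay at the home pose (assumed key_qpos)
    for j in range(NUM_LEG_DOF):
        clipped = max(-1.0, min(1.0, action[j]))  # assumption: clip to [-1, 1]
        targets[j] = key_qpos[j] + ACTION_SCALE * clipped
    return targets


key_qpos = [0.1] * NUM_JOINTS
obs = build_observation(
    ang_vel=[1.0, 0.0, 0.0], gravity=[0.0, 0.0, -1.0], command=[0.5, 0.0, 0.0],
    qpos=[0.3] * NUM_JOINTS, key_qpos=key_qpos, qvel=[2.0] * NUM_JOINTS,
    prev_action=[0.0] * NUM_JOINTS, step=10,
)
targets = action_to_targets([2.0] + [0.0] * (NUM_JOINTS - 1), key_qpos)
print(len(obs), obs[96:98], targets[0], targets[12])
```

The network call itself is left out. Whatever runs it must pass the MinGRU hidden state from one control step to the next and zero it only at episode start.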
Loading the weights
nanoG1.bin is a flat float32 PufferNet weight blob (not safetensors). Load it with the
PufferLib inference path (vendor/PufferLib/src/puffernet.h: load_weights → make_puffernet).
For a complete, self-contained CPU/WASM example (physics + policy, no MuJoCo/CUDA needed at
inference) see web/g1_demo.c; for real-robot inference via a small ctypes shim see deploy/.
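To inspect the blob from Python before wiring up the C inference path, a flat float32 file can be read with the standard library alone. The sketch below assumes the file has no header and is little-endian. Neither is confirmed here, and the per-tensor layout is defined by `puffernet.h`, not by this snippet. `load_flat_float32` is a hypothetical helper, and the demo reads a small temporary file rather than the real nanoG1.bin.

```python
import array
import os
import struct
import sys
import tempfile


def load_flat_float32(path):
    """Read a headerless float32 blob into an array of Python floats."""
    size = os.path.getsize(path)
    if size % 4 != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of 4 bytes")
    weights = array.array("f")
    with open(path, "rb") as f:
        weights.fromfile(f, size // 4)
    if sys.byteorder != "little":
        weights.byteswap()  # assumption: the blob is stored little-endian
    return weights


# Demo on a tiny stand-in file with three known values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<3f", 0.5, -1.25, 3.0))
    loaded = load_flat_float32(demo_path)

print(len(loaded), loaded.tolist())
```

If the blob holds only the network parameters, `len(load_flat_float32("nanoG1.bin"))` should come out near the 163.9K parameters quoted above, which makes a quick sanity check on a download.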
Built on PufferLib — its compile-per-robot specialization is what makes this speed possible. MIT licensed.