silent - JEPA world model that plays predator by listening

A 13M-parameter Joint Embedding Predictive Architecture (JEPA) trained to predict next-step audio embeddings on a custom predator-prey environment. The predator senses the world through four cardioid microphones (N/E/S/W) on its body and chooses thrust + sonar ping actions to hunt the player.

Architecture

  • ViT-Tiny encoder (4-channel input, trained from scratch, ~6M params)
  • Linear action encoder (frameskip x 3 -> 192)
  • 6-layer AR causal transformer predictor with AdaLN-zero conditioning
  • 192 -> 2048 -> 192 projector MLP with BatchNorm
  • SIGReg regularizer on projected embeddings
  • Jointly-trained state head MLP (192 -> 256 -> 256 -> 8) at lambda=10
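For readers unfamiliar with AdaLN-zero, here is a minimal PyTorch sketch of one causal predictor block in the DiT style. It is an illustration, not the code in the repository: the head count, the MLP ratio, and the use of the per-step action embedding as the conditioning signal are all assumptions. The point it shows is the "zero" part: the modulation layer is zero-initialized, so every gate starts at 0 and the block is the identity at initialization.

```python
import torch
import torch.nn as nn


class AdaLNZeroBlock(nn.Module):
    """One causal transformer block with AdaLN-zero conditioning (sketch)."""

    def __init__(self, dim: int = 192, heads: int = 3, mlp_ratio: int = 4):
        super().__init__()
        # LayerNorms have no learned affine; scale/shift come from the condition.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # Maps the condition to (shift, scale, gate) for each of the two sub-layers.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # Zero init: all shifts, scales, and gates start at 0.
        nn.init.zeros_(self.modulation[1].weight)
        nn.init.zeros_(self.modulation[1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x, cond: (batch, time, dim). cond is assumed to be the action embedding.
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.modulation(cond).chunk(6, dim=-1)
        )
        t = x.shape[1]
        # True above the diagonal: step i may not attend to any step j > i.
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x) * (1 + scale1) + shift1
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + gate1 * attn_out
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)
```

Stacking six such blocks would match the layer count listed above; everything else about the real predictor should be checked against the repository.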

Total: ~13M parameters. Runs at ~10 Hz on a single shared vCPU.
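To give a rough sense of where the parameters go, the three small components from the list can be sketched and counted in PyTorch. This is a sketch under stated assumptions: the layer widths come from the list above, but `FRAMESKIP = 4`, the GELU activations, and the position of the BatchNorm layer are guesses, not values taken from the repository.

```python
import torch
import torch.nn as nn

EMBED_DIM = 192
FRAMESKIP = 4   # hypothetical value; the card does not state the frameskip
ACTION_DIM = 3  # from "frameskip x 3" in the architecture list

# Linear action encoder: flattened (frameskip x 3) action chunk -> 192.
action_encoder = nn.Linear(FRAMESKIP * ACTION_DIM, EMBED_DIM)

# Projector MLP 192 -> 2048 -> 192. BatchNorm placement and GELU are assumptions.
projector = nn.Sequential(
    nn.Linear(EMBED_DIM, 2048),
    nn.BatchNorm1d(2048),
    nn.GELU(),
    nn.Linear(2048, EMBED_DIM),
)

# State head MLP 192 -> 256 -> 256 -> 8. The activation is an assumption.
state_head = nn.Sequential(
    nn.Linear(EMBED_DIM, 256),
    nn.GELU(),
    nn.Linear(256, 256),
    nn.GELU(),
    nn.Linear(256, 8),
)


def count_params(module: nn.Module) -> int:
    """Number of learnable parameters (buffers such as BN running stats excluded)."""
    return sum(p.numel() for p in module.parameters())


for name, module in [
    ("action_encoder", action_encoder),
    ("projector", projector),
    ("state_head", state_head),
]:
    print(f"{name}: {count_params(module):,} params")
```

Under these assumptions the projector has 792,768 parameters and the state head 117,256, so together with the action encoder they stay below 1M. The bulk of the ~13M total therefore sits in the encoder and the predictor.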

Files

File                       Purpose
silent_v1_3e_ep030.pt      Shipping checkpoint -- joint DexWM, lambda=10
3e_ep030_head_uniform.pt   Post-hoc state head for planner CEM cost
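The second checkpoint is described as feeding a CEM cost, where CEM presumably stands for the cross-entropy method. As background, here is a generic NumPy sketch of CEM over action sequences. It is not the planner from the repository: `cem_plan`, `toy_cost`, the action bounds, and all hyperparameters are hypothetical. In the real system the cost would presumably be computed from state-head outputs on predicted embeddings, which a toy quadratic stands in for here.

```python
import numpy as np


def cem_plan(cost_fn, horizon, action_dim, iters=20, pop=256, n_elite=32, seed=0):
    """Cross-entropy method: iteratively refit a Gaussian to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        samples = mean + std * rng.standard_normal((pop, horizon, action_dim))
        samples = np.clip(samples, -1.0, 1.0)  # assumed action bounds
        costs = np.array([cost_fn(s) for s in samples])
        # Keep the n_elite lowest-cost candidates and refit mean and std to them.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


# Toy stand-in cost: squared distance to a fixed target action sequence.
target = np.full((5, 3), 0.5)


def toy_cost(actions):
    return float(np.sum((actions - target) ** 2))


plan = cem_plan(toy_cost, horizon=5, action_dim=3)
print(np.round(plan, 2))
```

A planner of this kind would typically execute only the first action of the returned sequence and replan at the next step.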

Quick start

pip install torch torchvision timm einops fastapi uvicorn websockets \
    librosa pymunk h5py pygame scipy

# Download checkpoints
huggingface-cli download sotoalt/silent --local-dir checkpoints/

# Clone the code
git clone https://github.com/SotoAlt/silent.git
cd silent

# Run the inference server
python -m world_model.infer_silent_env \
    --jepa-ckpt checkpoints/silent_v1_3e_ep030.pt \
    --jepa-head checkpoints/3e_ep030_head_uniform.pt \
    --host 0.0.0.0 --port 8801

# Open http://localhost:8801/ in a browser. WASD to move, space to voice.
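Loading the checkpoints can take a moment before the server accepts connections. A small standard-library helper like the one below can confirm that the port from the command above is reachable before opening the browser. `wait_for_port` is a hypothetical helper, not part of the repository, and it only checks that a TCP connection succeeds, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry shortly.
            time.sleep(0.5)
    return False


# Example: wait_for_port("localhost", 8801) after starting the inference server.
```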

Training

The full pipeline (data generation, pure-LeWM smoke test, preflight v2 probe, joint DexWM validation gate, full 100-epoch run, post-hoc head, ship audit) is documented in the research journal and the README.

Related work

  • LeWM (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026) - arXiv 2603.19312
  • DexWM - arXiv 2512.13644 (the joint state-head technique)
  • V-JEPA 2-AC (FAIR, 2025)

License

MIT
