RoboMind VLA: Robot Locomotion Reward Judge
A vision-language reward model for robot locomotion quality assessment, fine-tuned from MiniCPM-V-2.6 using LoRA. Combines VLM understanding with physics-based normalization for robust scoring across 5 MuJoCo environments.
Credits
- Built with: OpenAI Codex (code generation, architecture design, debugging, and iteration throughout the entire project)
- Base model: MiniCPM-V-2.6 by OpenBMB
- Training framework: Hugging Face Transformers + PEFT (LoRA)
- Infrastructure: Modal (serverless GPU/CPU)
- Physics engine: MuJoCo via Minari datasets
- Audio analysis: librosa
Results
| Metric | Score |
|---|---|
| Hybrid Spearman correlation | 0.951 |
| Rule-based Spearman | 0.976 |
| Tier separation (expert - simple) | 0.371 |
| Expert mean reward | 0.915 |
| Medium mean reward | 0.717 |
| Simple mean reward | 0.544 |
| Test battery: expert range | 0.948 - 0.975 |
| Test battery: simple range | 0.025 - 0.245 |
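The Spearman figures above measure how well the judge's ranking of episodes agrees with the ground-truth ranking. A minimal pure-Python sketch of that metric follows; the function names and the sample numbers are illustrative assumptions, not the project's validation code or data.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Illustrative numbers only, not taken from the validation run.
predicted = [0.92, 0.71, 0.55, 0.60, 0.95]
ground_truth = [9500.0, 7000.0, 4200.0, 4000.0, 9900.0]
print(round(spearman(predicted, ground_truth), 3))
```

Because only ranks matter, the predicted rewards and the raw episode returns can be on different scales.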
Features
- Hybrid scoring: 95% physics-based rule normalization + 5% VLM qualitative analysis
- 5 MuJoCo environments: humanoid, walker2d, ant, hopper, halfcheetah, each at three policy tiers (expert/medium/simple)
- Sound detection: Audio-based fall detection and gait analysis via librosa
- Web UI: FastAPI + HTML/JS interface on Modal GPU
- Fully serverless: All computation runs on Modal (GPU/CPU)
Quick Start
pip install robomind-vla
from robomind import RoboMindJudge, hybrid_judge
# VLM-only judgment
judge = RoboMindJudge()
judge.load()
result = judge.judge_from_paths(["frame1.jpg", "frame2.jpg", "frame3.jpg"])
# Hybrid scoring (with physics data)
from robomind.hybrid import hybrid_judge, hybrid_to_dict
score = hybrid_judge(
    vlm_parsed=result,
    ep_return=8000, min_return=4000, max_return=10000,
    fell=False, tier="medium", env="walker2d",
)
print(hybrid_to_dict(score))
Project Structure
robomind/
├── robomind/ # Installable Python package
│   ├── __init__.py
│   ├── judge.py # Core VLM judge class
│   ├── hybrid.py # Hybrid VLM + rule-based scoring
│   └── sound.py # Audio-based fall/gait detection
├── app.py # FastAPI web UI (Modal deployment)
├── hybrid_judge.py # Standalone hybrid judge (used by app.py)
├── data_gen_all_modal.py # Data generation (15 env combos x 20 episodes)
├── dataset_build_v2.py # Dataset builder with visual analysis
├── finetune_modal.py # LoRA fine-tune on Modal GPU
├── validation.py # Validation suite on Modal GPU
├── sound_detection.py # Sound detection on Modal
├── tests_comprehensive.py # 18 unit/integration tests
├── pyproject.toml # Package config
└── LICENSE # MIT License
HF Hub Repos
| Repo | Description |
|---|---|
| mitvho09/robomind-rollouts | 300 rollout videos + metadata |
| mitvho09/robomind-loco-judge-dataset | 300 training samples with keyframes + judgments |
| mitvho09/robomind-minicpm-loco-lora | LoRA adapter (rank=64, 7 modules) |
Environments
| Environment | Expert | Medium | Simple |
|---|---|---|---|
| humanoid | 20 eps | 20 eps | 20 eps |
| walker2d | 20 eps | 20 eps | 20 eps |
| ant | 20 eps | 20 eps | 20 eps |
| hopper | 20 eps | 20 eps | 20 eps |
| halfcheetah | 20 eps | 20 eps | 20 eps |
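The table works out to 15 environment/tier combinations and 300 rollouts, which matches the sizes of the Hub repos. A small sketch that enumerates the grid is shown here; the variable names are illustrative and not taken from `data_gen_all_modal.py`.

```python
from itertools import product

ENVS = ["humanoid", "walker2d", "ant", "hopper", "halfcheetah"]
TIERS = ["expert", "medium", "simple"]
EPISODES_PER_COMBO = 20

# One (env, tier, episode_index) triple per rollout.
rollout_grid = [
    (env, tier, ep)
    for env, tier in product(ENVS, TIERS)
    for ep in range(EPISODES_PER_COMBO)
]

print(len(ENVS) * len(TIERS))  # 15 env/tier combinations
print(len(rollout_grid))       # 300 rollouts in total
```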
How It Works
1. Data Generation (Modal CPU)
- Downloads Minari expert/medium/simple datasets
- Reconstructs states via `set_state()` (no open-loop replay)
- Renders `.mp4` videos with MuJoCo physics
2. Training (Modal GPU)
- Extracts 6 keyframes per episode
- Derives judgment JSON (stability, gait_quality, predicted_reward, etc.)
- Fine-tunes MiniCPM-V-2.6 with LoRA (rank=64, alpha=128, 7 target modules)
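As a rough illustration of the keyframe step, the sketch below picks 6 frame indices from an episode. Even spacing that includes the first and last frame is an assumption here; the README does not say how the dataset builder selects keyframes, and `keyframe_indices` is a hypothetical helper.

```python
def keyframe_indices(num_frames: int, num_keyframes: int = 6) -> list[int]:
    """Pick evenly spaced frame indices that include the first and last frame.

    Assumption: even spacing. The real selection rule may differ.
    """
    if num_frames <= 0 or num_keyframes <= 0:
        raise ValueError("num_frames and num_keyframes must be positive")
    if num_frames <= num_keyframes:
        # Short episode: use every frame that exists.
        return list(range(num_frames))
    if num_keyframes == 1:
        return [0]
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]


print(keyframe_indices(101))  # [0, 20, 40, 60, 80, 100]
print(keyframe_indices(4))    # [0, 1, 2, 3]
```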
3. Hybrid Scoring
Final Score = 0.95 * Rule_score + 0.05 * VLM_score
- Rule score: Physics-based return normalization with per-env calibration and tier adjustments
- VLM score: Combines stability assessment, gait quality, anomaly detection
- Tier adjustments: expert=0, medium=-0.15, simple=-0.35
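The weighting can be sketched in a few lines. The 0.95/0.05 weights and the tier adjustments come from this section; everything else is an assumption made for illustration: min-max normalization of the return, adding the tier adjustment to the normalized value, clamping to [0, 1], and the function names. Per-env calibration is omitted, and this is not the code in `hybrid.py`.

```python
TIER_ADJUSTMENT = {"expert": 0.0, "medium": -0.15, "simple": -0.35}
RULE_WEIGHT, VLM_WEIGHT = 0.95, 0.05


def clamp01(x: float) -> float:
    """Clamp a value to the [0, 1] range."""
    return max(0.0, min(1.0, x))


def rule_score(ep_return: float, min_return: float, max_return: float, tier: str) -> float:
    """Min-max normalize the episode return, then apply the tier adjustment.

    Assumption: the adjustment is added to the normalized return.
    """
    if max_return <= min_return:
        raise ValueError("max_return must be greater than min_return")
    normalized = clamp01((ep_return - min_return) / (max_return - min_return))
    return clamp01(normalized + TIER_ADJUSTMENT[tier])


def hybrid_score(rule: float, vlm: float) -> float:
    """Final Score = 0.95 * Rule_score + 0.05 * VLM_score."""
    return RULE_WEIGHT * rule + VLM_WEIGHT * clamp01(vlm)


# Same inputs as the Quick Start call; the VLM score of 0.8 is made up.
rule = rule_score(ep_return=8000, min_return=4000, max_return=10000, tier="medium")
print(round(hybrid_score(rule, vlm=0.8), 3))
```

With only 5% of the weight, the VLM term can move the final score by at most 0.05, so ranking is dominated by the rule score.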
4. Sound Detection
- Extracts audio from rollout videos
- Detects impacts, motor strain, gait rhythm
- Provides fall confidence score (penalizes reward when fall detected)
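One way a fall confidence could feed back into the reward is sketched below. The threshold, the penalty strength, and the multiplicative form are all hypothetical; the README only states that the reward is penalized when a fall is detected, and `apply_fall_penalty` is not a function from the package.

```python
def apply_fall_penalty(
    reward: float,
    fall_confidence: float,
    threshold: float = 0.5,    # hypothetical: below this, no fall is assumed
    max_penalty: float = 0.5,  # hypothetical: fraction removed at confidence 1.0
) -> float:
    """Scale the reward down in proportion to the fall confidence."""
    if not 0.0 <= fall_confidence <= 1.0:
        raise ValueError("fall_confidence must be in [0, 1]")
    if fall_confidence < threshold:
        return reward
    return max(0.0, reward * (1.0 - max_penalty * fall_confidence))


print(apply_fall_penalty(0.9, fall_confidence=0.2))  # below threshold, unchanged
print(apply_fall_penalty(0.9, fall_confidence=1.0))  # confident fall, reduced
```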
Running on Modal
# Generate data
modal run --detach data_gen_all_modal.py
# Build dataset
modal run dataset_build_v2.py
# Train (50 epochs, LoRA r=64)
modal run --detach finetune_modal.py::full_train
# Validate
modal run validation.py
# Deploy web UI
modal deploy app.py
Tests
python -m pytest tests_comprehensive.py -v
# 18/18 pass
License
MIT