SmolVLA Model - First Edition (alphabot_smolvla_1st_edition)

A fine-tuned Vision-Language-Action (VLA) model trained with LeRobot for robot control tasks.

Model Description

This model fine-tunes SmolVLA on the alphabot2 robot dataset. It takes camera images, proprioceptive state, and a language instruction as input and predicts robot control actions.

Model Architecture

Base Architecture: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
Type: Vision-Language-Action (VLA) Policy
Total Parameters: 450M
Learnable Parameters: 100M
Framework: LeRobot

Training Details

Dataset

Dataset ID: alphabot2/ai2robot_full_900_episodes_10fps_JC_intern_matched_from_full
Total Frames: 163,545
Total Episodes: 831
Video FPS: 10 fps
Chunk Size: 50 frames
Action Horizon: 50 steps
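
As a rough sanity check on the figures above, the average episode length and the time covered by one action chunk follow directly from the frame count, episode count, and frame rate. The sketch below only performs that arithmetic; the variable names are illustrative and are not part of LeRobot.

```python
# Dataset figures copied from the list above.
TOTAL_FRAMES = 163_545
TOTAL_EPISODES = 831
FPS = 10
CHUNK_SIZE = 50  # frames per action chunk

avg_frames_per_episode = TOTAL_FRAMES / TOTAL_EPISODES
avg_episode_seconds = avg_frames_per_episode / FPS
chunk_seconds = CHUNK_SIZE / FPS
total_hours = TOTAL_FRAMES / FPS / 3600

print(f"average episode: {avg_frames_per_episode:.1f} frames ({avg_episode_seconds:.1f} s)")
print(f"one action chunk covers {chunk_seconds:.1f} s")
print(f"total demonstration time: {total_hours:.2f} h")
```

At 10 fps, one 50-step chunk spans 5 seconds, roughly a quarter of an average episode.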

Training Configuration

Training Steps: 100,000
Batch Size: 8 (chosen to fit an RTX 3070 with 8GB VRAM)
Optimizer: AdamW
Learning Rate: 1e-4
Scheduler: Cosine decay with warmup
Mixed Precision: AMP (Automatic Mixed Precision) enabled
Training Time: ~48 hours on NVIDIA RTX 3070 Laptop
Grad Clip Norm: 1.0
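
To make the scheduler entry concrete, here is a minimal sketch of a cosine decay schedule with linear warmup, written in plain Python. The peak learning rate and step count come from the list above; the warmup length and the decay floor are not stated in this card, so `WARMUP_STEPS` and `MIN_LR` are assumptions chosen only for illustration. This is not LeRobot's scheduler implementation.

```python
import math

PEAK_LR = 1e-4          # from the configuration above
TOTAL_STEPS = 100_000   # from the configuration above
WARMUP_STEPS = 1_000    # assumption: warmup length is not stated in this card
MIN_LR = 0.0            # assumption: decay floor is not stated in this card


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay towards MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 50_500, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate reaches its peak at the end of warmup, is halved at the midpoint of the decay phase, and approaches the floor at the final step.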

Preprocessing

Video Backend: torchcodec with pyav fallback
Input Normalization: Uses statistics computed from the training dataset
Output Denormalization: Inverts the action normalization using the training-set action statistics
Resolution: 224x224 (standard for SmolVLM)
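
The normalization and denormalization steps above are inverses of each other. The sketch below illustrates the idea, assuming mean/std normalization; the exact normalization mode and the statistics used by this model live in the preprocessor and postprocessor files, so the function names and the numbers here are hypothetical.

```python
from typing import Sequence


def normalize(values: Sequence[float], mean: Sequence[float],
              std: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Map raw values to roughly zero mean and unit variance."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def denormalize(values: Sequence[float], mean: Sequence[float],
                std: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Invert normalize() using the same statistics."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical statistics for a 2-dimensional action.
action_mean = [0.0, 0.5]
action_std = [1.0, 0.25]

raw_action = [0.2, 1.0]
model_space = normalize(raw_action, action_mean, action_std)
restored = denormalize(model_space, action_mean, action_std)
print(model_space, restored)
```

The policy works in the normalized space, so actions must pass through the inverse step before they are sent to the robot.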

Usage

Loading the Model

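import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the fine-tuned policy from the Hugging Face Hub.
# Use the concrete SmolVLAPolicy class; PreTrainedPolicy is an abstract base class.
policy = SmolVLAPolicy.from_pretrained(
    "alphabot2/alphabot_smolvla_1st_edition"
)

# Move to the GPU when available and switch to evaluation mode (important!)
device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.eval()

# Run inference. observation_dict is assumed to be a preprocessed batch
# (camera images, proprioceptive state, task instruction) on the same device.
with torch.no_grad():
    action = policy.select_action(observation_dict)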

Inference Requirements

PyTorch with CUDA support (GPU recommended)
LeRobot library
torchcodec for video processing
~2GB GPU VRAM minimum for inference
Input observation must include visual data and proprioceptive state
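
A quick pre-flight check can catch a missing input before the policy is called. The sketch below assumes the key naming commonly used in LeRobot datasets (`observation.images.<camera>`, `observation.state`, and a `task` string); these key names and the helper `missing_observation_fields` are assumptions for illustration, so check them against config.json for this model.

```python
def missing_observation_fields(observation: dict) -> list[str]:
    """Return descriptions of the required inputs that are absent."""
    missing = []
    # Visual data: at least one camera stream (assumed key prefix).
    if not any(key.startswith("observation.images.") for key in observation):
        missing.append("at least one 'observation.images.<camera>' entry")
    # Proprioceptive state (assumed key name).
    if "observation.state" not in observation:
        missing.append("'observation.state'")
    # Language instruction (assumed key name).
    if "task" not in observation:
        missing.append("'task'")
    return missing


# Placeholder values stand in for real tensors; "front" is a hypothetical camera name.
example = {
    "observation.images.front": None,
    "observation.state": None,
    "task": "pick up the cube",
}
print(missing_observation_fields(example))
print(missing_observation_fields({"observation.state": None}))
```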

Model Card Details

Intended Use

Primary: Robot control and imitation learning via vision and language
Supported Tasks: Robot manipulation tasks in the alphabot environment
Training Data: Demonstrations collected from human operators

Limitations

Model is specialized for the alphabot2 robot platform
Performance on out-of-distribution scenarios may be limited
Requires proper observation preprocessing (normalization, etc.)

Ethical Considerations

This is an imitation learning model trained on human demonstrations
Use only in controlled research/educational environments
Not intended for autonomous systems without human oversight
Ensure compliance with local regulations for robot operation

Training Procedure

The model was trained using LeRobot's standard training pipeline:

Data Loading: Video frames processed with torchcodec backend
Error Handling: Corrupted samples automatically skipped during training
Batch Processing: 8 samples per batch with gradient accumulation
Loss Function: Flow-matching imitation loss on action chunks (the standard SmolVLA objective)
Evaluation: Periodic evaluation every 1,000 steps
Checkpointing: Saved every 5,000 steps

All training artifacts include proper preprocessor/postprocessor configurations for handling input normalization and output denormalization.
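
The schedule above implies a fixed number of evaluations and checkpoints, and a rough number of passes over the data. The sketch below works these out, assuming each training step consumes one batch of 8 samples and each frame counts as one sample; the gradient accumulation factor is not stated in this card, so the epoch estimate is a lower bound on the data actually seen per optimizer update.

```python
# Figures copied from this card.
TRAINING_STEPS = 100_000
BATCH_SIZE = 8
EVAL_EVERY = 1_000
SAVE_EVERY = 5_000
TOTAL_FRAMES = 163_545

num_evaluations = TRAINING_STEPS // EVAL_EVERY
num_checkpoints = TRAINING_STEPS // SAVE_EVERY
samples_seen = TRAINING_STEPS * BATCH_SIZE  # assumes one batch of 8 per step
approx_epochs = samples_seen / TOTAL_FRAMES

print(f"evaluations: {num_evaluations}, checkpoints: {num_checkpoints}")
print(f"samples seen: {samples_seen:,} (about {approx_epochs:.2f} passes over the dataset)")
```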

Hardware Requirements

For Inference

Minimum GPU: 2GB VRAM (e.g., RTX 2060)
Recommended GPU: 4GB+ VRAM (e.g., RTX 3060 or better)
CPU: Modern Intel/AMD processor (4+ cores recommended)
RAM: 8GB minimum

For Fine-tuning

Recommended GPU: 12GB+ VRAM (e.g., RTX 3090, RTX 4090, A100)
GPU Memory: Larger batch sizes require proportionally more VRAM
Storage: ~10GB for dataset + checkpoint files
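
For a back-of-the-envelope view of the memory figures above, the weights alone can be sized from the parameter count and the bytes per parameter. The sketch below covers weights only; activations, the CUDA context, and optimizer state during fine-tuning are not included, so real usage is higher. The helper name is illustrative.

```python
TOTAL_PARAMS = 450_000_000  # from the Model Architecture section
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory taken by the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(TOTAL_PARAMS, dtype):.1f} GB")
```

The float32 figure sits close to the 2GB minimum quoted for inference, which is why a 4GB or larger GPU is the more comfortable choice.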

Files in Repository

model.safetensors - Trained model weights (1.2GB)
config.json - Model architecture configuration
policy_preprocessor.json - Input preprocessing configuration
policy_postprocessor.json - Output postprocessing configuration
policy_preprocessor_step_5_normalizer_processor.safetensors - Normalizer state
policy_postprocessor_step_0_unnormalizer_processor.safetensors - Denormalizer state
train_config.json - Training configuration metadata
README.md - This file
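
After downloading the repository, a short check can confirm that every file listed above is present before loading the policy. This is a minimal sketch using only the standard library; the helper name and the local directory path are hypothetical.

```python
from pathlib import Path

# File names copied from the list above (README.md omitted since it carries no model state).
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "policy_preprocessor.json",
    "policy_postprocessor.json",
    "policy_preprocessor_step_5_normalizer_processor.safetensors",
    "policy_postprocessor_step_0_unnormalizer_processor.safetensors",
    "train_config.json",
)


def missing_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "./alphabot_smolvla_1st_edition" is a hypothetical local download path.
    print(missing_files("./alphabot_smolvla_1st_edition"))
```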

License

Apache License 2.0 - See LICENSE file for details

Citation

If you use this model, please cite:

@software{lerobot,
  title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
  author={Cadene, Remi and others},
  url={https://github.com/huggingface/lerobot},
  year={2024}
}

@article{shukor2025smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Shukor, Mustafa and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

Support & Questions

For issues or questions about:

This Model: Check the LeRobot documentation
LeRobot Framework: Visit the GitHub repository (https://github.com/huggingface/lerobot)
HuggingFace Hub: See the Hub documentation

Disclaimer

This model is provided as-is. Users are responsible for ensuring its safe and appropriate use. The authors are not liable for any misuse or damage caused by this model.

Model Card Last Updated: 2026-06-15
Training Completed: 2026-06-13
Total Training Steps: 100,000
