SmolVLA Model - First Edition (alphabot_smolvla_1st_edition)

A fine-tuned Vision-Language-Action (VLA) model trained with LeRobot for robot control tasks.

Model Description

This model is SmolVLA fine-tuned on the alphabot2 robot dataset. It takes camera images, proprioceptive state, and a language instruction as input and predicts robot control actions.

Model Architecture

  • Base Architecture: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
  • Type: Vision-Language-Action (VLA) Policy
  • Total Parameters: 450M
  • Learnable Parameters: 100M
  • Framework: LeRobot

Training Details

Dataset

  • Dataset ID: alphabot2/ai2robot_full_900_episodes_10fps_JC_intern_matched_from_full
  • Total Frames: 163,545
  • Total Episodes: 831
  • Video FPS: 10 fps
  • Chunk Size: 50 frames
  • Action Horizon: 50 steps
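
The dataset figures above imply an average episode length and a time span per action chunk. The sketch below works both out from the listed numbers; it assumes actions are issued at the same 10 Hz rate as the video, which the card does not state explicitly.

```python
# Figures copied from the dataset list above.
TOTAL_FRAMES = 163_545
TOTAL_EPISODES = 831
FPS = 10
CHUNK_SIZE = 50  # frames (actions) per predicted chunk


def mean_episode_seconds(total_frames: int, episodes: int, fps: int) -> float:
    """Average episode duration in seconds."""
    return total_frames / episodes / fps


def chunk_seconds(chunk_size: int, fps: int) -> float:
    """Wall-clock time covered by one chunk, assuming one action per frame."""
    return chunk_size / fps


print(f"mean episode length: {TOTAL_FRAMES / TOTAL_EPISODES:.1f} frames")
print(f"mean episode duration: {mean_episode_seconds(TOTAL_FRAMES, TOTAL_EPISODES, FPS):.1f} s")
print(f"one chunk covers: {chunk_seconds(CHUNK_SIZE, FPS):.1f} s")
```

Under that assumption, one 50-step chunk covers about a quarter of an average episode.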

Training Configuration

  • Training Steps: 100,000
  • Batch Size: 8 (sized to fit the 8GB VRAM of an RTX 3070)
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Scheduler: Cosine decay with warmup
  • Mixed Precision: AMP (Automatic Mixed Precision) enabled
  • Training Time: ~48 hours on NVIDIA RTX 3070 Laptop
  • Grad Clip Norm: 1.0
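
To illustrate what "cosine decay with warmup" means for the learning rate, here is a minimal sketch of such a schedule. The peak learning rate and step count come from the list above; `WARMUP_STEPS` and `MIN_LR` are hypothetical values, because the card does not state them, and the exact schedule used by the training run may differ.

```python
import math

PEAK_LR = 1e-4         # from the training configuration above
TOTAL_STEPS = 100_000  # from the training configuration above
WARMUP_STEPS = 1_000   # hypothetical: warmup length is not stated in the card
MIN_LR = 0.0           # hypothetical: final learning rate is not stated in the card


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          total_steps: int = TOTAL_STEPS,
          warmup_steps: int = WARMUP_STEPS,
          min_lr: float = MIN_LR) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 50_500, 100_000):
    print(step, f"{lr_at(step):.2e}")
```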

Preprocessing

  • Video Backend: torchcodec with pyav fallback
  • Input Normalization: Running statistics computed from training dataset
  • Output Denormalization: Inverse of the action normalization, using action statistics from the training dataset
  • Resolution: 224x224 (standard for SmolVLM)
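
The normalization and denormalization steps above are inverses of each other. The sketch below shows the idea with per-dimension mean/std scaling. This is an assumption for illustration: the card does not say whether mean/std or min/max statistics are used, and the real statistics live in the normalizer files shipped with the model, not in this code.

```python
from typing import List, Sequence

EPS = 1e-8  # guards against division by zero for constant dimensions


def normalize(values: Sequence[float],
              mean: Sequence[float],
              std: Sequence[float]) -> List[float]:
    """Map raw values to zero-mean, unit-variance space, per dimension."""
    return [(v - m) / (s + EPS) for v, m, s in zip(values, mean, std)]


def denormalize(values: Sequence[float],
                mean: Sequence[float],
                std: Sequence[float]) -> List[float]:
    """Inverse of normalize: map model outputs back to raw action units."""
    return [v * (s + EPS) + m for v, m, s in zip(values, mean, std)]


# Hypothetical statistics for a 2-dimensional action.
action_mean = [0.10, -0.20]
action_std = [0.50, 0.25]

raw_action = [0.35, -0.45]
model_space = normalize(raw_action, action_mean, action_std)
recovered = denormalize(model_space, action_mean, action_std)
print(model_space, recovered)
```

The round trip returns the original action up to floating-point error, which is the property the postprocessor relies on.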

Usage

Loading the Model

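import torch
# PreTrainedPolicy is an abstract base class, so load the concrete SmolVLA policy.
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the fine-tuned weights and config from the Hugging Face Hub
policy = SmolVLAPolicy.from_pretrained(
    "alphabot2/alphabot_smolvla_1st_edition"
)

# Use a GPU when available and switch to evaluation mode (important!)
device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.eval()

# Run inference. observation_dict is a placeholder you must build yourself:
# it is assumed to hold already-preprocessed camera images, proprioceptive
# state, and the task instruction, as tensors on the same device as the policy.
with torch.no_grad():
    action = policy.select_action(observation_dict)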

Inference Requirements

  • PyTorch with CUDA support (GPU recommended)
  • LeRobot library
  • torchcodec for video processing
  • ~2GB GPU VRAM minimum for inference
  • Input observation must include visual data and proprioceptive state
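
Because the policy needs both visual data and proprioceptive state, it can help to check an observation before passing it to the model. The helper below is a hypothetical sketch, not part of LeRobot. It assumes the usual LeRobot key naming (`observation.images.<camera>` and `observation.state`); the camera name `front` is made up, and the actual keys expected by this model are defined in its config.json.

```python
from typing import Any, List, Mapping

IMAGE_PREFIX = "observation.images."  # assumed LeRobot naming convention
STATE_KEY = "observation.state"       # assumed LeRobot naming convention


def missing_observation_parts(observation: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not any(key.startswith(IMAGE_PREFIX) for key in observation):
        problems.append(f"no camera image under '{IMAGE_PREFIX}<camera>'")
    if STATE_KEY not in observation:
        problems.append(f"missing proprioceptive state '{STATE_KEY}'")
    return problems


# Placeholder values stand in for real image and state tensors.
example = {"observation.images.front": "image tensor", "observation.state": "state tensor"}
print(missing_observation_parts(example))
print(missing_observation_parts({"observation.state": "state tensor"}))
```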

Model Card Details

Intended Use

  • Primary: Robot control and imitation learning via vision and language
  • Supported Tasks: Robot manipulation tasks in the alphabot environment
  • Training Data: Demonstrations collected from human operators

Limitations

  • Model is specialized for the alphabot2 robot platform
  • Performance on out-of-distribution scenarios may be limited
  • Requires proper observation preprocessing (normalization, etc.)

Ethical Considerations

  • This is an imitation learning model trained on human demonstrations
  • Use only in controlled research/educational environments
  • Not intended for autonomous systems without human oversight
  • Ensure compliance with local regulations for robot operation

Training Procedure

The model was trained using LeRobot's standard training pipeline:

  1. Data Loading: Video frames processed with torchcodec backend
  2. Error Handling: Corrupted samples automatically skipped during training
  3. Batch Processing: 8 samples per batch with gradient accumulation
  4. Loss Function: Flow-matching imitation loss on predicted action chunks (SmolVLA's training objective)
  5. Evaluation: Periodic evaluation every 1,000 steps
  6. Checkpointing: Saved every 5,000 steps
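
Steps 2 and 3 above can be sketched as a loader that skips unreadable samples and groups the rest into batches of 8. This is an illustrative stand-in only: the function names are hypothetical, the choice of exceptions to catch is an assumption, and the card does not describe how the actual run detected or skipped corrupted samples.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_valid_samples(indices: Iterable[int],
                       load: Callable[[int], T],
                       skipped: List[int]) -> Iterator[T]:
    """Yield loaded samples, recording and skipping indices that fail to load."""
    for index in indices:
        try:
            yield load(index)
        except (OSError, ValueError):  # assumed failure modes of a decoder
            skipped.append(index)


def batched(samples: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group samples into lists of batch_size; the last batch may be smaller."""
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def demo_load(index: int) -> str:
    """Toy loader: pretends that sample 3 is a corrupted video frame."""
    if index == 3:
        raise ValueError("corrupted frame")
    return f"sample-{index}"


skipped_indices: List[int] = []
batches = list(batched(iter_valid_samples(range(10), demo_load, skipped_indices), 8))
print([len(b) for b in batches], skipped_indices)
```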

All training artifacts include proper preprocessor/postprocessor configurations for handling input normalization and output denormalization.

Hardware Requirements

For Inference

  • Minimum GPU: 2GB VRAM (e.g., RTX 2060)
  • Recommended GPU: 4GB+ VRAM (e.g., RTX 3060 or better)
  • CPU: Modern Intel/AMD processor (4+ cores recommended)
  • RAM: 8GB minimum
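
As a rough cross-check of the VRAM figures, the memory needed for the weights alone can be estimated from the parameter count. The sketch below does that arithmetic for two common precisions. It ignores activations, image buffers, and framework overhead, so real usage will be higher.

```python
PARAMS = 450_000_000  # total parameters, from the Model Architecture list


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in (("float32", 4), ("bfloat16", 2)):
    print(f"{name}: {weight_memory_gb(PARAMS, nbytes):.1f} GB")
```

By this estimate the weights take about 1.8 GB in float32 and 0.9 GB in bfloat16, so the 2GB minimum leaves little headroom unless part of the model is kept in reduced precision.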

For Fine-tuning

  • Recommended GPU: 12GB+ VRAM (e.g., RTX 3090, RTX 4090, A100)
  • GPU Memory: Larger batch sizes require proportionally more VRAM
  • Storage: ~10GB for dataset + checkpoint files

Files in Repository

  • model.safetensors - Trained model weights (1.2GB)
  • config.json - Model architecture configuration
  • policy_preprocessor.json - Input preprocessing configuration
  • policy_postprocessor.json - Output postprocessing configuration
  • policy_preprocessor_step_5_normalizer_processor.safetensors - Normalizer state
  • policy_postprocessor_step_0_unnormalizer_processor.safetensors - Denormalizer state
  • train_config.json - Training configuration metadata
  • README.md - This file

License

Apache License 2.0 - See LICENSE file for details

Citation

If you use this model, please cite:

@software{lerobot,
  title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
  author={Cadene, Remi and Alibert, Simon and Soare, Alexander and others},
  url={https://github.com/huggingface/lerobot},
  year={2024}
}

@article{shukor2025smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Shukor, Mustafa and Aubakirova, Dana and Capuano, Francesco and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

Support & Questions

For issues or questions about:

Disclaimer

This model is provided as-is. Users are responsible for ensuring its safe and appropriate use. The authors are not liable for any misuse or damage caused by this model.


Model Card Last Updated: 2026-06-15
Training Completed: 2026-06-13
Total Training Steps: 100,000
