🧠 MoEVL‑Tiny‑0.5B
A pocket‑sized multimodal Mixture‑of‑Experts model trained for under $25.
MoEVL‑Tiny is an ultra‑efficient vision‑language model that pairs a frozen SigLIP‑SO400M vision encoder with a Qwen2.5‑0.5B language model augmented by custom sparse MoE layers and LoRA adapters. It targets on‑device image understanding on edge hardware and was trained entirely on free cloud credits.
🛠 Model Details
- Developed by: Randhir Kumar (Latent AI Labs)
- Model type: Multimodal Mixture‑of‑Experts (MoE) – image‑to‑text generation
- Language: English
- License: Apache 2.0
- Base models:
  - Vision Encoder: google/siglip-so400m-patch14-384
  - Language Model: Qwen/Qwen2.5-0.5B-Instruct
Architecture Highlights
| Component | Specification |
|---|---|
| Vision Encoder | SigLIP‑SO400M (frozen, ~400M params) |
| Projector | 2‑layer MLP (SiLU activation) – trained from scratch |
| Language Model | Qwen2.5‑0.5B (trained with 4‑bit QLoRA; merged and uploaded in BF16) |
| Sparse MoE | Last 4 transformer blocks replaced by SwiGLU‑based MoE layers (4 experts, top‑2 routing) |
| Token Compression | 2D adaptive average pooling: 729 → 49 visual tokens to save compute |
Trainable parameters: ~25M (out of ~1.5B total capacity).
Active parameters per token: ~0.5B.
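To make the 729 → 49 token compression concrete, here is a minimal pure‑Python sketch of 2D adaptive average pooling over the 27 × 27 patch grid. It is illustrative only: the function names are hypothetical, each token is reduced to a single number for brevity (in the model each token is a feature vector), and the bin rule (floor for the start, ceiling for the end) follows the usual adaptive‑pooling convention rather than anything confirmed from the repository code.

```python
import math


def adaptive_bins(in_size, out_size):
    """Return the [start, end) input range averaged into each output cell."""
    bins = []
    for i in range(out_size):
        start = (i * in_size) // out_size                      # floor
        end = ((i + 1) * in_size + out_size - 1) // out_size   # ceiling
        bins.append((start, end))
    return bins


def adaptive_avg_pool_2d(grid, out_size):
    """Average-pool a square grid (list of rows) down to out_size x out_size."""
    bins = adaptive_bins(len(grid), out_size)
    pooled = []
    for r0, r1 in bins:
        row = []
        for c0, c1 in bins:
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


side = math.isqrt(729)  # 729 visual tokens form a 27 x 27 grid
grid = [[float(r * side + c) for c in range(side)] for r in range(side)]
pooled = adaptive_avg_pool_2d(grid, 7)
num_tokens = sum(len(row) for row in pooled)
print(side, num_tokens)  # prints: 27 49
```

Because 27 is not divisible by 7, neighbouring pooling windows overlap by one row or column, which is expected for adaptive pooling.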
📊 Training Pipeline
Data
The model was trained in three specialized stages:
- LLaVA‑Instruct‑150K (~45k effective samples due to streaming constraints) – basic instruction tuning.
- PixMo‑Cap (~100k samples) – detailed captioning for dense visual grounding.
- HuggingFaceH4/llava-instruct-mix-vsft (30k samples) – final instruction fine‑tuning with embedded images.
Hyperparameters
- Optimiser: AdamW
- Learning rate: 5×10⁻⁵ (stages 1‑2), 2×10⁻⁵ (stage 3)
- Effective batch size: 16 (batch 2 × gradient accumulation 8)
- Precision: bfloat16 mixed precision
- MoE auxiliary loss: Load‑balancing + z‑loss, coefficient 0.01
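The auxiliary loss can be sketched in plain Python as follows. This is an assumption‑laden illustration, not the repository's implementation: it uses a Switch‑Transformer‑style load‑balancing term (expert count × Σ dispatch fraction × mean router probability) and a router z‑loss (squared log‑sum‑exp of the router logits), and it assumes the 0.01 coefficient scales both terms together. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(logits):
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def top_k_route(logits, k=2):
    """Pick the k highest-probability experts and renormalise their weights."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def aux_loss(router_logits, k=2, coef=0.01):
    """Load-balancing term plus router z-loss, scaled by one coefficient (assumed)."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    dispatch = [0.0] * num_experts   # fraction of routing slots sent to each expert
    mean_prob = [0.0] * num_experts  # mean router probability for each expert
    z_loss = 0.0
    for logits in router_logits:
        for i in top_k_route(logits, k):
            dispatch[i] += 1.0 / (num_tokens * k)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / num_tokens
        z_loss += logsumexp(logits) ** 2 / num_tokens
    balance = num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))
    return coef * (balance + z_loss)


# Router logits for 8 tokens over 4 experts (made-up values for illustration).
collapsed = [[5.0, 5.0, 0.0, 0.0]] * 8                         # always experts 0 and 1
balanced = [[5.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 5.0]] * 4    # load spread evenly
print(aux_loss(collapsed) > aux_loss(balanced))
```

The load‑balancing term is smallest when tokens are spread evenly across experts, so a router that collapses onto two experts is penalised more than one that uses all four.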
Hardware & Cost
- GPU: 1× NVIDIA A10G (24 GB)
- Platform: Modal.com
- Total cloud cost: $0 (covered by free‑tier credits)
- Training time: ~18 hours across all stages
💻 Usage
Quick Start
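The custom architecture (modeling_moevl_tiny.py) ships with the repository. The example below downloads the repository snapshot, adds it to sys.path, and imports the model class directly; loading with trust_remote_code=True is the alternative.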
```python
import torch
from transformers import AutoProcessor
from huggingface_hub import snapshot_download
import sys

# 1. Download custom code and load model
model_path = snapshot_download("randhir302/MoEVL-Tiny-0.5B")
sys.path.append(model_path)
from modeling_moevl_tiny import MoEVLTinyForConditionalGeneration

model = MoEVLTinyForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()

# 2. Load standard SigLIP processor
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# 3. Prepare image and inference
image = ...  # PIL Image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
answer = model.generate(pixel_values, prompt_text="Describe this image.", max_new_tokens=128)
print(answer)
```
Generation Parameters
The model's internal generate method defaults to:
- temperature=0.3
- top_p=0.9
- top_k=50
- repetition_penalty=1.5
- no_repeat_ngram_size=4
(These can be overridden by passing the corresponding arguments to the generate() function).
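One way to manage overrides is to keep the documented defaults in a dictionary and merge per‑call changes into it. The sketch below assumes only what the text above states: the five parameter names and that generate() accepts them as keyword arguments. The helper `generation_kwargs` is hypothetical and not part of the repository.

```python
# Defaults as documented above; generate() itself is the repository's custom method.
DEFAULT_GENERATION = {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.5,
    "no_repeat_ngram_size": 4,
}


def generation_kwargs(**overrides):
    """Merge caller overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_GENERATION)
    if unknown:
        raise ValueError(f"unsupported generation arguments: {sorted(unknown)}")
    return {**DEFAULT_GENERATION, **overrides}


kwargs = generation_kwargs(temperature=0.7, repetition_penalty=1.2)
print(kwargs)

# With a loaded model, the merged settings would be passed along the lines of:
# answer = model.generate(pixel_values, prompt_text="Describe this image.",
#                         max_new_tokens=128, **kwargs)
```

Rejecting unknown keys up front catches misspelled parameter names before they reach the model.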
🔬 Evaluation & Qualitative Examples
The model has been tested on hand‑picked out-of-distribution images (Picsum, Pexels) and a 100‑sample subset from the training domain. While quantitative metrics (BLEU) are currently low due to the small data regime, qualitative outputs demonstrate solid visual grounding:
- 🐱 Cat: "a kitten with yellow hair… its head in front of the camera"
- 🍕 Pizza: "a pizza with brown sauce and cheese"
- 🚀 Astronaut: "an astronaut in a space suit floating"
- ☕ Coffee cup: "a cup of coffee on a wooden table"
⚠️ Known Limitations
- Hallucination: The model may invent objects and context not present in the image, especially for complex or highly cluttered scenes.
- Under‑training: Only ~30k instruction samples were used in the final stage. The model capacity can support much richer datasets.
- Repetition loops: Under greedy decoding, small models often repeat tokens. Use the recommended generation parameters (repetition_penalty=1.5 and no_repeat_ngram_size=4) to mitigate this.
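For readers unfamiliar with the two anti‑repetition settings, the sketch below shows how they are commonly implemented: a repetition penalty that rescales the logits of tokens already generated, and n‑gram blocking that bans any token that would complete an n‑gram already present in the output. This mirrors the widely used convention for these parameters; whether the model's internal generate method implements them exactly this way is an assumption, and both function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """Make already-generated tokens less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so the score always moves downward.
    """
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_next_tokens(generated_ids, ngram_size=4):
    """Tokens that would complete an n-gram already present in the output."""
    if ngram_size < 2:
        raise ValueError("ngram_size must be at least 2 in this sketch")
    prefix = tuple(generated_ids[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated_ids) - ngram_size + 1):
        if tuple(generated_ids[i:i + ngram_size - 1]) == prefix:
            banned.add(generated_ids[i + ngram_size - 1])
    return banned


# Token IDs are made up for illustration.
history = [1, 2, 3, 4, 1, 2, 3]
print(banned_next_tokens(history, ngram_size=4))  # prints: {4}
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=1.5))
```

In the example, the output already contains the 4‑gram (1, 2, 3, 4) and currently ends in (1, 2, 3), so token 4 is banned for the next step.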
🚀 Future Work
- Mobile/Edge deployment via GGUF and llama.cpp quantization.
- Ablation studies: varying expert count, compression ratios, and projector depth.
- Scaling the dataset: Training on MMInstruct and ShareGPT4V for advanced reasoning.
📝 Citation
```bibtex
@misc{MoEVL-Tiny-0.5B,
  author = {Randhir Kumar},
  title = {MoEVL-Tiny: A Pocket-Sized Multimodal Mixture-of-Experts Model Trained for Under \$25},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/randhir302/MoEVL-Tiny-0.5B}}
}
```
🤝 Contact
- For questions, collaborations, or to follow the journey of building AI from scratch, open an issue on the Hugging Face model page or reach out on X (Twitter): @ranhdir302 | Latent AI Labs.