🧠 MoEVL‑Tiny‑0.5B

A pocket‑sized multimodal Mixture‑of‑Experts model trained for under $25.

MoEVL‑Tiny is a compact vision‑language model that pairs a frozen SigLIP‑SO400M vision encoder with a Qwen2.5‑0.5B language model augmented by custom sparse MoE layers and LoRA adapters. It targets on‑device image understanding on edge hardware and was trained entirely on free cloud credits.

🛠 Model Details

Developed by: Randhir Kumar (Latent AI Labs)
Model type: Multimodal Mixture‑of‑Experts (MoE) – image‑to‑text generation
Language: English
License: Apache 2.0
Base models:
- Vision Encoder: google/siglip-so400m-patch14-384
- Language Model: Qwen/Qwen2.5-0.5B-Instruct

Architecture Highlights

Component	Specification
Vision Encoder	SigLIP‑SO400M (frozen, ~400M params)
Projector	2‑layer MLP (SiLU activation) – trained from scratch
Language Model	Qwen2.5‑0.5B (trained with 4‑bit QLoRA; adapters merged and weights uploaded in BF16)
Sparse MoE	Last 4 transformer blocks replaced by SwiGLU‑based MoE layers (4 experts, top‑2 routing)
Token Compression	2D adaptive average pooling: 729 → 49 visual tokens to save compute
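
The token-compression row can be illustrated with a small sketch. It assumes the 729 visual tokens form a 27 × 27 patch grid that is pooled to 7 × 7, and that the window bounds follow the usual adaptive-pooling rule (floor for the start, ceiling for the end). The function name and the toy 2-dimensional features are hypothetical; this is not the repository's implementation.

```python
def adaptive_avg_pool_2d(grid, out_size):
    """Average-pool a square grid of feature vectors down to out_size x out_size.

    Assumed window rule: start = floor(i * n / out), end = ceil((i + 1) * n / out),
    so windows may overlap when n is not a multiple of out (27 -> 7 here).
    """
    n = len(grid)
    dim = len(grid[0][0])

    def bounds(i):
        start = (i * n) // out_size
        end = -(-((i + 1) * n) // out_size)  # integer ceiling division
        return start, end

    pooled = []
    for i in range(out_size):
        r0, r1 = bounds(i)
        row = []
        for j in range(out_size):
            c0, c1 = bounds(j)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[k] for v in cells) / len(cells) for k in range(dim)])
        pooled.append(row)
    return pooled


# 27 x 27 = 729 patch tokens, each a toy 2-dim feature (real features are much wider).
patches = [[[float(r), float(c)] for c in range(27)] for r in range(27)]
pooled = adaptive_avg_pool_2d(patches, 7)
num_tokens = sum(len(row) for row in pooled)
print(num_tokens)  # 49
```

Each pooled token is the mean of a small neighbourhood of patches, so the language model sees roughly 15× fewer visual tokens per image.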

Trainable parameters: ~25M (out of ~1.5B total capacity).
Active parameters per token: ~0.5B.
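
Top‑2 routing is why the active parameter count per token stays well below the total: only two of the four experts run for any given token. The sketch below shows the idea in plain Python. It assumes softmax router scores that are renormalised over the two selected experts, and it uses toy scalar functions in place of SwiGLU blocks; all names are hypothetical and the model's actual router may differ.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # assumption: weights renormalised over the top 2
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs; the other experts are skipped."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


# Four toy scalar "experts" standing in for SwiGLU feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [2.0, 0.5, 1.0, -1.0]
routes = top2_route(logits)
output = moe_forward(3.0, logits, experts)
print([i for i, _ in routes])  # [0, 2]
```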

📊 Training Pipeline

Data

The model was trained in three stages:

Stage 1: LLaVA‑Instruct‑150K (~45k effective samples due to streaming constraints) – basic instruction tuning.
Stage 2: PixMo‑Cap (~100k samples) – detailed captioning for dense visual grounding.
Stage 3: HuggingFaceH4/llava-instruct-mix-vsft (30k samples) – final instruction fine‑tuning with embedded images.

Hyperparameters

Optimiser: AdamW
Learning rate: 5×10⁻⁵ (stages 1‑2), 2×10⁻⁵ (stage 3)
Effective batch size: 16 (batch 2 × gradient accumulation 8)
Precision: bfloat16 mixed precision
MoE auxiliary loss: Load‑balancing + z‑loss, coefficient 0.01
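
For readers unfamiliar with the auxiliary loss, here is a minimal sketch of a load‑balancing term plus a router z‑loss. It assumes the common formulations (number of experts × Σ load fraction × mean router probability, and the mean squared log‑sum‑exp of the router logits) and a single 0.01 coefficient applied to their sum. The function name and the exact weighting are assumptions, not taken from the training code, and a real implementation would operate on tensors rather than Python lists.

```python
import math


def moe_aux_loss(batch_logits, top_k=2, coef=0.01):
    """Return (scaled_total, balance_term, z_term) for a batch of router logits."""
    num_tokens, num_experts = len(batch_logits), len(batch_logits[0])
    load = [0.0] * num_experts        # fraction of routing slots assigned to each expert
    importance = [0.0] * num_experts  # mean router probability per expert
    z_term = 0.0
    for logits in batch_logits:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        chosen = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in chosen:
            load[i] += 1.0 / (num_tokens * top_k)
        for i in range(num_experts):
            importance[i] += probs[i] / num_tokens
        log_sum_exp = m + math.log(total)
        z_term += log_sum_exp ** 2 / num_tokens  # penalises large router logits
    balance_term = num_experts * sum(f * p for f, p in zip(load, importance))
    return coef * (balance_term + z_term), balance_term, z_term


# Two tokens that use different expert pairs vs. two tokens that pile onto the same pair.
balanced = [[2.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
skewed = [[2.0, 2.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0]]
_, balanced_balance, balanced_z = moe_aux_loss(balanced)
_, skewed_balance, _ = moe_aux_loss(skewed)
print(balanced_balance < skewed_balance)  # True
```

The balance term is smallest when tokens spread evenly across experts, which is what discourages the router from collapsing onto one or two of them.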

Hardware & Cost

GPU: 1× NVIDIA A10G (24 GB)
Platform: Modal.com
Total cloud cost: $0 out of pocket (the under‑$25 compute bill was covered by free‑tier credits)
Training time: ~18 hours across all stages

💻 Usage

Quick Start

The custom architecture (modeling_moevl_tiny.py) is embedded in the repository. It can be loaded with trust_remote_code=True, or, as in the example below, by downloading the repository snapshot and importing the model class directly.

import torch
from transformers import AutoProcessor
from huggingface_hub import snapshot_download
import sys

# 1. Download custom code and load model
model_path = snapshot_download("randhir302/MoEVL-Tiny-0.5B")
sys.path.append(model_path)
from modeling_moevl_tiny import MoEVLTinyForConditionalGeneration

model = MoEVLTinyForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
).to("cuda").eval()

# 2. Load standard SigLIP processor
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

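# 3. Prepare the image and run inference
image = ...  # a PIL.Image.Image, e.g. Image.open(path).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
# Match the model's device and dtype (assumed necessary; harmless if generate() already does this)
pixel_values = pixel_values.to("cuda", dtype=torch.bfloat16)
answer = model.generate(pixel_values, prompt_text="Describe this image.", max_new_tokens=128)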

print(answer)

Generation Parameters

The model's internal generate method defaults to:

temperature=0.3
top_p=0.9
top_k=50
repetition_penalty=1.5
no_repeat_ngram_size=4

(These can be overridden by passing the corresponding arguments to generate().)
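
To give a feel for what temperature, top_k, and top_p do, the sketch below applies them to a toy four‑token vocabulary. It is a generic illustration of the standard technique (temperature scaling, then top‑k truncation, then nucleus filtering), not the model's internal code, and the function name is hypothetical.

```python
import math


def filtered_distribution(logits, temperature=0.3, top_k=50, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep at most top_k tokens, highest probability first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.5, 0.5, -1.0]
sharp = filtered_distribution(toy_logits, temperature=0.3)
flat = filtered_distribution(toy_logits, temperature=1.0)
print(sorted(sharp), sorted(flat))  # [0, 1] [0, 1, 2]
```

A low temperature such as 0.3 concentrates probability on the top tokens, so fewer candidates survive the top_p cut than at temperature 1.0.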

🔬 Evaluation & Qualitative Examples

The model was tested on hand‑picked out‑of‑distribution images (Picsum, Pexels) and a 100‑sample subset from the training domain. BLEU scores are currently low, likely because of the small training set, but the qualitative outputs below are visually grounded:

🐱 Cat: "a kitten with yellow hair… its head in front of the camera"
🍕 Pizza: "a pizza with brown sauce and cheese"
🚀 Astronaut: "an astronaut in a space suit floating"
☕ Coffee cup: "a cup of coffee on a wooden table"

⚠️ Known Limitations

Hallucination: The model may invent objects and context not present in the image, especially for complex or highly cluttered scenes.
Under‑training: Only ~30k instruction samples were used in the final stage. The model capacity can support much richer datasets.
Repetition loops: Small models often repeat tokens under greedy decoding. Use the recommended generation parameters (repetition_penalty and no_repeat_ngram_size) to mitigate this.
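
The two repetition controls mentioned above can be sketched in a few lines. This assumes the widely used definitions: the penalty divides positive logits (and multiplies negative ones) for tokens that were already generated, and n‑gram blocking bans any token that would recreate an n‑gram already present in the output. Function names are hypothetical, and this is not the model's own code.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """Down-weight already-generated tokens: divide positive logits, multiply negative ones."""
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


def banned_next_tokens(generated_ids, n=4):
    """Tokens that would complete an n-gram already present in generated_ids (n >= 2)."""
    if len(generated_ids) < n:
        return set()
    prefix = tuple(generated_ids[-(n - 1):])  # the last n-1 tokens
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            banned.add(generated_ids[i + n - 1])
    return banned


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1])
history = [5, 6, 7, 8, 9, 5, 6, 7]  # emitting 8 next would repeat the 4-gram (5, 6, 7, 8)
banned = banned_next_tokens(history, n=4)
print(banned)  # {8}
```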

🚀 Future Work

Mobile/Edge deployment via GGUF and llama.cpp quantization.
Ablation studies: varying expert count, compression ratios, and projector depth.
Scaling the dataset: Training on MMInstruct and ShareGPT4V for advanced reasoning.

📝 Citation

@misc{MoEVL-Tiny-0.5B,
  author = {Randhir Kumar},
  title = {MoEVL-Tiny: A Pocket-Sized Multimodal Mixture-of-Experts Model Trained for Under \$25},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/randhir302/MoEVL-Tiny-0.5B}}
}

🤝 Contact

For questions, collaborations, or to follow the journey of building AI from scratch, open an issue on the Hugging Face model page or reach out on X (Twitter): @ranhdir302 | Latent AI Labs.
