🧠 MoEVL‑Tiny‑0.5B

A pocket‑sized multimodal Mixture‑of‑Experts model trained for under $25.

MoEVL‑Tiny is an ultra‑efficient vision‑language model that pairs a frozen SigLIP‑SO400M vision encoder with a Qwen2.5‑0.5B language model augmented by custom sparse MoE layers and LoRA adapters. It is designed for on‑device image understanding on edge hardware and was trained entirely on free cloud credits.


🛠 Model Details

  • Developed by: Randhir Kumar (Latent AI Labs)
  • Model type: Multimodal Mixture‑of‑Experts (MoE) – image‑to‑text generation
  • Language: English
  • License: Apache 2.0
  • Base models:
    • Vision Encoder: google/siglip-so400m-patch14-384
    • Language Model: Qwen/Qwen2.5-0.5B-Instruct

Architecture Highlights

| Component | Specification |
|---|---|
| Vision Encoder | SigLIP‑SO400M (frozen, ~400M params) |
| Projector | 2‑layer MLP (SiLU activation), trained from scratch |
| Language Model | Qwen2.5‑0.5B (trained in 4‑bit QLoRA, merged and uploaded in pure BF16) |
| Sparse MoE | Last 4 transformer blocks replaced by SwiGLU‑based MoE layers (4 experts, top‑2 routing) |
| Token Compression | 2D adaptive average pooling: 729 → 49 visual tokens to save compute |
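The token compression step can be illustrated with a dependency‑free sketch. It assumes the 729 encoder tokens form a 27 × 27 grid that is pooled to 7 × 7 (the card only states 729 → 49), and it uses the bin rule that, to my understanding, matches `torch.nn.functional.adaptive_avg_pool2d`. The function names and the toy 2‑dimensional features are illustrative, not taken from the repository.

```python
import math

def adaptive_bins(in_size, out_size):
    """[start, end) ranges per output cell: start = floor(i*in/out), end = ceil((i+1)*in/out)."""
    return [
        ((i * in_size) // out_size, ((i + 1) * in_size + out_size - 1) // out_size)
        for i in range(out_size)
    ]

def adaptive_avg_pool_2d(grid, out_size):
    """Average-pool a square grid of feature vectors (grid[row][col] is a list of floats)."""
    bins = adaptive_bins(len(grid), out_size)
    dim = len(grid[0][0])
    pooled = []
    for r0, r1 in bins:
        row = []
        for c0, c1 in bins:
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[d] for v in cells) / len(cells) for d in range(dim)])
        pooled.append(row)
    return pooled

# Assumption: 729 visual tokens laid out row-major on a 27 x 27 grid.
side = math.isqrt(729)
tokens = [[float(i), 1.0] for i in range(729)]  # toy 2-dim features
grid = [tokens[r * side:(r + 1) * side] for r in range(side)]

pooled = adaptive_avg_pool_2d(grid, 7)
flat = [vec for row in pooled for vec in row]
print(len(tokens), "->", len(flat))
```

Because 27 is not divisible by 7, neighbouring bins overlap by one row or column; every pooled token still averages a contiguous 4 × 4 or 5 × 5 patch region.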

Trainable parameters: ~25M (out of ~1.5B total capacity).
Active parameters per token: ~0.5B.
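Top‑2 routing means each token passes through only 2 of the 4 experts in a MoE layer. The sketch below shows one common formulation: softmax over router logits, keep the two largest, renormalise their gates. Whether the repository renormalises the gates is an assumption, and the toy experts here simply scale their input rather than applying a SwiGLU MLP.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Pick the 2 highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, gate in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Four hypothetical experts; each one just multiplies its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 1.0]
routes = top2_route(logits)
output = moe_forward([1.0, -1.0], logits, experts)
print(routes)
print(output)
```

Experts 0 and 2 are never evaluated for this token, which is where the compute saving comes from.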


📊 Training Pipeline

Data

The model was trained in three specialized stages:

  1. LLaVA‑Instruct‑150K (~45k effective samples due to streaming constraints) – basic instruction tuning.
  2. PixMo‑Cap (~100k samples) – detailed captioning for dense visual grounding.
  3. HuggingFaceH4/llava-instruct-mix-vsft (30k samples) – final instruction fine‑tuning with embedded images.

Hyperparameters

  • Optimiser: AdamW
  • Learning rate: 5×10⁻⁵ (stages 1‑2), 2×10⁻⁵ (stage 3)
  • Effective batch size: 16 (batch 2 × gradient accumulation 8)
  • Precision: bfloat16 mixed precision
  • MoE auxiliary loss: Load‑balancing + z‑loss, coefficient 0.01
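For readers unfamiliar with the two auxiliary terms, here is a minimal sketch. The card does not give the formulas, so this assumes the widely used Switch‑Transformer‑style load‑balancing term and the ST‑MoE‑style router z‑loss, and it assumes the single coefficient 0.01 scales their sum. The function name and signature are illustrative.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

def moe_aux_loss(router_logits, top_k=2, coeff=0.01):
    """router_logits: one list of per-expert logits for each token."""
    n_tokens, n_experts = len(router_logits), len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: share of routing slots that went to expert i.
    counts = [0] * n_experts
    for row in probs:
        chosen = sorted(range(n_experts), key=lambda j: row[j], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
    f = [c / (n_tokens * top_k) for c in counts]

    # p[i]: mean router probability for expert i.
    p = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]

    balance = n_experts * sum(fi * pi for fi, pi in zip(f, p))
    z_loss = sum(logsumexp(row) ** 2 for row in router_logits) / n_tokens
    return coeff * (balance + z_loss), balance, z_loss

# Assumed weighting: total = task_loss + aux, with aux already scaled by coeff.
aux, balance, z_loss = moe_aux_loss([[0.0, 0.0, 0.0, 0.0]] * 8)
print(aux, balance, z_loss)
```

The balance term penalises routers that concentrate probability mass on the experts they already pick most often, while the z‑loss discourages large router logits.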

Hardware & Cost

  • GPU: 1× NVIDIA A10G (24 GB)
  • Platform: Modal.com
  • Total cloud cost: $0 out of pocket (covered by free‑tier credits)
  • Training time: ~18 hours across all stages

💻 Usage

Quick Start

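The custom architecture (modeling_moevl_tiny.py) is included in the repository. The example below downloads a snapshot of the repository and imports the model class from it; alternatively, pass trust_remote_code=True to load it directly.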

import sys

import torch
from huggingface_hub import snapshot_download
from transformers import AutoProcessor

# 1. Download the repository and import the custom model class from it
model_path = snapshot_download("randhir302/MoEVL-Tiny-0.5B")
sys.path.append(model_path)
from modeling_moevl_tiny import MoEVLTinyForConditionalGeneration

model = MoEVLTinyForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()

# 2. Load the standard SigLIP processor
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# 3. Preprocess the image and move it to the model's device and dtype
image = ...  # PIL Image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to("cuda", dtype=torch.bfloat16)

# 4. Generate a description
answer = model.generate(pixel_values, prompt_text="Describe this image.", max_new_tokens=128)

print(answer)

Generation Parameters

The model's internal generate method defaults to:

  • temperature=0.3
  • top_p=0.9
  • top_k=50
  • repetition_penalty=1.5
  • no_repeat_ngram_size=4

(These can be overridden by passing the corresponding arguments to the generate() function).
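To make the effect of temperature, top_k, and top_p concrete, the following is a simplified sketch of how these three settings typically shape the next‑token distribution. It is not the model's actual generate implementation; the function name and the four‑token toy vocabulary are assumptions for illustration.

```python
import math
import random

def filtered_distribution(logits, temperature=0.3, top_k=50, top_p=0.9):
    """Return {token_id: prob} after temperature scaling, top-k, then nucleus (top-p) filtering."""
    scaled = [(x / temperature, i) for i, x in enumerate(logits)]
    top = sorted(scaled, reverse=True)[:top_k]          # keep the k best-scoring tokens
    m = top[0][0]
    exps = [(math.exp(s - m), i) for s, i in top]
    total = sum(e for e, _ in exps)

    kept, cumulative = [], 0.0
    for e, i in exps:                                   # already sorted by probability
        kept.append((e / total, i))
        cumulative += e / total
        if cumulative >= top_p:                         # smallest prefix reaching top_p
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filtered_distribution(toy_logits)                   # card defaults
flat = filtered_distribution(toy_logits, temperature=1.0)   # higher temperature
print(sharp)
print(flat)

# Sampling from the filtered distribution:
next_token = random.choices(list(flat), weights=list(flat.values()))[0]
```

With the default temperature of 0.3 the toy distribution collapses onto the single most likely token, so sampling behaves almost greedily; raising the temperature lets more candidates survive the top_p cut.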


🔬 Evaluation & Qualitative Examples

The model has been tested on hand‑picked out-of-distribution images (Picsum, Pexels) and a 100‑sample subset from the training domain. While quantitative metrics (BLEU) are currently low due to the small data regime, qualitative outputs demonstrate solid visual grounding:

  • 🐱 Cat: "a kitten with yellow hair… its head in front of the camera"
  • 🍕 Pizza: "a pizza with brown sauce and cheese"
  • 🚀 Astronaut: "an astronaut in a space suit floating"
  • ☕ Coffee cup: "a cup of coffee on a wooden table"
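BLEU is built from clipped n‑gram precisions between a generated caption and a reference. The sketch below computes that one ingredient (no brevity penalty, no geometric mean) for the coffee‑cup caption above. The reference sentence is made up for illustration and is not from the evaluation set.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

candidate = "a cup of coffee on a wooden table"   # model output quoted in this card
reference = "a coffee cup on a wooden table"      # hypothetical reference caption

p1 = clipped_precision(candidate, reference, 1)
p2 = clipped_precision(candidate, reference, 2)
print(p1, p2)
```

Here the unigram precision is high while the bigram precision drops below one half, even though both sentences describe the same scene. Scores built on higher‑order n‑grams fall quickly under paraphrase, which is worth keeping in mind when reading BLEU for short captions.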

⚠️ Known Limitations

  • Hallucination: The model may invent objects and context not present in the image, especially for complex or highly cluttered scenes.
  • Under‑training: Only ~30k instruction samples were used in the final stage. The model capacity can support much richer datasets.
  • Repetition loops: Under greedy decoding, small models often fall into repeating tokens or phrases. Keep the default repetition_penalty and no_repeat_ngram_size settings to mitigate this.
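The two anti‑repetition settings mentioned above can be sketched as follows. This follows the commonly used definitions (a CTRL‑style penalty on already generated tokens, and a ban on any token that would complete an n‑gram seen earlier); it is an illustration under those assumptions, not the repository's code, and the helper names are hypothetical.

```python
def apply_repetition_penalty(logits, generated, penalty=1.5):
    """Make tokens that were already generated less likely."""
    out = list(logits)
    for tok in set(generated):
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_next_tokens(generated, ngram_size=4):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size:
        return set()
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        if tuple(generated[start:start + ngram_size - 1]) == prefix:
            banned.add(generated[start + ngram_size - 1])
    return banned

# Toy token ids: the sequence is about to repeat the 4-gram (5, 6, 7, 8).
history = [5, 6, 7, 8, 5, 6, 7]
banned = banned_next_tokens(history, ngram_size=4)
penalised = apply_repetition_penalty([3.0, -3.0, 1.0], generated=[0, 1], penalty=1.5)
print(banned)
print(penalised)
```

During decoding, the logits of banned tokens would be set to negative infinity before sampling, so the loop "5 6 7 8 5 6 7 8 …" cannot continue.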

🚀 Future Work

  • Mobile/Edge deployment via GGUF and llama.cpp quantization.
  • Ablation studies: varying expert count, compression ratios, and projector depth.
  • Scaling the dataset: Training on MMInstruct and ShareGPT4V for advanced reasoning.

📝 Citation

@misc{MoEVL-Tiny-0.5B,
  author = {Randhir Kumar},
  title = {MoEVL-Tiny: A Pocket-Sized Multimodal Mixture-of-Experts Model Trained for Under \$25},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/randhir302/MoEVL-Tiny-0.5B}}
}

🤝 Contact

  • For questions, collaborations, or to follow the journey of building AI from scratch, open an issue on the Hugging Face model page or reach out on X (Twitter): @ranhdir302 | Latent AI Labs.