Mini VLM (Built from Scratch)

A small vision-language model (VLM) built from scratch. Architecture: frozen CLIP vision encoder + linear projection layer + custom Transformer LLM decoder.

Architecture

  • Vision: CLIP ViT-B/32 (frozen)
  • Projection: Linear(512 → 384)
  • LLM: Custom Transformer (6 layers, 384 dim)
  • Dataset: COCO Captions (20k samples)
  • GPU: NVIDIA L4
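
The components listed above could be wired together roughly as follows. This is a minimal PyTorch sketch, not the released implementation: the 512 → 384 projection, the 6 layers, and the 384 hidden size come from this card, while the head count, vocabulary size, maximum length, the use of a single pooled 512-dim CLIP image embedding as one prefix token, and the class name `MiniVLM` are all assumptions made for illustration. A random tensor stands in for the frozen CLIP output so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

CLIP_DIM, LLM_DIM, N_LAYERS = 512, 384, 6   # values stated in the model card
N_HEADS, VOCAB, MAX_LEN = 6, 1000, 64       # assumptions, for illustration only


class MiniVLM(nn.Module):
    """Hypothetical layout: projected image embedding + decoder-only Transformer."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(CLIP_DIM, LLM_DIM)        # Projection: Linear(512 -> 384)
        self.tok_emb = nn.Embedding(VOCAB, LLM_DIM)
        self.pos_emb = nn.Embedding(MAX_LEN, LLM_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=LLM_DIM, nhead=N_HEADS, dim_feedforward=4 * LLM_DIM,
            dropout=0.0, batch_first=True, norm_first=True,
        )
        # Encoder layers plus a causal mask behave as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(
            layer, num_layers=N_LAYERS, enable_nested_tensor=False
        )
        self.ln = nn.LayerNorm(LLM_DIM)
        self.head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, image_feats, token_ids):
        # image_feats: (B, 512) from the frozen vision encoder; token_ids: (B, T)
        img = self.proj(image_feats).unsqueeze(1)        # (B, 1, 384) image prefix token
        txt = self.tok_emb(token_ids)                    # (B, T, 384)
        x = torch.cat([img, txt], dim=1)                 # (B, 1 + T, 384)
        seq_len = x.size(1)
        x = x + self.pos_emb(torch.arange(seq_len, device=x.device))
        # Causal mask: -inf above the diagonal blocks attention to future positions.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        x = self.blocks(x, mask=mask)
        return self.head(self.ln(x))                     # (B, 1 + T, VOCAB)


model = MiniVLM()
fake_clip = torch.randn(2, CLIP_DIM)             # stand-in for CLIP image embeddings
caption = torch.randint(0, VOCAB, (2, 10))
logits = model(fake_clip, caption)
print(logits.shape)  # torch.Size([2, 11, 1000])
```

In this layout the image occupies position 0 of the sequence, so the decoder can attend to it from every caption position.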

Training

  • Epochs: 3 | Final loss: 1.17
  • Pipeline: same as LLaVA Stage 1 (frozen vision encoder, projection layer trained on image-caption pairs)
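
A caption-alignment training step of this kind can be sketched as below. The sketch assumes the reported loss is a token-level cross-entropy over caption tokens and that the image contributes one prefix position; the card states neither. The helper names `freeze` and `caption_loss` and the padding id are hypothetical. Note that LLaVA Stage 1 also keeps the LLM frozen and trains only the projection, whereas this card only states that CLIP is frozen.

```python
import math
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: id used to pad captions to a common length


def freeze(module):
    """Disable gradients for a module, e.g. the vision encoder."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


def caption_loss(logits, token_ids, pad_id=PAD_ID):
    """Next-token cross-entropy for the sequence [image, tok_0, ..., tok_{T-1}].

    logits:    (B, 1 + T, V) model outputs
    token_ids: (B, T) caption tokens
    Position i predicts tok_i, so the last position has no target and is dropped.
    """
    pred = logits[:, :-1, :]                             # (B, T, V)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        token_ids.reshape(-1),
        ignore_index=pad_id,                             # padded positions add no loss
    )


# Sanity check: uniform logits over V classes give a loss of ln(V).
V = 50
logits = torch.zeros(2, 6, V)
tokens = torch.randint(1, V, (2, 5))                     # no padding in this batch
loss = caption_loss(logits, tokens)
assert abs(loss.item() - math.log(V)) < 1e-5

# Only parameters left trainable are handed to the optimizer.
encoder = freeze(torch.nn.Linear(8, 512))                # stand-in for the vision encoder
projection = torch.nn.Linear(512, 384)
trainable = [p for p in projection.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)        # learning rate is an assumption
```

Under these assumptions, a final loss of 1.17 would correspond to a per-token perplexity of about exp(1.17) ≈ 3.2 on the training captions.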