🔬 SupraVL-Nano-900k
A minimal, fully transparent Vision-Language Model trained from scratch
The smallest VLM you can actually understand, with every component built from scratch.
What is this?
SupraVL-Nano-900k is a tiny Vision-Language Model built entirely from scratch by SupraLabs as an educational reference implementation.
With only ~900k parameters, it demonstrates the core mechanics of modern VLMs (spatial visual tokenization, cross-modal attention, and autoregressive caption generation) in code that fits in a single Jupyter notebook.
This is not a production model. It is a transparent, readable blueprint for anyone who wants to understand how image-to-text models actually work under the hood.
Architecture
       Image (3 × 112 × 112)
                 │
                 ▼
┌─────────────────────────────────┐
│          CNN Encoder            │
│   3× Conv-BN-GELU + MaxPool     │
│     AdaptiveAvgPool(4×4)        │
│  → 16 spatial tokens × 64-d     │
└────────────────┬────────────────┘
                 │  Linear projection (64 → 128)
                 ▼
      16 Visual Tokens (128-d)
                 │
                 │  prepend to text sequence
                 ▼
┌─────────────────────────────────┐
│ GPT-2-style Transformer Decoder │
│  3 layers · d=128 · 4 heads     │
│  FF dim=256 · causal masking    │
│  weight-tied LM head            │
└────────────────┬────────────────┘
                 │
                 ▼
     Caption tokens (BOS → EOS)
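The token counts in the diagram can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each of the three encoder stages halves the spatial resolution with a 3×3, stride-2, padding-1 convolution (as the Quick start code further down does), and it takes the 48-token text limit from the `MAX_SEQ` value in that code. The helper names `conv_out` and `trace_shapes` are hypothetical.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(img_size: int = 112, n_stages: int = 3,
                 grid: int = 4, max_text: int = 48) -> dict:
    """Follow the feature-map side length through the encoder stages."""
    sides = [img_size]
    for _ in range(n_stages):
        sides.append(conv_out(sides[-1]))
    n_vis = grid * grid  # adaptive pooling to grid×grid yields grid² tokens
    return {
        "feature_map_sides": sides,
        "visual_tokens": n_vis,
        "max_sequence": n_vis + max_text,  # visual prefix + text positions
    }


shapes = trace_shapes()
print(shapes)
```

Under these assumptions the feature map shrinks 112 → 56 → 28 → 14 before pooling, and the decoder sees at most 16 + 48 = 64 positions.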
Key design decisions
| Choice | Why |
|---|---|
| 16 spatial tokens (4×4 grid) instead of 1 global token | The decoder can attend to different image regions when generating different words, which is closer to how real VLMs work |
| BPE tokenizer (2048 tokens, trained on Flickr8k) | Handles real natural language; splits unknown words instead of failing |
| Prefix concatenation for vision-language fusion | Simple and effective: visual tokens are prepended to the text sequence and share the same attention mechanism |
| Weight tying (token emb ↔ LM head) | Standard trick from GPT-2; reduces params and stabilizes training |
| Mixed precision (AMP) | Full FP16/BF16 training on CUDA out of the box |
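To make the prefix-concatenation row concrete, here is a small pure-Python sketch of the attention pattern it implies. It assumes a single lower-triangular causal mask is applied over the whole sequence (16 visual positions followed by the text), which is how the Quick start code builds its mask; the function name `causal_mask` is hypothetical.

```python
N_VIS = 16  # length of the visual prefix (4×4 grid)


def causal_mask(n_text: int, n_vis: int = N_VIS) -> list[list[int]]:
    """mask[i][j] == 1 means position i may attend to position j."""
    total = n_vis + n_text
    return [[1 if j <= i else 0 for j in range(total)] for i in range(total)]


mask = causal_mask(n_text=3)
first_text_row = mask[N_VIS]               # row of the first text token (BOS)
visible_visual = sum(first_text_row[:N_VIS])
visible_text = sum(first_text_row[N_VIS:])
print(visible_visual, visible_text)        # 16 1
```

Because the visual tokens come first, every text position can see all 16 of them, while each text token sees only itself and earlier text. One side effect of this simple scheme is that the visual tokens are also masked causally among themselves.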
Training
| Setting | Value |
|---|---|
| Dataset | Flickr8k (30k train / 5k val pairs) |
| Epochs | 15 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 3e-4 → cosine decay → 3e-5 |
| Batch size | 64 |
| Precision | Mixed (AMP) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy next-token prediction (teacher forcing) |
| Hardware | Kaggle 2× T4 / Google Colab T4 |
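The learning-rate row can be illustrated with a short schedule function. This is a sketch under assumptions the table does not confirm: no warmup, one scheduler step per optimizer step, and the last partial batch of each epoch kept. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Cosine decay from lr_max (step 0) down to lr_min (final step)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Step budget implied by the table: 30k pairs, batch size 64, 15 epochs.
steps_per_epoch = math.ceil(30_000 / 64)
total_steps = 15 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

With these numbers the run is about 7,000 optimizer steps, and the rate passes through roughly 1.65e-4 at the halfway point.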
Inference
Quick start
pip install torch torchvision pillow huggingface_hub safetensors tokenizers
# ──────────────────────────────────────────────────────────────────────────────
import json, math
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
import torchvision.transforms as T
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer
# ── 1. Download artifacts ─────────────────────────────────────────────────────
REPO = "supralabs/SupraVL-Nano-900k"
ckpt_path = hf_hub_download(REPO, "model.safetensors")
tok_path = hf_hub_download(REPO, "tokenizer.json")
cfg_path = hf_hub_download(REPO, "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
tokenizer = Tokenizer.from_file(tok_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Config loaded:", json.dumps(cfg, indent=2))
# ── 2. Hyperparameters, mapped from the ACTUAL keys in config.json ────────────
#
# Actual config.json:
# { "D_MODEL":128, "N_HEADS":4, "N_LAYERS":3, "D_FF":256,
#   "VIS_CH":64, "N_VIS":16, "VOCAB_SIZE":2048,
#   "MAX_SEQ":48, "IMG_SIZE":112 }
#
N_EMBD = cfg["D_MODEL"]         # 128: embedding dimension
N_HEAD = cfg["N_HEADS"]         # 4: attention heads
N_LAYER = cfg["N_LAYERS"]       # 3: transformer blocks
D_FF = cfg["D_FF"]              # 256: MLP hidden dim (≠ 4×N_EMBD!)
VIS_CH = cfg["VIS_CH"]          # 64: output channels of the visual CNN
VIS_TOKENS = cfg["N_VIS"]       # 16: projected visual tokens
VOCAB_SIZE = cfg["VOCAB_SIZE"]  # 2048
MAX_SEQ = cfg["MAX_SEQ"]        # 48: maximum text length
IMG_SIZE = cfg["IMG_SIZE"]      # 112
TOTAL_POS = VIS_TOKENS + MAX_SEQ  # 64: total positions (visual + text)
# Special tokens: common BPE defaults; adjust if the tokenizer uses other ids
BOS_ID = cfg.get("bos_token_id", 1)
EOS_ID = cfg.get("eos_token_id", 2)
# ── 3. Helper modules ─────────────────────────────────────────────────────────
class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        assert N_EMBD % N_HEAD == 0, "N_EMBD must be divisible by N_HEAD"
        self.qkv = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)
        self.proj = nn.Linear(N_EMBD, N_EMBD, bias=False)
        self.n_head = N_HEAD
        # Causal mask of size TOTAL_POS × TOTAL_POS
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(TOTAL_POS, TOTAL_POS))
                 .view(1, 1, TOTAL_POS, TOTAL_POS)
        )

    def forward(self, x):
        B, T, C = x.shape
        nh, hs = self.n_head, C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, nh, hs).transpose(1, 2)
        k = k.view(B, T, nh, hs).transpose(1, 2)
        v = v.view(B, T, nh, hs).transpose(1, 2)
        # Scaled dot-product attention; future positions are masked out
        att = (q @ k.transpose(-2, -1)) * (hs ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class MLP(nn.Module):
    """MLP with D_FF=256 (as in the config), NOT 4×N_EMBD=512."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(N_EMBD, D_FF)  # 128 → 256
        self.fc2 = nn.Linear(D_FF, N_EMBD)  # 256 → 128

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = MLP()

    def forward(self, x):
        # Pre-norm residual block
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
# ── 4. Visual encoder (3-layer CNN → VIS_CH=64) ───────────────────────────────
class VisualEncoder(nn.Module):
    """
    Channel progression: 3 → VIS_CH//4 → VIS_CH//2 → VIS_CH
                         3 → 16 → 32 → 64
    Each conv has stride=2, so 112×112 → 56×56 → 28×28 → 14×14.
    AdaptiveAvgPool2d((4, 4)) → 16 tokens (= N_VIS = 4×4).
    Linear 64 → 128 projects into the transformer's embedding space.
    """
    def __init__(self):
        super().__init__()
        c1 = VIS_CH // 4  # 16
        c2 = VIS_CH // 2  # 32
        c3 = VIS_CH       # 64
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, c1, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c1), nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c2), nn.ReLU(inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(c2, c3, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c3), nn.ReLU(inplace=True)
        )
        grid = int(VIS_TOKENS ** 0.5)  # 4 → 4×4 = 16 tokens
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(c3, N_EMBD)  # 64 → 128

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))  # (B, 64, 14, 14)
        x = self.pool(x)                           # (B, 64, 4, 4)
        B, C, H, W = x.shape
        x = x.view(B, C, H * W).transpose(1, 2)    # (B, 16, 64)
        return self.proj(x)                        # (B, 16, 128)
# ── 5. Full MiniVLM model ─────────────────────────────────────────────────────
class MiniVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vis_enc = VisualEncoder()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)  # (2048, 128)
        self.pos_emb = nn.Embedding(TOTAL_POS, N_EMBD)   # (64, 128)
        self.blocks = nn.ModuleList([Block() for _ in range(N_LAYER)])
        self.ln_f = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE, bias=False)  # weight-tied

    def forward(self, img_tokens: torch.Tensor, tok_ids: torch.Tensor):
        """
        img_tokens : (B, VIS_TOKENS, N_EMBD)
        tok_ids    : (B, T)  with T ≤ MAX_SEQ
        """
        B, T = tok_ids.shape
        tok = self.tok_emb(tok_ids)                # (B, T, 128)
        seq = torch.cat([img_tokens, tok], dim=1)  # (B, 16+T, 128)
        pos = self.pos_emb(
            torch.arange(VIS_TOKENS + T, device=tok_ids.device)
        )                                          # (16+T, 128)
        x = seq + pos.unsqueeze(0)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)                     # (B, 16+T, 2048)

    @torch.no_grad()
    def generate_beam(
        self,
        img: torch.Tensor,
        beam_width: int = 3,
        max_new: int = 48,
    ) -> str:
        self.eval()
        img_tokens = self.vis_enc(img)  # (1, 16, 128)
        # beam = (log_score, list of token ids)
        beams: list[tuple[float, list[int]]] = [(0.0, [BOS_ID])]
        # Cap the loop at MAX_SEQ so the text never outgrows the position table
        for _ in range(min(max_new, MAX_SEQ)):
            candidates = []
            for score, seq in beams:
                if seq[-1] == EOS_ID:
                    # Finished beams are carried over unchanged
                    candidates.append((score, seq))
                    continue
                ids = torch.tensor([seq], dtype=torch.long, device=img.device)
                logits = self.forward(img_tokens, ids)  # (1, 16+T, 2048)
                # Logits for the next text token (last position, after the visual prefix)
                next_lg = logits[0, VIS_TOKENS + len(seq) - 1]
                lprobs = F.log_softmax(next_lg, dim=-1)
                topk = torch.topk(lprobs, beam_width)
                for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                    candidates.append((score + lp, seq + [tok]))
            # Keep the top-k beams
            beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
            if all(s[-1] == EOS_ID for _, s in beams):
                break
        best = beams[0][1]
        best = [t for t in best if t not in (BOS_ID, EOS_ID)]
        return tokenizer.decode(best)
# ── 6. Instantiate and load weights ───────────────────────────────────────────
model = MiniVLM()
weights = load_file(ckpt_path, device=str(device))
missing, unexpected = model.load_state_dict(weights, strict=False)
if missing:
    print(f"[WARN] Missing weights   : {missing}")
if unexpected:
    print(f"[WARN] Unexpected weights: {unexpected}")
# Weight tying: lm_head shares the tok_emb matrix
model.lm_head.weight = model.tok_emb.weight
model.to(device).eval()
print(f"[OK] Model loaded on {device} | params: {sum(p.numel() for p in model.parameters()):,}")
# ── 7. Preprocessing ──────────────────────────────────────────────────────────
transform = T.Compose([
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],
                [0.229, 0.224, 0.225]),
])
IMAGE_PATH = "image.png"  # replace with the path to your image
img = Image.open(IMAGE_PATH).convert("RGB")
img_t = transform(img).unsqueeze(0).to(device)  # (1, 3, 112, 112)
# ── 8. Generate a caption ─────────────────────────────────────────────────────
caption = model.generate_beam(img_t, beam_width=3, max_new=48)
print("Caption:", caption)
Generation strategies
| Method | Call | Characteristic |
|---|---|---|
| Greedy | `model.generate_greedy(img)` | Fast, deterministic |
| Top-k sampling | `model.generate_topk(img, temperature=0.8, top_k=50)` | Diverse, creative |
| Beam search | `model.generate_beam(img, beam_width=3)` | Most fluent |
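The Quick start code only defines `generate_beam`, so the sketch below illustrates how the greedy and top-k rows differ at the token-selection step. It is a pure-Python illustration, not the model's own methods: `pick_greedy`, `pick_top_k`, `decode`, and the stand-in `toy_next_logits` are all hypothetical names, and the BOS/EOS ids of 1 and 2 are the same assumed defaults used in the Quick start code. In the real model the logits would come from a forward pass over the visual prefix plus the tokens generated so far.

```python
import math
import random
from typing import Callable, List

BOS_ID, EOS_ID = 1, 2  # assumed ids, matching the Quick start defaults


def pick_greedy(logits: List[float]) -> int:
    """Always take the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=logits.__getitem__)


def pick_top_k(logits: List[float], temperature: float = 0.8,
               top_k: int = 50, rng=None) -> int:
    """Sample from the top_k highest logits after temperature scaling."""
    rng = rng or random
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def decode(next_logits: Callable[[List[int]], List[float]],
           pick: Callable[[List[float]], int], max_new: int = 48) -> List[int]:
    """Generic autoregressive loop: append tokens until EOS or max_new."""
    seq = [BOS_ID]
    for _ in range(max_new):
        tok = pick(next_logits(seq))
        seq.append(tok)
        if tok == EOS_ID:
            break
    return seq


def toy_next_logits(seq: List[int]) -> List[float]:
    """Hypothetical stand-in for the model: favors token 3 twice, then EOS."""
    logits = [0.0] * 5
    logits[3 if len(seq) < 3 else EOS_ID] = 5.0
    return logits


greedy_ids = decode(toy_next_logits, pick_greedy)
print(greedy_ids)  # [1, 3, 3, 2]
```

Setting `top_k=1` makes `pick_top_k` behave like `pick_greedy`, while a larger `top_k` or a higher temperature trades determinism for variety.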
Samples
Limitations
This model is intentionally minimal. Expect:
- Short, generic captions: "a dog is running in the grass" rather than rich descriptions
- Repetition on out-of-distribution images
- Limited vocabulary: 2048 BPE tokens cover Flickr8k well but not the full web
- No instruction following: purely image → caption, no chat or Q&A
- 112×112 input resolution: fine detail is lost
It was trained on a single consumer GPU in under an hour. Manage expectations accordingly.
Roadmap
This model is a foundation. Natural next steps to scale it:
- Replace CNN with a tiny ViT patch encoder
- Cross-attention layers instead of prefix concatenation (à la Flamingo)
- Pretrained frozen backbone (CLIP ViT-B/32)
- Larger vocabulary (30k+ tokens via tiktoken)
- Scale decoder to 6–12 layers, d=512+
- Train on CC3M / LAION-400M
- Scale SO much up
Repo contents
| File | Description |
|---|---|
| `model.pt` | Best checkpoint (model + optimizer + scheduler state) |
| `tokenizer.json` | BPE tokenizer trained on Flickr8k captions |
| `config.json` | Architecture hyperparameters |
| `README.md` | This file |
Citation
@misc{supravl-nano-900k,
author = {SupraLabs},
title = {SupraVL-Nano-900k: A Minimal Vision-Language Model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/supralabs/SupraVL-Nano-900k}},
}
License
Apache 2.0: free to use, modify, and distribute with attribution.
Built with 🔬 by SupraLabs · small models, big ideas


