🔬 SupraVL-Nano-900k

A minimal, fully transparent Vision-Language Model trained from scratch

The smallest VLM you can actually understand, with every component built from scratch.

What is this?

SupraVL-Nano-900k is a tiny Vision-Language Model built entirely from scratch by SupraLabs as an educational reference implementation.

With only ~900k parameters, it demonstrates the core mechanics of modern VLMs (spatial visual tokenization, cross-modal attention, and autoregressive caption generation) in code that fits in a single Jupyter notebook.

This is not a production model. It is a transparent, readable blueprint for anyone who wants to understand how image-to-text models actually work under the hood.

Architecture

Image (3 × 112 × 112)
    │
    ▼
┌─────────────────────────────────┐
│  CNN Encoder                    │
│  3× Conv-BN-GELU + MaxPool      │
│  AdaptiveAvgPool(4×4)           │
│  → 16 spatial tokens × 64-d     │
└────────────────┬────────────────┘
                 │  Linear projection (64 → 128)
                 ▼
         16 Visual Tokens (128-d)
                 │
                 │  prepend to text sequence
                 ▼
┌─────────────────────────────────┐
│  GPT-2-style Transformer Decoder│
│  3 layers · d=128 · 4 heads     │
│  FF dim=256 · causal masking    │
│  weight-tied LM head            │
└────────────────┬────────────────┘
                 │
                 ▼
         Caption tokens (BOS → EOS)
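
The 16-token figure in the diagram follows from plain shape arithmetic. The sketch below assumes the kernel-3, stride-2, padding-1 convolutions used in the quick-start code later in this README, plus the bin rule used by PyTorch's adaptive pooling (start = floor(i·in/out), end = ceil((i+1)·in/out)). The helper names are illustrative and not part of the repo.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def adaptive_pool_bins(in_size: int, out_size: int) -> list[tuple[int, int]]:
    """Half-open [start, end) input ranges averaged into each output cell."""
    bins = []
    for i in range(out_size):
        start = (i * in_size) // out_size             # floor
        end = -(-((i + 1) * in_size) // out_size)     # ceil, integer-only
        bins.append((start, end))
    return bins


side = 112
sizes = [side]
for _ in range(3):          # three stride-2 conv layers (assumed from the quick start)
    side = conv_out(side)
    sizes.append(side)

bins = adaptive_pool_bins(sizes[-1], 4)
n_tokens = len(bins) ** 2   # the same bins apply along height and width

print(sizes)     # [112, 56, 28, 14]
print(bins)      # [(0, 4), (3, 7), (7, 11), (10, 14)]
print(n_tokens)  # 16
```

Because 14 is not divisible by 4, neighbouring pooling bins overlap by one row or column, so adjacent visual tokens share a thin strip of the feature map.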

Key design decisions

Choice	Why
16 spatial tokens (4×4 grid) instead of 1 global token	The decoder can attend to different image regions when generating different words — closer to how real VLMs work
BPE tokenizer (2048 tokens, trained on Flickr8k)	Handles real natural language; splits unknown words instead of failing
Prefix concatenation for vision-language fusion	Simple and effective — visual tokens prepended to the text sequence, sharing the same attention mechanism
Weight tying (token emb ↔ LM head)	Standard trick from GPT-2; reduces params and stabilizes training
Mixed precision (AMP)	FP16/BF16 autocast training on CUDA out of the box
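
To make the effect of weight tying concrete, here is a rough parameter count done by hand. It assumes the layer shapes of the quick-start code in this README (biases on the convolutions, MLP and visual projection, no biases in attention); `count_params` is a hypothetical helper, not something shipped in the repo.

```python
def count_params(d_model=128, n_layers=3, d_ff=256, vis_ch=64,
                 n_vis=16, vocab=2048, max_seq=48, tie_lm_head=True) -> int:
    """Trainable parameters, assuming the quick-start layer shapes."""

    def conv_bn(c_in: int, c_out: int, k: int = 3) -> int:
        # conv weights + conv bias + BatchNorm scale and shift
        return c_in * c_out * k * k + c_out + 2 * c_out

    c1, c2, c3 = vis_ch // 4, vis_ch // 2, vis_ch
    encoder = conv_bn(3, c1) + conv_bn(c1, c2) + conv_bn(c2, c3)
    encoder += c3 * d_model + d_model                  # linear projection 64 -> 128

    embeddings = vocab * d_model + (n_vis + max_seq) * d_model   # token + position

    attention = 3 * d_model * d_model + d_model * d_model        # qkv + out proj, no bias
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                               # two LayerNorms per block
    block = attention + mlp + layer_norms

    final_ln = 2 * d_model
    lm_head = 0 if tie_lm_head else d_model * vocab               # tied head adds nothing

    return encoder + embeddings + n_layers * block + final_ln + lm_head


tied = count_params(tie_lm_head=True)
untied = count_params(tie_lm_head=False)
print(f"tied: {tied:,}  untied: {untied:,}  saved: {untied - tied:,}")
# tied: 698,624  untied: 960,768  saved: 262,144
```

Under these assumptions the untied total lands close to the 961k reported for the checkpoint file, which suggests that figure counts the LM head separately. That is an inference from the arithmetic, not something the repo states.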

Training

Setting	Value
Dataset	Flickr8k — 30k train / 5k val pairs
Epochs	15
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.01)
Learning rate	3e-4 → cosine decay → 3e-5
Batch size	64
Precision	Mixed (AMP)
Gradient clipping	1.0
Loss	Cross-entropy next-token prediction (teacher forcing)
Hardware	Kaggle 2× T4 / Google Colab T4
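
The learning-rate row can be read as a single cosine curve from 3e-4 down to 3e-5. A minimal sketch follows, assuming no warmup and per-step updates (neither is stated above). The step count is a hypothetical estimate from the 30k pairs, batch size 64, and 15 epochs in the table.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule length: ceil(30_000 / 64) steps per epoch, 15 epochs.
steps_per_epoch = math.ceil(30_000 / 64)
total_steps = 15 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

In PyTorch the same shape is available through `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=3e-5`; whether the original notebook used that scheduler is not stated here.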

Inference

Quick start

pip install torch torchvision pillow huggingface_hub safetensors tokenizers


# ══════════════════════════════════════════════════════════════════════════════

import json, math
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
import torchvision.transforms as T
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

# ── 1. Download artifacts ─────────────────────────────────────────────────────
REPO = "supralabs/SupraVL-Nano-900k"

ckpt_path = hf_hub_download(REPO, "model.safetensors")
tok_path  = hf_hub_download(REPO, "tokenizer.json")
cfg_path  = hf_hub_download(REPO, "config.json")

with open(cfg_path) as f:
    cfg = json.load(f)

tokenizer = Tokenizer.from_file(tok_path)
device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Config loaded:", json.dumps(cfg, indent=2))

# ── 2. Hyperparameters, mapped from the actual config.json keys ──────────────
#
#  Actual config.json:
#  { "D_MODEL":128, "N_HEADS":4, "N_LAYERS":3, "D_FF":256,
#    "VIS_CH":64,   "N_VIS":16,  "VOCAB_SIZE":2048,
#    "MAX_SEQ":48,  "IMG_SIZE":112 }
#
N_EMBD     = cfg["D_MODEL"]      # 128  – embedding dimension
N_HEAD     = cfg["N_HEADS"]      # 4    – attention heads
N_LAYER    = cfg["N_LAYERS"]     # 3    – transformer blocks
D_FF       = cfg["D_FF"]         # 256  – MLP hidden dim (not 4×N_EMBD!)
VIS_CH     = cfg["VIS_CH"]       # 64   – output channels of the visual CNN
VIS_TOKENS = cfg["N_VIS"]        # 16   – projected visual tokens
VOCAB_SIZE = cfg["VOCAB_SIZE"]   # 2048
MAX_SEQ    = cfg["MAX_SEQ"]      # 48   – maximum text length
IMG_SIZE   = cfg["IMG_SIZE"]     # 112
TOTAL_POS  = VIS_TOKENS + MAX_SEQ  # 64  – total positions (visual + text)

# Special tokens: common BPE defaults; adjust if the tokenizer uses other ids
BOS_ID = cfg.get("bos_token_id", 1)
EOS_ID = cfg.get("eos_token_id", 2)
# ── 3. Helper modules ─────────────────────────────────────────────────────────

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        assert N_EMBD % N_HEAD == 0, "N_EMBD must be divisible by N_HEAD"
        self.qkv    = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)
        self.proj   = nn.Linear(N_EMBD, N_EMBD,      bias=False)
        self.n_head = N_HEAD
        # Causal mask of size TOTAL_POS × TOTAL_POS
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(TOTAL_POS, TOTAL_POS))
              .view(1, 1, TOTAL_POS, TOTAL_POS)
        )

    def forward(self, x):
        B, T, C = x.shape
        nh, hs  = self.n_head, C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, nh, hs).transpose(1, 2)
        k = k.view(B, T, nh, hs).transpose(1, 2)
        v = v.view(B, T, nh, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (hs ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y   = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class MLP(nn.Module):
    """MLP with D_FF=256 (per the config), not 4×N_EMBD=512."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(N_EMBD, D_FF)   # 128 → 256
        self.fc2 = nn.Linear(D_FF, N_EMBD)   # 256 → 128

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1  = nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ln2  = nn.LayerNorm(N_EMBD)
        self.mlp  = MLP()

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


# ── 4. Visual encoder (3-layer CNN → VIS_CH=64) ──────────────────────────────

class VisualEncoder(nn.Module):
    """
    Channel progression: 3 → VIS_CH//4 → VIS_CH//2 → VIS_CH
                         3 →    16     →     32     →   64
    Each conv has stride=2, so 112×112 → 56×56 → 28×28 → 14×14.
    AdaptiveAvgPool2d((4, 4)) → 16 tokens (= N_VIS = 4×4).
    Linear 64 → 128 projects into the transformer space.
    """
    def __init__(self):
        super().__init__()
        c1 = VIS_CH // 4   # 16
        c2 = VIS_CH // 2   # 32
        c3 = VIS_CH        # 64

        self.conv1 = nn.Sequential(
            nn.Conv2d(3,  c1, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c1), nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c2), nn.ReLU(inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(c2, c3, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c3), nn.ReLU(inplace=True)
        )

        grid = int(VIS_TOKENS ** 0.5)   # 4  →  4×4 = 16 tokens
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(c3, N_EMBD)   # 64 → 128

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))   # (B, 64, 14, 14)
        x = self.pool(x)                            # (B, 64,  4,  4)
        B, C, H, W = x.shape
        x = x.view(B, C, H * W).transpose(1, 2)    # (B, 16, 64)
        return self.proj(x)                          # (B, 16, 128)


# ── 5. Full MiniVLM model ─────────────────────────────────────────────────────

class MiniVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vis_enc = VisualEncoder()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)   # (2048, 128)
        self.pos_emb = nn.Embedding(TOTAL_POS,  N_EMBD)   # (64,   128)
        self.blocks  = nn.ModuleList([Block() for _ in range(N_LAYER)])
        self.ln_f    = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE, bias=False)  # weight-tied

    def forward(self, img_tokens: torch.Tensor, tok_ids: torch.Tensor):
        """
        img_tokens : (B, VIS_TOKENS, N_EMBD)
        tok_ids    : (B, T)   T ≤ MAX_SEQ
        """
        B, T = tok_ids.shape
        tok  = self.tok_emb(tok_ids)                               # (B, T, 128)
        seq  = torch.cat([img_tokens, tok], dim=1)                 # (B, 16+T, 128)
        pos  = self.pos_emb(
            torch.arange(VIS_TOKENS + T, device=tok_ids.device)
        )                                                           # (16+T, 128)
        x = seq + pos.unsqueeze(0)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)                                     # (B, 16+T, 2048)

    @torch.no_grad()
    def generate_beam(
        self,
        img: torch.Tensor,
        beam_width: int = 3,
        max_new: int = 48,
    ) -> str:
        self.eval()
        # Keep the sequence within the learned positions (VIS_TOKENS + MAX_SEQ)
        max_new    = min(max_new, MAX_SEQ)
        img_tokens = self.vis_enc(img)              # (1, 16, 128)

        # beam = (log score, list of token ids)
        beams: list[tuple[float, list[int]]] = [(0.0, [BOS_ID])]

        for _ in range(max_new):
            candidates = []
            for score, seq in beams:
                # Finished beams are carried over unchanged
                if seq[-1] == EOS_ID:
                    candidates.append((score, seq))
                    continue

                ids     = torch.tensor([seq], dtype=torch.long, device=img.device)
                logits  = self(img_tokens, ids)                     # (1, 16+T, 2048)
                # logits for the next text token (last position, after the visual prefix)
                next_lg = logits[0, VIS_TOKENS + len(seq) - 1]
                lprobs  = F.log_softmax(next_lg, dim=-1)
                topk    = torch.topk(lprobs, beam_width)

                for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                    candidates.append((score + lp, seq + [tok]))

            # Keep the top beam_width beams
            beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]

            if all(s[-1] == EOS_ID for _, s in beams):
                break

        best = beams[0][1]
        best = [t for t in best if t not in (BOS_ID, EOS_ID)]
        return tokenizer.decode(best)


# ── 6. Instantiate and load weights ───────────────────────────────────────────
model   = MiniVLM()
weights = load_file(ckpt_path, device=str(device))

missing, unexpected = model.load_state_dict(weights, strict=False)
if missing:
    print(f"[WARN] Missing weights   : {missing}")
if unexpected:
    print(f"[WARN] Unexpected weights: {unexpected}")

# Weight tying: lm_head shares the tok_emb matrix
model.lm_head.weight = model.tok_emb.weight
model.to(device).eval()
print(f"[OK] Model loaded on {device}  |  params: {sum(p.numel() for p in model.parameters()):,}")

# ── 7. Preprocessing ──────────────────────────────────────────────────────────
transform = T.Compose([
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],
                [0.229, 0.224, 0.225]),
])

IMAGE_PATH = "image.png"     # ← replace with the path to your image

img   = Image.open(IMAGE_PATH).convert("RGB")
img_t = transform(img).unsqueeze(0).to(device)   # (1, 3, 112, 112)

# ── 8. Generate caption ───────────────────────────────────────────────────────
caption = model.generate_beam(img_t, beam_width=3, max_new=48)
print("Caption:", caption)
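
The beam search inside `generate_beam` can be tried without any model weights. The sketch below runs the same keep-the-best-k loop on a made-up next-token table (`TOY`, the token ids, and `toy_step` are all hypothetical). Unlike `generate_beam`, it expands every token in the toy vocabulary rather than only the top `beam_width` per beam.

```python
import math

BOS, EOS = 0, 1

# Hypothetical next-token probabilities keyed by the last token (not the real model).
TOY = {
    0: {2: 0.6, 3: 0.4},      # after BOS
    2: {1: 0.4, 3: 0.6},
    3: {1: 0.95, 2: 0.05},
}


def toy_step(seq):
    """Stand-in for a model forward pass: distribution over the next token."""
    return TOY[seq[-1]]


def beam_search(next_probs, beam_width=2, max_new=4):
    beams = [(0.0, [BOS])]                       # (log score, token ids)
    for _ in range(max_new):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOS:                   # finished beams carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == EOS for _, seq in beams):
            break
    return beams[0]


beam_score, beam_seq = beam_search(toy_step, beam_width=2)
greedy_score, greedy_seq = beam_search(toy_step, beam_width=1)   # width 1 is greedy

print(beam_seq, f"{math.exp(beam_score):.3f}")      # [0, 3, 1] 0.380
print(greedy_seq, f"{math.exp(greedy_score):.3f}")  # [0, 2, 3, 1] 0.342
```

Greedy commits to the locally best first token and ends up with a less probable sequence, while a width of 2 keeps the alternative alive long enough to win. No length normalization is applied here, matching the quick-start code.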

Generation strategies

Method	Call	Characteristic
Greedy	`model.generate_greedy(img)`	Fast, deterministic
Top-k sampling	`model.generate_topk(img, temperature=0.8, top_k=50)`	Diverse, creative
Beam search	`model.generate_beam(img, beam_width=3)`	Most fluent
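
The quick start above only defines `generate_beam`. As a rough illustration of what the other two strategies do at each decoding step, here is a pure-Python sketch of the token-selection rule on a made-up logit vector. `pick_greedy` and `sample_top_k` are hypothetical helpers and not the repo's `generate_greedy` / `generate_topk`; the defaults mirror the `temperature=0.8, top_k=50` shown in the table.

```python
import math
import random
from typing import Optional


def pick_greedy(logits: list[float]) -> int:
    """Return the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits: list[float], top_k: int = 50, temperature: float = 0.8,
                 rng: Optional[random.Random] = None) -> int:
    """Sample an index from the top_k largest logits after temperature scaling."""
    if rng is None:
        rng = random.Random()
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                                # subtract the max for stability
    weights = [math.exp(s - peak) for s in scaled]    # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


# Made-up logits over a 5-token vocabulary
logits = [0.1, 2.5, -1.0, 1.9, 0.3]
rng = random.Random(0)

greedy_choice = pick_greedy(logits)
sampled = [sample_top_k(logits, top_k=2, temperature=0.8, rng=rng) for _ in range(200)]

print(greedy_choice)         # 1
print(sorted(set(sampled)))  # only indices from the two largest logits (1 and 3)
```

Setting `top_k=1` reduces sampling to the greedy choice, and lowering `temperature` concentrates probability on the largest logit.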

Samples

Image	Caption
[image]	lst vide oline lst concre ickly vide lst ween ffic pies ffic blic roup blic vement drin etrack roup fferent lst camoufla vide should slid hild oline ween atform shap fferent lst hicle ffic hild pies pool black white camoufla dr atform ween pherd blic
[image]	ramp den m snow his should fferent oufla should lst umbrel leyball drin m woman - a ma
[image]	. ground hind cell stad stad kitch bm colo bm Lar pies slid Lar dience should hicle pies carni vement etrack leyball vement stad Roll should oline toge bm leyball hild adul hild fferent toge dog a a in st - ed . . is

Limitations

This model is intentionally minimal. Expect:

Short, generic captions — "a dog is running in the grass" rather than rich descriptions
Repetition on out-of-distribution images
Limited vocabulary — 2048 BPE tokens covers Flickr8k well but not the full web
No instruction following — purely image → caption, no chat or Q&A
112×112 input resolution — fine detail is lost

It was trained on a single consumer GPU in under an hour. Manage expectations accordingly.

Roadmap

This model is a foundation. Natural next steps to scale it:

Replace CNN with a tiny ViT patch encoder
Cross-attention layers instead of prefix concatenation (à la Flamingo)
Pretrained frozen backbone (CLIP ViT-B/32)
Larger vocabulary (30k+ tokens via tiktoken)
Scale decoder to 6–12 layers, d=512+
Train on CC3M / LAION-400M
Scale everything up by orders of magnitude

Repo contents

File	Description
`model.safetensors`	Inference weights loaded by the quick start
`model.pt`	Best checkpoint (model + optimizer + scheduler state)
`tokenizer.json`	BPE tokenizer trained on Flickr8k captions
`config.json`	Architecture hyperparameters
`README.md`	This file

Citation

@misc{supravl-nano-900k,
  author       = {SupraLabs},
  title        = {SupraVL-Nano-900k: A Minimal Vision-Language Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/supralabs/SupraVL-Nano-900k}},
}

License

Apache 2.0 — free to use, modify, and distribute with attribution.

Built with 🔬 by SupraLabs · small models, big ideas


Model size: 961k params · Tensor type: F32 (Safetensors)
