🔬 SupraVL-Nano-900k

A minimal, fully transparent Vision-Language Model trained from scratch

The smallest VLM you can actually understand, with every component built from scratch.


What is this?

SupraVL-Nano-900k is a tiny Vision-Language Model built entirely from scratch by SupraLabs as an educational reference implementation.

With only ~900k parameters, it demonstrates the core mechanics of modern VLMs (spatial visual tokenization, cross-modal attention, and autoregressive caption generation) in code that fits in a single Jupyter notebook.

This is not a production model. It is a transparent, readable blueprint for anyone who wants to understand how image-to-text models actually work under the hood.


Architecture

Image (3 × 112 × 112)
    │
    ▼
┌─────────────────────────────────┐
│  CNN Encoder                    │
│  3× Conv-BN-GELU + MaxPool      │
│  AdaptiveAvgPool(4×4)           │
│  → 16 spatial tokens × 64-d     │
└────────────────┬────────────────┘
                 │  Linear projection (64 → 128)
                 ▼
         16 Visual Tokens (128-d)
                 │
                 │  prepend to text sequence
                 ▼
┌─────────────────────────────────┐
│  GPT-2-style Transformer Decoder│
│  3 layers · d=128 · 4 heads     │
│  FF dim=256 · causal masking    │
│  weight-tied LM head            │
└────────────────┬────────────────┘
                 │
                 ▼
         Caption tokens (BOS → EOS)
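
The spatial arithmetic behind the diagram can be checked with a few lines of plain Python. This is a sketch under stated assumptions: each of the three stages halves the resolution using a 3×3 convolution with stride 2 and padding 1, as the quick-start code does (the diagram lists MaxPool instead, which would halve the resolution in the same way). The helper names `conv_out` and `trace_shapes` are hypothetical.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(img_size: int = 112, n_stages: int = 3,
                 grid: int = 4, max_seq: int = 48) -> dict:
    """Follow the feature-map size through the encoder and count positions."""
    sizes = [img_size]
    for _ in range(n_stages):
        sizes.append(conv_out(sizes[-1]))
    n_vis = grid * grid                      # tokens left after adaptive pooling
    return {
        "feature_map_sizes": sizes,          # side length after each stage
        "visual_tokens": n_vis,
        "max_positions": n_vis + max_seq,    # visual prefix + text positions
    }


info = trace_shapes()
print(info)
# {'feature_map_sizes': [112, 56, 28, 14], 'visual_tokens': 16, 'max_positions': 64}
```

The 14×14 map is averaged down to a 4×4 grid, so the decoder sees 16 visual positions followed by up to 48 text positions.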

Key design decisions

| Choice | Why |
|---|---|
| 16 spatial tokens (4×4 grid) instead of 1 global token | The decoder can attend to different image regions when generating different words, which is closer to how real VLMs work |
| BPE tokenizer (2048 tokens, trained on Flickr8k) | Handles real natural language; splits unknown words instead of failing |
| Prefix concatenation for vision-language fusion | Simple and effective: visual tokens are prepended to the text sequence and share the same attention mechanism |
| Weight tying (token emb ↔ LM head) | Standard trick from GPT-2; reduces params and stabilizes training |
| Mixed precision (AMP) | Full FP16/BF16 training on CUDA out of the box |
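
To make the effect of weight tying concrete, the following sketch counts parameters by hand. It assumes the layer layout of the quick-start code further down (three Conv+BatchNorm stages with bias, bias-free attention projections, biased MLP layers, learned position embeddings); the helper names are hypothetical, and a differently configured training notebook would give different totals.

```python
def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv_bn(c_in: int, c_out: int, k: int = 3) -> int:
    """Conv weights + conv bias + BatchNorm scale and shift."""
    return c_in * c_out * k * k + c_out + 2 * c_out


def count_params(d_model=128, n_layers=3, d_ff=256, vis_ch=64,
                 n_vis=16, vocab=2048, max_seq=48, tie=True) -> int:
    c1, c2, c3 = vis_ch // 4, vis_ch // 2, vis_ch
    encoder = (conv_bn(3, c1) + conv_bn(c1, c2) + conv_bn(c2, c3)
               + linear(c3, d_model))                    # CNN + projection
    embeddings = vocab * d_model + (n_vis + max_seq) * d_model
    layer_norm = 2 * d_model                             # scale + shift
    block = (2 * layer_norm
             + linear(d_model, 3 * d_model, bias=False)  # fused q, k, v
             + linear(d_model, d_model, bias=False)      # attention output
             + linear(d_model, d_ff) + linear(d_ff, d_model))
    head = 0 if tie else vocab * d_model                 # tied head adds nothing
    return encoder + embeddings + n_layers * block + layer_norm + head


print(count_params(tie=True))    # 698624
print(count_params(tie=False))   # 960768
```

Under these assumptions, tying removes 2048 × 128 = 262,144 parameters, roughly a quarter of the untied total.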

Training

| Setting | Value |
|---|---|
| Dataset | Flickr8k (30k train / 5k val pairs) |
| Epochs | 15 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 3e-4 → cosine decay → 3e-5 |
| Batch size | 64 |
| Precision | Mixed (AMP) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy next-token prediction (teacher forcing) |
| Hardware | Kaggle 2× T4 / Google Colab T4 |
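
The learning-rate row can be read as a standard cosine schedule. The sketch below is one plausible interpretation, not the training notebook's actual code: it assumes the rate is updated once per optimizer step, with no warmup, and that "30k train" means about 30,000 pairs, so the step counts are approximate. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Assumed schedule length: ~30k pairs, batch size 64, 15 epochs.
steps_per_epoch = math.ceil(30_000 / 64)
total_steps = 15 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(f"step {step:5d}  lr = {cosine_lr(step, total_steps):.2e}")
```

With PyTorch, the same shape is available through `torch.optim.lr_scheduler.CosineAnnealingLR` by setting `T_max` to the total step count and `eta_min` to 3e-5.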

Inference

Quick start

pip install torch torchvision pillow huggingface_hub safetensors tokenizers


# ══════════════════════════════════════════════════════════════════════════════

import json, math
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
import torchvision.transforms as T
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

# ── 1. Download artifacts ─────────────────────────────────────────────────────
REPO = "supralabs/SupraVL-Nano-900k"

ckpt_path = hf_hub_download(REPO, "model.safetensors")
tok_path  = hf_hub_download(REPO, "tokenizer.json")
cfg_path  = hf_hub_download(REPO, "config.json")

with open(cfg_path) as f:
    cfg = json.load(f)

tokenizer = Tokenizer.from_file(tok_path)
device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Config loaded:", json.dumps(cfg, indent=2))

# ── 2. Hyperparameters, mapped from the ACTUAL keys in config.json ────────────
#
#  Actual config.json:
#  { "D_MODEL":128, "N_HEADS":4, "N_LAYERS":3, "D_FF":256,
#    "VIS_CH":64,   "N_VIS":16,  "VOCAB_SIZE":2048,
#    "MAX_SEQ":48,  "IMG_SIZE":112 }
#
N_EMBD     = cfg["D_MODEL"]      # 128  - embedding dimension
N_HEAD     = cfg["N_HEADS"]      # 4    - attention heads
N_LAYER    = cfg["N_LAYERS"]     # 3    - transformer blocks
D_FF       = cfg["D_FF"]         # 256  - MLP inner dimension (≠ 4×N_EMBD!)
VIS_CH     = cfg["VIS_CH"]       # 64   - output channels of the visual CNN
VIS_TOKENS = cfg["N_VIS"]        # 16   - projected visual tokens
VOCAB_SIZE = cfg["VOCAB_SIZE"]   # 2048
MAX_SEQ    = cfg["MAX_SEQ"]      # 48   - maximum text length
IMG_SIZE   = cfg["IMG_SIZE"]     # 112
TOTAL_POS  = VIS_TOKENS + MAX_SEQ  # 64  - total positions (visual + text)

# Special tokens: common BPE defaults; adjust if the tokenizer uses other ids
BOS_ID = cfg.get("bos_token_id", 1)
EOS_ID = cfg.get("eos_token_id", 2)

# ── 3. Helper modules ─────────────────────────────────────────────────────────

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        assert N_EMBD % N_HEAD == 0, "N_EMBD must be divisible by N_HEAD"
        self.qkv    = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)
        self.proj   = nn.Linear(N_EMBD, N_EMBD,      bias=False)
        self.n_head = N_HEAD
        # Causal mask of size TOTAL_POS × TOTAL_POS
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(TOTAL_POS, TOTAL_POS))
              .view(1, 1, TOTAL_POS, TOTAL_POS)
        )

    def forward(self, x):
        B, T, C = x.shape
        nh, hs  = self.n_head, C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, nh, hs).transpose(1, 2)
        k = k.view(B, T, nh, hs).transpose(1, 2)
        v = v.view(B, T, nh, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (hs ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y   = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class MLP(nn.Module):
    """MLP with D_FF=256 (as in the config), NOT 4×N_EMBD=512."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(N_EMBD, D_FF)   # 128 → 256
        self.fc2 = nn.Linear(D_FF, N_EMBD)   # 256 → 128

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1  = nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ln2  = nn.LayerNorm(N_EMBD)
        self.mlp  = MLP()

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


# ── 4. Visual encoder (3-layer CNN → VIS_CH=64) ───────────────────────────────

class VisualEncoder(nn.Module):
    """
    Channel progression: 3 → VIS_CH//4 → VIS_CH//2 → VIS_CH
                         3 →    16     →     32     →   64
    Each conv has stride=2, so 112×112 → 56×56 → 28×28 → 14×14.
    AdaptiveAvgPool2d((4, 4)) → 16 tokens (= N_VIS = 4×4).
    Linear 64 → 128 projects into the transformer space.
    """
    def __init__(self):
        super().__init__()
        c1 = VIS_CH // 4   # 16
        c2 = VIS_CH // 2   # 32
        c3 = VIS_CH        # 64

        self.conv1 = nn.Sequential(
            nn.Conv2d(3,  c1, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c1), nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c2), nn.ReLU(inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(c2, c3, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c3), nn.ReLU(inplace=True)
        )

        grid = int(VIS_TOKENS ** 0.5)   # 4  →  4×4 = 16 tokens
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(c3, N_EMBD)   # 64 → 128

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))   # (B, 64, 14, 14)
        x = self.pool(x)                            # (B, 64,  4,  4)
        B, C, H, W = x.shape
        x = x.view(B, C, H * W).transpose(1, 2)    # (B, 16, 64)
        return self.proj(x)                          # (B, 16, 128)


# ── 5. Full MiniVLM model ─────────────────────────────────────────────────────

class MiniVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vis_enc = VisualEncoder()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)   # (2048, 128)
        self.pos_emb = nn.Embedding(TOTAL_POS,  N_EMBD)   # (64,   128)
        self.blocks  = nn.ModuleList([Block() for _ in range(N_LAYER)])
        self.ln_f    = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE, bias=False)  # weight-tied

    def forward(self, img_tokens: torch.Tensor, tok_ids: torch.Tensor):
        """
        img_tokens : (B, VIS_TOKENS, N_EMBD)
        tok_ids    : (B, T)   T ≤ MAX_SEQ
        """
        B, T = tok_ids.shape
        tok  = self.tok_emb(tok_ids)                               # (B, T, 128)
        seq  = torch.cat([img_tokens, tok], dim=1)                 # (B, 16+T, 128)
        pos  = self.pos_emb(
            torch.arange(VIS_TOKENS + T, device=tok_ids.device)
        )                                                           # (16+T, 128)
        x = seq + pos.unsqueeze(0)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)                                     # (B, 16+T, 2048)

    @torch.no_grad()
    def generate_beam(
        self,
        img: torch.Tensor,
        beam_width: int = 3,
        max_new: int = 48,
    ) -> str:
        self.eval()
        img_tokens = self.vis_enc(img)              # (1, 16, 128)

        # beam = (log_score, list of token ids)
        beams: list[tuple[float, list[int]]] = [(0.0, [BOS_ID])]

        for _ in range(max_new):
            candidates = []
            for score, seq in beams:
                if seq[-1] == EOS_ID:
                    candidates.append((score, seq))
                    continue

                ids     = torch.tensor([seq], dtype=torch.long, device=img.device)
                logits  = self.forward(img_tokens, ids)             # (1, 16+T, 2048)
                # logits for the next text token (last position, after the visual prefix)
                next_lg = logits[0, VIS_TOKENS + len(seq) - 1]
                lprobs  = F.log_softmax(next_lg, dim=-1)
                topk    = torch.topk(lprobs, beam_width)

                for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                    candidates.append((score + lp, seq + [tok]))

            # Keep the top-k beams
            beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]

            if all(s[-1] == EOS_ID for _, s in beams):
                break

        best = beams[0][1]
        best = [t for t in best if t not in (BOS_ID, EOS_ID)]
        return tokenizer.decode(best)


# ── 6. Instantiate and load weights ───────────────────────────────────────────
model   = MiniVLM()
weights = load_file(ckpt_path, device=str(device))

missing, unexpected = model.load_state_dict(weights, strict=False)
if missing:
    print(f"[WARN] Missing weights   : {missing}")
if unexpected:
    print(f"[WARN] Unexpected weights: {unexpected}")

# Weight tying: lm_head shares the tok_emb matrix
model.lm_head.weight = model.tok_emb.weight
model.to(device).eval()
print(f"[OK] Model loaded on {device}  |  params: {sum(p.numel() for p in model.parameters()):,}")

# ── 7. Preprocessing ──────────────────────────────────────────────────────────
transform = T.Compose([
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],
                [0.229, 0.224, 0.225]),
])

IMAGE_PATH = "image.png"     # ← set this to a real image path

img   = Image.open(IMAGE_PATH).convert("RGB")
img_t = transform(img).unsqueeze(0).to(device)   # (1, 3, 112, 112)

# ── 8. Generate a caption ─────────────────────────────────────────────────────
caption = model.generate_beam(img_t, beam_width=3, max_new=48)
print("Caption:", caption)

Generation strategies

| Method | Call | Characteristic |
|---|---|---|
| Greedy | model.generate_greedy(img) | Fast, deterministic |
| Top-k sampling | model.generate_topk(img, temperature=0.8, top_k=50) | Diverse, creative |
| Beam search | model.generate_beam(img, beam_width=3) | Most fluent |
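
The quick-start code above defines only `generate_beam`. As a rough illustration of what the other two strategies do at each decoding step, here is a framework-free sketch of the token choice alone. It operates on a plain list of logits; the function names `greedy_pick` and `top_k_sample` are hypothetical and are not the model's own methods.

```python
import math
import random


def greedy_pick(logits):
    """Return the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, top_k=50, temperature=0.8, rng=random):
    """Sample one index from the top_k largest logits after temperature scaling."""
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                                  # subtract max for stability
    weights = [math.exp(s - peak) for s in scaled]      # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5, 0.3]
print(greedy_pick(logits))                              # 1
print(top_k_sample(logits, top_k=2, rng=random.Random(0)))
```

A full generation loop would call one of these on the last-position logits, append the chosen id to the sequence, and stop at the EOS id or the length limit. A lower temperature pushes sampling toward the greedy choice.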

Samples

| Image | Caption |
|---|---|
| image | lst vide oline lst concre ickly vide lst ween ffic pies ffic blic roup blic vement drin etrack roup fferent lst camoufla vide should slid hild oline ween atform shap fferent lst hicle ffic hild pies pool black white camoufla dr atform ween pherd blic |
| image | ramp den m snow his should fferent oufla should lst umbrel leyball drin m woman - a ma |
| image | . ground hind cell stad stad kitch bm colo bm Lar pies slid Lar dience should hicle pies carni vement etrack leyball vement stad Roll should oline toge bm leyball hild adul hild fferent toge dog a a in st - ed . . is |

Limitations

This model is intentionally minimal. Expect:

  • Short, generic captions: "a dog is running in the grass" rather than rich descriptions
  • Repetition on out-of-distribution images
  • Limited vocabulary: 2048 BPE tokens cover Flickr8k well but not the full web
  • No instruction following: purely image → caption, no chat or Q&A
  • 112×112 input resolution: fine detail is lost

It was trained on a single consumer GPU in under an hour. Manage expectations accordingly.


Roadmap

This model is a foundation. Natural next steps to scale it:

  • Replace CNN with a tiny ViT patch encoder
  • Cross-attention layers instead of prefix concatenation (à la Flamingo)
  • Pretrained frozen backbone (CLIP ViT-B/32)
  • Larger vocabulary (30k+ tokens via tiktoken)
  • Scale decoder to 6–12 layers, d=512+
  • Train on CC3M / LAION-400M
  • Scale everything up substantially

Repo contents

| File | Description |
|---|---|
| model.pt | Best checkpoint (model + optimizer + scheduler state) |
| tokenizer.json | BPE tokenizer trained on Flickr8k captions |
| config.json | Architecture hyperparameters |
| README.md | This file |
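
Before building the model from `config.json`, it can help to sanity-check the hyperparameters. The sketch below uses the key names and values shown in the quick-start comments; the `check_config` helper and the specific checks it performs are assumptions for illustration, not part of the repository.

```python
import json
import math

EXPECTED_KEYS = {"D_MODEL", "N_HEADS", "N_LAYERS", "D_FF", "VIS_CH",
                 "N_VIS", "VOCAB_SIZE", "MAX_SEQ", "IMG_SIZE"}


def check_config(cfg: dict) -> dict:
    """Validate the architecture keys and return a few derived sizes."""
    missing = EXPECTED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if cfg["D_MODEL"] % cfg["N_HEADS"] != 0:
        raise ValueError("D_MODEL must be divisible by N_HEADS")
    grid = math.isqrt(cfg["N_VIS"])
    if grid * grid != cfg["N_VIS"]:
        raise ValueError("N_VIS must be a perfect square (grid × grid tokens)")
    return {
        "head_dim": cfg["D_MODEL"] // cfg["N_HEADS"],
        "grid": grid,
        "total_positions": cfg["N_VIS"] + cfg["MAX_SEQ"],
    }


# Values copied from the quick-start comments.
example = json.loads(
    '{"D_MODEL":128, "N_HEADS":4, "N_LAYERS":3, "D_FF":256, "VIS_CH":64,'
    ' "N_VIS":16, "VOCAB_SIZE":2048, "MAX_SEQ":48, "IMG_SIZE":112}'
)
print(check_config(example))
# {'head_dim': 32, 'grid': 4, 'total_positions': 64}
```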

Citation

@misc{supravl-nano-900k,
  author       = {SupraLabs},
  title        = {SupraVL-Nano-900k: A Minimal Vision-Language Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/supralabs/SupraVL-Nano-900k}},
}

License

Apache 2.0: free to use, modify, and distribute with attribution.


Built with 🔬 by SupraLabs · small models, big ideas
