QKVAE-1m is a 1.06-million-parameter quantizing convolutional autoencoder that compresses 96×96 RGB images into a 48×48 grid of 2,304 discrete visual tokens. The tokens come from a 15,360-code finite-scalar-quantization (FSQ) codebook. The point is to graft them onto a small language model's vocabulary as ordinary integer ids, so a 1M-parameter language model can see images without an external encoder, adapter, or vision tower.

We are actively working to make this far more efficient. The 1m here is a checkpoint on the way to a much smaller, much faster version.

Notice

QKVAE-1m is a research prototype trained on STL-10 (96×96 natural images of animals, vehicles, and similar subjects). Out-of-distribution inputs such as text, faces, diagrams, very large structures, and very small structures will reconstruct with the characteristic STL-10 "vibe." Use accordingly, and pair it with a real language model before drawing strong conclusions.


| Property | Value |
| --- | --- |
| Parameters | 1,063,304 trainable (1.06M) |
| Architecture | CNN encoder + FSQ + CNN decoder |
| Image Size | 96 × 96 × 3 (RGB) |
| Tokens per Image | 2,304 (48 × 48 grid) |
| Codebook Size | 15,360 (8 × 8 × 8 × 6 × 5 levels) |
| Latent Dim | 5 |
| Compression | 6.9× (27,648 → ~4,001 bytes) |
| Training Dataset | STL-10 train (5,000 images, 96×96) |
| Test Dataset | STL-10 test (8,000 images, 96×96) |
| PSNR (test) | 32.61 dB |
| SSIM (test) | 0.9743 |
| Training Steps | 6,000 |
| Training Time | ~1.3 minutes on a single RTX 5090 |
| Precision | bfloat16 autocast, fp32 weights |
| Output Range | tanh → clamped to [−1, 1] |
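
The compression figure can be approximated with a few lines of arithmetic. This is only a sketch: it assumes the 2,304 tokens are packed at the information-theoretic minimum of log2(15,360) bits each and compared against raw 8-bit RGB, which may not be exactly how the reported byte count was derived. Under that assumption the result lands within a few bytes of the reported ~4,001 and gives the same 6.9× ratio.

```python
import math

# Raw image: 96 x 96 pixels, 3 channels, one byte per channel.
raw_bytes = 96 * 96 * 3

# Token grid and codebook size as listed in the table.
tokens = 48 * 48
codebook = 8 * 8 * 8 * 6 * 5

# Assumption: ideal packing at log2(codebook) bits per token.
bits_per_token = math.log2(codebook)
packed_bytes = tokens * bits_per_token / 8
ratio = raw_bytes / packed_bytes

print(f"raw={raw_bytes} B, tokens={tokens}, codebook={codebook}")
print(f"bits/token={bits_per_token:.3f}, packed={packed_bytes:.0f} B, ratio={ratio:.1f}x")
```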

The QKVAE was built to answer one question: how small can a quantizing autoencoder get before reconstruction quality collapses, and is the result still useful as a discrete image tokenizer for a tiny language model? For 96×96 RGB on STL-10 the answer is 1.06 million parameters at 32.61 dB PSNR, and the 2,304 code tokens slot into a small language model's vocabulary as ordinary integer ids, with no vision tower required.

The training loop is deliberately ordinary. AdamW with OneCycleLR. bfloat16 autocast. L1 + 0.25×MSE + 0.5×Sobel-edge loss. 6,000 steps at batch 32 on the STL-10 train split, evaluated on the held-out test split. No EMA, no GAN, no perceptual loss, no discriminator, no augmentations beyond the natural STL-10 diversity. The only thing that makes this work is the FSQ codebook being large enough (15,360) to cover the STL-10 manifold and small enough that the gradient is informative end-to-end.
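
A minimal PyTorch sketch of a loss with this shape is shown below. The weights (1, 0.25, 0.5) come from the description above; everything else is an assumption made for illustration, including the function names `sobel_edges` and `reconstruction_loss`, the per-channel Sobel filtering, the zero padding, and the mean reduction. The actual training code may differ in any of these details.

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
_SOBEL_X = torch.tensor([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
_SOBEL_Y = _SOBEL_X.t()


def sobel_edges(x: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradients of a (B, C, H, W) batch; returns (B, 2*C, H, W)."""
    channels = x.shape[1]
    kernels = torch.stack([_SOBEL_X, _SOBEL_Y]).to(x)            # (2, 3, 3)
    weight = kernels.unsqueeze(1).repeat(channels, 1, 1, 1)      # (2*C, 1, 3, 3)
    # groups=channels filters each input channel independently (assumed).
    return F.conv2d(x, weight, padding=1, groups=channels)


def reconstruction_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 + 0.25 * MSE + 0.5 * L1 between Sobel edge maps (hypothetical helper)."""
    l1 = F.l1_loss(recon, target)
    mse = F.mse_loss(recon, target)
    edge = F.l1_loss(sobel_edges(recon), sobel_edges(target))
    return l1 + 0.25 * mse + 0.5 * edge


# Usage on random data shaped like a batch of 96x96 RGB images in [-1, 1].
target = torch.rand(4, 3, 96, 96) * 2 - 1
recon = target + 0.05 * torch.randn_like(target)
print(reconstruction_loss(recon, target).item())
```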

The 1m here is the first checkpoint. We are actively working to make this far more efficient: a smaller architecture, fewer codebook entries, less compute per forward pass, and fewer tokens per image. The 1m is published because it works today, and because every smaller version we try will be measured against it.


| Source | Path | Notes |
| --- | --- | --- |
| Glint Research logo | logo.png | Text/edge content; expect blurry typography. |
| Futuristic cityscape | cityscape.png | Architectural structure; expect pastel impressionism. |
| Natural landscape | landscape.png | Outdoor texture; closest to the training distribution. |

Logo reconstruction: Glint Research logo. Out-of-distribution text and sharp edges. Reconstruction reflects the STL-10 prior.

Cityscape reconstruction: Futuristic cityscape. Out-of-distribution architecture and high-frequency structure.

Landscape reconstruction: Natural landscape. In-distribution texture; expect the highest fidelity of the three. Yes, that's a Portal 2 reference.


| Parameter | Value |
| --- | --- |
| Architecture | Quantizing Convolutional Autoencoder |
| Stem | 3 → 104 channels, 3×3 conv, padding 1 |
| Encoder Downsample | 1× 4×4 stride-2 conv (96 → 48) |
| Encoder ResBlocks | 2 × (GroupNorm → SiLU → 3×3 conv → GroupNorm → SiLU → 3×3 conv + residual), 104 channels |
| Encoder Head | GroupNorm → SiLU → 1×1 conv to 5 latent channels |
| Quantizer | Finite Scalar Quantization, levels (8, 8, 8, 6, 5) |
| Codebook Size | 15,360 |
| Decoder Input | 3×3 conv 5 → 104 |
| Decoder ResBlocks | 2 × ResBlock(104) |
| Decoder Upsample | 1× nearest-neighbour 2× → 3×3 conv 104 → 104 → SiLU |
| Decoder Head | GroupNorm → SiLU → 3×3 conv 104 → 3, tanh |
| Normalization | GroupNorm (8 groups, 104 channels) |
| Activation | SiLU |
| Output Range | tanh, no clamping outside the model |
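
As a sanity check, the layer list above can be turned into a parameter count. The sketch below assumes every convolution has a bias, every GroupNorm has a learnable per-channel scale and shift, the downsample conv maps 104 → 104 channels, and the FSQ quantizer, SiLU, and upsampling carry no parameters. None of these are stated explicitly in the table, but under these assumptions the total matches the reported 1,063,304.

```python
def conv(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a k x k Conv2d with bias (bias is an assumption)."""
    return c_in * c_out * k * k + c_out


def group_norm(c: int) -> int:
    """Affine GroupNorm: one scale and one shift per channel (assumed)."""
    return 2 * c


def res_block(c: int) -> int:
    """(GroupNorm -> SiLU -> 3x3 conv) twice; the residual add is parameter-free."""
    return 2 * (group_norm(c) + conv(c, c, 3))


C, LATENT = 104, 5

encoder = (
    conv(3, C, 3)                              # stem
    + conv(C, C, 4)                            # 4x4 stride-2 downsample
    + 2 * res_block(C)                         # encoder ResBlocks
    + group_norm(C) + conv(C, LATENT, 1)       # encoder head
)
decoder = (
    conv(LATENT, C, 3)                         # decoder input conv
    + 2 * res_block(C)                         # decoder ResBlocks
    + conv(C, C, 3)                            # conv after nearest-neighbour upsample
    + group_norm(C) + conv(C, 3, 3)            # decoder head
)
total = encoder + decoder
print(encoder, decoder, total)  # 567429 495875 1063304
```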

| Parameter | Value |
| --- | --- |
| Dataset | STL-10 train, 96×96, 5,000 images |
| Test Set | STL-10 test, 96×96, 8,000 images |
| Augmentation | None (dataset is naturally diverse at 96×96) |
| Steps | 6,000 |
| Batch Size | 32 |
| Optimizer | AdamW, β=(0.9, 0.95), weight decay 1e-4 |
| Learning Rate | 3e-4 peak, OneCycleLR, 5% warmup, cosine decay |
| Loss | L1 + 0.25 × MSE + 0.5 × L1(∇Sobel(recon), ∇Sobel(target)) |
| Precision | bfloat16 autocast, fp32 reductions |
| Gradient Clip | 1.0 |
| Hardware | NVIDIA RTX 5090, 32 GB |
| Wall Time | 1.3 minutes (78 seconds) |
| Throughput | ~77 steps/second |
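
The optimizer and schedule rows map onto standard PyTorch components roughly as follows. This is a sketch, not the training script: the stand-in model, the `train_step` helper, and the plain L1 loss are placeholders, and `div_factor` / `final_div_factor` are left at PyTorch defaults because the card does not specify them. `cycle_momentum=False` is set here so that OneCycleLR does not cycle β1 away from the listed 0.9; whether the original run did the same is an assumption.

```python
import torch

TOTAL_STEPS = 6_000
PEAK_LR = 3e-4

# Stand-in module; the real model would be the QKVAE.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), weight_decay=1e-4
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=PEAK_LR,
    total_steps=TOTAL_STEPS,
    pct_start=0.05,            # 5% warmup
    anneal_strategy="cos",     # cosine decay
    cycle_momentum=False,      # keep betas fixed at (0.9, 0.95) (assumed)
)


def train_step(batch: torch.Tensor) -> float:
    """One optimisation step with bf16 autocast and gradient clipping at 1.0."""
    with torch.autocast(device_type=batch.device.type, dtype=torch.bfloat16):
        recon = model(batch)
    # Placeholder loss computed in fp32; the card's loss is L1 + MSE + Sobel edge.
    loss = torch.nn.functional.l1_loss(recon.float(), batch)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```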

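Loss was logged throughout training; selected values are reported below. The encoder, FSQ, and decoder train jointly from the first step, with no per-module warmup and no staged unfreezing (the 5% warmup listed above applies to the learning rate only).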

| Step | Loss |
| --- | --- |
| 0 | 0.3989 |
| 1,000 | 0.2211 |
| 2,000 | 0.1610 |
| 3,000 | 0.1248 |
| 4,000 | 0.1124 |
| 5,000 | 0.1033 |
| 6,000 | 0.0987 |

How did it get a loss of 0.39 at step 0? We have no clue.

```python
import torch
from modeling import load_qkvae, reconstruction_psnr
from inference import load_image
from png_io import save_png

model = load_qkvae("model.safetensors")            # 1.06M params, ~4 MB

image = load_image("examples/cityscape.png")       # (3, 96, 96) in [-1, 1]

with torch.no_grad():
    indices = model.encode(image.unsqueeze(0))      # (1, 2304) int64, range 0..15359
    recon   = model.decode(indices)[0]              # (3, 96, 96) in [-1, 1]

save_png(recon, "out.png")
print(f"PSNR: {reconstruction_psnr(image, recon):.2f} dB")
```

```shell
python inference.py path/to/image.png model.safetensors out.png
# image=path/to/image.png size=96 tokens=2304 codebook=15360 psnr=29.41dB -> out.png
```
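
Each integer code stands for one combination of the five FSQ levels (8, 8, 8, 6, 5). A common way to relate the two is a mixed-radix encoding, sketched below in plain Python. The digit order (first channel least significant) and the mapping of levels to evenly spaced values in [−1, 1] are assumptions for illustration; the layout used by `modeling.py` may differ, and the helper names here are hypothetical.

```python
from math import prod

LEVELS = (8, 8, 8, 6, 5)
CODEBOOK_SIZE = prod(LEVELS)  # 15,360


def code_to_digits(code: int) -> tuple:
    """Split a code id into one level index per latent channel (mixed radix)."""
    digits = []
    for n in LEVELS:
        digits.append(code % n)
        code //= n
    return tuple(digits)


def digits_to_code(digits: tuple) -> int:
    """Inverse of code_to_digits."""
    code, base = 0, 1
    for d, n in zip(digits, LEVELS):
        code += d * base
        base *= n
    return code


def digits_to_vector(digits: tuple) -> tuple:
    """Map level index d in 0..n-1 to an evenly spaced value in [-1, 1] (assumed)."""
    return tuple(2 * d / (n - 1) - 1 for d, n in zip(digits, LEVELS))


digits = code_to_digits(15_359)
print(digits, digits_to_code(digits), digits_to_vector(digits))
```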

The 2,304 code tokens are just integers in 0..15,359. To make a language model see images, append the 15,360 code ids plus 3 special ids (<img_start>, <img_end>, <img_newrow>) to its vocabulary, project each code's dequantized FSQ vector into a learnable embedding, and add a small amount of training data mixing image and text tokens. The result is a single model that handles both modalities. No vision tower, no cross-attention.
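
One possible id layout for the extended vocabulary is sketched below. The text vocabulary size of 4,096, the placement of the three special ids after the code ids, the row-major token order, and the choice to emit `<img_newrow>` after every row are all assumptions made for illustration; the helper `image_token_ids` is hypothetical and not part of this repository.

```python
GRID = 48                 # 48 x 48 tokens per image
CODEBOOK_SIZE = 15_360
NUM_SPECIAL = 3           # <img_start>, <img_end>, <img_newrow>


def image_token_ids(indices, text_vocab_size):
    """Turn 2,304 code indices (row-major, assumed) into language-model token ids."""
    if len(indices) != GRID * GRID:
        raise ValueError("expected 2,304 code indices")
    code_offset = text_vocab_size               # code k becomes id text_vocab_size + k
    img_start = text_vocab_size + CODEBOOK_SIZE
    img_end = img_start + 1
    img_newrow = img_start + 2

    seq = [img_start]
    for row in range(GRID):
        row_codes = indices[row * GRID:(row + 1) * GRID]
        seq.extend(code_offset + int(code) for code in row_codes)
        seq.append(img_newrow)                  # row separator after every row (assumed)
    seq.append(img_end)
    return seq


text_vocab_size = 4_096   # hypothetical small-LM vocabulary
seq = image_token_ids(list(range(GRID * GRID)), text_vocab_size)
print(len(seq), text_vocab_size + CODEBOOK_SIZE + NUM_SPECIAL)  # 2354 19459
```

With this layout an image costs 2,354 positions in the context window, which is worth keeping in mind when choosing the language model's sequence length.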


| File | Description |
| --- | --- |
| README.md | This card |
| config.json | Architecture spec and reported metrics |
| model.safetensors | 1.06M trainable parameters, fp32, ~4.1 MB |
| modeling.py | Self-contained QKVAE, FSQ, load_qkvae |
| inference.py | CLI + helpers for image reconstruction |
| png_io.py | stdlib PNG reader/writer (no Pillow or torchvision) |
| requirements.txt | torch, safetensors. Nothing else. |
| examples/ | Sample images (user-supplied, see examples/README.md) |
| sample_grid.png | 16×2 grid of STL-10 originals and 1m reconstructions |

  • Fixed resolution. Native 96×96. Other resolutions require resizing; aspect ratio is not preserved.
  • Fixed vocabulary. 15,360 FSQ codes cover STL-10 at this scale. They are not enough to represent ImageNet, faces, or text legibly.
  • Distribution. Trained on STL-10 train (5,000 images). Faces, text, architecture, and diagrams will come out looking like STL-10 subjects. This is the expected behaviour of a 1M autoencoder.
  • No perceptual loss. L1 + MSE + edge are honest, blunt metrics. Larger models trained with LPIPS or DISTS produce more pleasing results. We skipped those to keep a clean ablation of scale.
  • No FID / IS. Evaluation is PSNR + SSIM + visual inspection. The grid above is the only qualitative evidence.
  • Not for production. Research artifact.

@misc{qkvae1m2026,
  author    = {Glint Research},
  title     = {QKVAE-1m: A 1.06M-Parameter Quantizing Autoencoder for 96x96 Image Tokens},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Glint-Research/QKVAE-1m}
}

Built by Glint Research. Small models trying their best since 2026.
