QKVAE-1m is a 1.06 million parameter quantizing convolutional autoencoder that compresses 96×96 RGB images into a 48×48 grid of 2,304 discrete visual tokens. The tokens come from a 15,360-code finite-scalar-quantization (FSQ) codebook. The point is to graft them onto a small language model's vocabulary as ordinary integer ids, so a 1M language model can see images without an external encoder, adapter, or vision tower.
We are actively working to make this way way way more efficient. The 1m here is a checkpoint on the way to a much smaller, much faster version.
Notice
QKVAE-1m is a research prototype, trained on STL-10 (96×96 natural images of animals, vehicles, and similar subjects). Out-of-distribution inputs such as text, faces, diagrams, very large structures, and very small structures will reconstruct with the characteristic STL-10 "vibe." Use accordingly, and pair it with a real language model before drawing strong conclusions.
| Property | Value |
|---|---|
| Parameters | 1,063,304 trainable (1.06M) |
| Architecture | CNN encoder + FSQ + CNN decoder |
| Image Size | 96 × 96 × 3 (RGB) |
| Tokens per Image | 2,304 (48 × 48 grid) |
| Codebook Size | 15,360 (8 × 8 × 8 × 6 × 5 levels) |
| Latent Dim | 5 |
| Compression | 6.9× (27,648 → ~4,001 bytes) |
| Training Dataset | STL-10 train (5,000 images, 96×96) |
| Test Dataset | STL-10 test (8,000 images, 96×96) |
| PSNR (test) | 32.61 dB |
| SSIM (test) | 0.9743 |
| Training Steps | 6,000 |
| Training Time | ~1.3 minutes on a single RTX 5090 |
| Precision | bfloat16 autocast, fp32 weights |
| Output Range | tanh → clamped to [−1, 1] |
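The compression row can be sanity-checked from the other rows in the table. This is a minimal sketch, assuming the raw image is stored at one byte per channel value and each token is packed at the ideal log2(15,360) bits. The card does not say how its ~4,001-byte figure was computed, so expect agreement only to within a few bytes.

```python
import math

# Values taken from the summary table.
IMAGE_SIDE, CHANNELS = 96, 3
GRID_SIDE = 48
LEVELS = (8, 8, 8, 6, 5)

raw_bytes = IMAGE_SIDE * IMAGE_SIDE * CHANNELS   # one byte per channel value
tokens = GRID_SIDE * GRID_SIDE                   # one token per latent position
codebook_size = math.prod(LEVELS)                # product of per-channel levels
bits_per_token = math.log2(codebook_size)        # ideal (non-integer) bit width
token_bytes = tokens * bits_per_token / 8

print(raw_bytes, tokens, codebook_size)
print(f"{token_bytes:.0f} bytes, ratio {raw_bytes / token_bytes:.1f}x")
```

With a fixed 14 bits per token instead of the ideal width, the payload grows slightly but the ratio still rounds to 6.9×.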
The QKVAE was built to answer one question: how small can a quantizing autoencoder get before reconstruction quality collapses, and is the result still useful as a discrete image tokenizer for a tiny language model? For 96×96 RGB on STL-10 the answer is 1.06 million parameters at 32.61 dB PSNR, and the 2,304 code tokens slot into a small language model's vocabulary as ordinary integer ids, with no vision tower required.
The training loop is deliberately ordinary. AdamW with OneCycleLR. bfloat16 autocast. L1 + 0.25×MSE + 0.5×Sobel-edge loss. 6,000 steps at batch 32 on the STL-10 train split, evaluated on the held-out test split. No EMA, no GAN, no perceptual loss, no discriminator, no augmentations beyond the natural STL-10 diversity. The only thing that makes this work is the FSQ codebook being large enough (15,360) to cover the STL-10 manifold and small enough that the gradient is informative end-to-end.
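To make the codebook size concrete, here is a rough sketch of how five bounded latent channels with (8, 8, 8, 6, 5) levels collapse into one integer id in 0..15,359. It is not the quantizer in modeling.py: the tanh squashing, the uniform binning, and the channel ordering of the mixed-radix index are all assumptions made for illustration, and a trainable FSQ layer would round with a straight-through gradient instead of plain binning.

```python
import math

LEVELS = (8, 8, 8, 6, 5)  # per-channel level counts from the card


def quantize(latent):
    """Map 5 real-valued latents to per-channel digits in 0..L-1 (assumed binning)."""
    digits = []
    for z, n_levels in zip(latent, LEVELS):
        bounded = (math.tanh(z) + 1.0) / 2.0  # squash into [0, 1]
        digits.append(min(n_levels - 1, int(bounded * n_levels)))
    return digits


def digits_to_index(digits):
    """Pack the digits into a single mixed-radix integer (first channel most significant)."""
    index = 0
    for d, n_levels in zip(digits, LEVELS):
        index = index * n_levels + d
    return index


def index_to_digits(index):
    """Inverse of digits_to_index."""
    digits = []
    for n_levels in reversed(LEVELS):
        index, d = divmod(index, n_levels)
        digits.append(d)
    return digits[::-1]


print(digits_to_index(quantize([0.3, -1.2, 0.0, 2.0, -0.5])))
```

Because every combination of digits is a valid code, there is no learned codebook to look up and no codebook collapse to manage; the id is just arithmetic on the rounded latents.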
The 1m here is the first checkpoint. We are working toward a smaller architecture, fewer codebook entries, less compute per forward pass, and lower token counts per image. The 1m is published because it works today, and because every smaller version we try will be measured against it.
| Source | Path | Notes |
|---|---|---|
| Glint Research logo | logo.png | Text/edge content; expect blurry typography. |
| Futuristic cityscape | cityscape.png | Architectural structure; expect pastel impressionism. |
| Natural landscape | landscape.png | Outdoor texture; closest to the training distribution. |
Glint Research logo. Out-of-distribution text and sharp edges. Reconstruction reflects the STL-10 prior.
Futuristic cityscape. Out-of-distribution architecture and high-frequency structure.
Natural landscape. In-distribution texture; expect the highest fidelity of the three.
Yes, that's a Portal 2 reference.
| Parameter | Value |
|---|---|
| Architecture | Quantizing Convolutional Autoencoder |
| Stem | 3 → 104 channels, 3×3 conv, padding 1 |
| Encoder Downsample | 1× 4×4 stride-2 conv (96 → 48) |
| Encoder ResBlocks | 2 × (GroupNorm → SiLU → 3×3 conv → GroupNorm → SiLU → 3×3 conv + residual), 104 channels |
| Encoder Head | GroupNorm → SiLU → 1×1 conv to 5 latent channels |
| Quantizer | Finite Scalar Quantization, levels (8, 8, 8, 6, 5) |
| Codebook Size | 15,360 |
| Decoder Input | 3×3 conv 5 → 104 |
| Decoder ResBlocks | 2 × ResBlock(104) |
| Decoder Upsample | 1× nearest-neighbour 2× → 3×3 conv 104 → 104 → SiLU |
| Decoder Head | GroupNorm → SiLU → 3×3 conv 104 → 3, tanh |
| Normalization | GroupNorm (8 groups, 104 channels) |
| Activation | SiLU |
| Output Range | tanh, no clamping outside the model |
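The layer table is enough to reproduce the parameter count by hand. The sketch below assumes every convolution has a bias, every GroupNorm has a per-channel scale and shift, the downsample conv keeps 104 channels, and the FSQ layer has no learnable parameters. None of this is read from modeling.py, but under those assumptions the total matches the reported 1,063,304.

```python
def conv_params(c_in, c_out, k):
    """Weights plus bias for a k x k convolution."""
    return c_in * c_out * k * k + c_out


def groupnorm_params(channels):
    """Per-channel scale and shift."""
    return 2 * channels


C, LATENT = 104, 5

# GroupNorm -> SiLU -> 3x3 conv, twice, with a parameter-free residual add.
resblock = 2 * (groupnorm_params(C) + conv_params(C, C, 3))

encoder = (
    conv_params(3, C, 3)                                 # stem
    + conv_params(C, C, 4)                               # 4x4 stride-2 downsample
    + 2 * resblock
    + groupnorm_params(C) + conv_params(C, LATENT, 1)    # head to 5 latent channels
)
decoder = (
    conv_params(LATENT, C, 3)                            # decoder input conv
    + 2 * resblock
    + conv_params(C, C, 3)                               # conv after nearest-neighbour upsample
    + groupnorm_params(C) + conv_params(C, 3, 3)         # head to RGB
)
total = encoder + decoder
print(encoder, decoder, total)
```

The four ResBlocks account for roughly three quarters of the total, so they are the obvious place to look when shrinking the model further.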
| Parameter | Value |
|---|---|
| Dataset | STL-10 train, 96×96, 5,000 images |
| Test Set | STL-10 test, 96×96, 8,000 images |
| Augmentation | None (dataset is naturally diverse at 96×96) |
| Steps | 6,000 |
| Batch Size | 32 |
| Optimizer | AdamW, β=(0.9, 0.95), weight decay 1e-4 |
| Learning Rate | 3e-4 peak, OneCycleLR, 5% warmup, cosine decay |
| Loss | L1 + 0.25 × MSE + 0.5 × L1(∇Sobel(recon), ∇Sobel(target)) |
| Precision | bfloat16 autocast, fp32 reductions |
| Gradient Clip | 1.0 |
| Hardware | NVIDIA RTX 5090, 32 GB |
| Wall Time | 1.3 minutes (78 seconds) |
| Throughput | ~77 steps/second |
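The loss row can be read as three terms on the reconstruction: a mean absolute error, a mean squared error weighted by 0.25, and an L1 distance between Sobel responses weighted by 0.5. Below is a dependency-free sketch on a single-channel image stored as nested lists. The padding mode ("valid"), the unnormalised Sobel kernels, and the choice to compare x and y responses separately are assumptions; the training code may differ on each, and `sketch_loss` is a hypothetical name.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation on a 2D list of floats."""
    h, w = len(img), len(img[0])
    return [
        [sum(kernel[i][j] * img[y + i][x + j] for i in range(3) for j in range(3))
         for x in range(w - 2)]
        for y in range(h - 2)
    ]


def flatten(img):
    return [v for row in img for v in row]


def mean_abs_diff(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_sq_diff(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def edges(img):
    """Concatenated horizontal and vertical Sobel responses."""
    return flatten(filter3x3(img, SOBEL_X)) + flatten(filter3x3(img, SOBEL_Y))


def sketch_loss(recon, target):
    """L1 + 0.25 * MSE + 0.5 * L1 on Sobel responses (weights from the table)."""
    l1 = mean_abs_diff(flatten(recon), flatten(target))
    mse = mean_sq_diff(flatten(recon), flatten(target))
    edge = mean_abs_diff(edges(recon), edges(target))
    return l1 + 0.25 * mse + 0.5 * edge


flat = [[0.5] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
print(sketch_loss(flat, zeros))
```

A constant offset like the one in the example is penalised only by the L1 and MSE terms, since a flat image has no Sobel response; the edge term only contributes when the reconstruction gets boundaries wrong.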
Loss was logged throughout training. Selected values are reported below. The encoder, FSQ, and decoder train jointly from step 0 with no staged unfreezing; the only warmup is the 5% learning-rate warmup in the schedule above.
| Step | Loss |
|---|---|
| 0 | 0.3989 |
| 1,000 | 0.2211 |
| 2,000 | 0.1610 |
| 3,000 | 0.1248 |
| 4,000 | 0.1124 |
| 5,000 | 0.1033 |
| 6,000 | 0.0987 |
How did it get a loss of 0.39 at step 0? We have no clue.
```python
import torch

from modeling import load_qkvae, reconstruction_psnr
from inference import load_image
from png_io import save_png

model = load_qkvae("model.safetensors")       # 1.06M params, ~4 MB
image = load_image("examples/cityscape.png")  # (3, 96, 96) in [-1, 1]

with torch.no_grad():
    indices = model.encode(image.unsqueeze(0))  # (1, 2304) int64, range 0..15359
    recon = model.decode(indices)[0]            # (3, 96, 96) in [-1, 1]

save_png(recon, "out.png")
print(f"PSNR: {reconstruction_psnr(image, recon):.2f} dB")
```

```bash
python inference.py path/to/image.png model.safetensors out.png
# image=path/to/image.png size=96 tokens=2304 codebook=15360 psnr=29.41dB -> out.png
```
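For readers comparing the PSNR numbers against other tokenizers, the following sketch shows the standard definition applied to pixel values in [-1, 1], where the data range is 2. It is an assumption that `reconstruction_psnr` follows this convention; computing on values rescaled to [0, 1] with a data range of 1 gives the same result.

```python
import math


def psnr(a, b, data_range=2.0):
    """Peak signal-to-noise ratio in dB between two flat sequences of pixel values."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(data_range ** 2 / mse)


original = [0.0, 0.0, 0.0, 0.0]
shifted = [0.2, 0.2, 0.2, 0.2]
print(f"{psnr(original, shifted):.2f} dB")
```

A uniform error of 0.2 on the [-1, 1] scale is 10% of the range, which is why the example comes out at 20 dB.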
The 2,304 code tokens are just integers in 0..15,359. To make a language model see images, append the 15,360 code ids plus 3 special ids (<img_start>, <img_end>, <img_newrow>) to its vocabulary, project each code's dequantized FSQ vector into a learnable embedding, and add a small amount of training data mixing image and text tokens. The result is a single model that handles both modalities. No vision tower, no cross-attention.
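One way to lay out the combined vocabulary is sketched below. The base vocabulary size, the order in which ids are appended, and the choice to emit `<img_newrow>` after every row including the last are all hypothetical; the card only specifies the 15,360 code ids and the three special tokens. The embedding projection of dequantized FSQ vectors is not shown.

```python
CODEBOOK_SIZE = 15_360
GRID = 48
BASE_VOCAB = 8_192  # hypothetical size of the text-only vocabulary

# Assumed layout: text ids, then image code ids, then the three special ids.
IMG_START = BASE_VOCAB + CODEBOOK_SIZE
IMG_END = IMG_START + 1
IMG_NEWROW = IMG_START + 2
NEW_VOCAB = BASE_VOCAB + CODEBOOK_SIZE + 3


def image_to_token_ids(codes):
    """Turn 2,304 row-major FSQ codes into language-model token ids with row markers."""
    if len(codes) != GRID * GRID:
        raise ValueError(f"expected {GRID * GRID} codes, got {len(codes)}")
    ids = [IMG_START]
    for row in range(GRID):
        row_codes = codes[row * GRID:(row + 1) * GRID]
        ids.extend(BASE_VOCAB + code for code in row_codes)  # shift past text ids
        ids.append(IMG_NEWROW)
    ids.append(IMG_END)
    return ids


ids = image_to_token_ids([0] * (GRID * GRID))
print(len(ids), NEW_VOCAB)
```

Under this layout one image costs 2,354 positions of context (2,304 codes, 48 row markers, and the two delimiters), which is worth budgeting for when choosing the language model's sequence length.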
| File | Description |
|---|---|
| README.md | This card |
| config.json | Architecture spec and reported metrics |
| model.safetensors | 1.06M trainable parameters, fp32, ~4.1 MB |
| modeling.py | Self-contained QKVAE, FSQ, load_qkvae |
| inference.py | CLI + helpers for image reconstruction |
| png_io.py | stdlib PNG reader/writer (no Pillow or torchvision) |
| requirements.txt | torch, safetensors. Nothing else. |
| examples/ | Sample images (user-supplied, see examples/README.md) |
| sample_grid.png | 16×2 grid of STL-10 originals and 1m reconstructions |
- Fixed resolution. Native 96×96. Other resolutions require resizing; aspect ratio is not preserved.
- Fixed vocabulary. 15,360 FSQ codes cover STL-10 at this scale. They are not enough to represent ImageNet, faces, or legible text.
- Distribution. Trained on STL-10 train (5,000 images). Faces, text, architecture, and diagrams will come out looking like STL-10 subjects. This is the expected behaviour of a 1M autoencoder.
- No perceptual loss. L1, MSE, and edge losses are honest, blunt objectives. Larger models trained with LPIPS or DISTS produce more pleasing results. We skipped those to keep a clean ablation of scale.
- No FID / IS. Evaluation is PSNR + SSIM + visual inspection. The grid above is the only qualitative evidence.
- Not for production. Research artifact.
@misc{qkvae1m2026,
author = {Glint Research},
title = {QKVAE-1m: A 1.06M-Parameter Quantizing Autoencoder for 96x96 Image Tokens},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/Glint-Research/QKVAE-1m}
}
Built by Glint Research. Small models trying their best since 2026.
