QKVAE-1m is a 1.06-million-parameter quantizing convolutional autoencoder that compresses 96×96 RGB images into a 48×48 grid of 2,304 discrete visual tokens. The tokens come from a 15,360-code finite-scalar-quantization (FSQ) codebook. The point is to graft them onto a small language model's vocabulary as ordinary integer ids, so a 1M-parameter language model can see images without an external encoder, adapter, or vision tower.

We are actively working to make this far more efficient. The 1m here is a checkpoint on the way to a much smaller, much faster version.

Notice

QKVAE-1m is a research prototype. It was trained on STL-10 (96×96 natural images of animals, vehicles, and similar subjects), so out-of-distribution inputs such as text, faces, diagrams, very large structures, and very small structures will reconstruct with the characteristic STL-10 "vibe." Use accordingly. Pair with a real language model before drawing strong conclusions.

Property	Value
Parameters	1,063,304 trainable (1.06M)
Architecture	CNN encoder + FSQ + CNN decoder
Image Size	96 × 96 × 3 (RGB)
Tokens per Image	2,304 (48 × 48 grid)
Codebook Size	15,360 (8 × 8 × 8 × 6 × 5 levels)
Latent Dim	5
Compression	6.9× (27,648 → ~4,001 bytes)
Training Dataset	STL-10 train (5,000 images, 96×96)
Test Dataset	STL-10 test (8,000 images, 96×96)
PSNR (test)	32.61 dB
SSIM (test)	0.9743
Training Steps	6,000
Training Time	~1.3 minutes on a single RTX 5090
Precision	bfloat16 autocast, fp32 weights
Output Range	tanh, bounded to [−1, 1]
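
The compression figure in the table can be reproduced with a few lines of arithmetic. This is a sketch that assumes 8-bit RGB input and an ideal log2(15,360) ≈ 13.9 bits per token with no container overhead. The exact packing behind the ~4,001-byte figure is not stated in this card, so the result lands within a few bytes of it rather than matching it exactly.

```python
import math

HEIGHT, WIDTH, CHANNELS = 96, 96, 3
TOKENS_PER_IMAGE = 48 * 48            # 2,304 tokens
CODEBOOK_SIZE = 8 * 8 * 8 * 6 * 5     # 15,360 codes

raw_bytes = HEIGHT * WIDTH * CHANNELS              # one byte per channel
bits_per_token = math.log2(CODEBOOK_SIZE)          # assumption: ideal packing
token_bytes = TOKENS_PER_IMAGE * bits_per_token / 8
ratio = raw_bytes / token_bytes

print(f"raw={raw_bytes} B, tokens~{token_bytes:.0f} B, ratio~{ratio:.1f}x")
```

Storing each token in a fixed 14 bits or in a 16-bit integer would give a slightly lower ratio; the number above is the best case for this codebook size.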

The QKVAE was built to answer one question: how small can a quantizing autoencoder get before reconstruction quality collapses, and is the result still useful as a discrete image tokenizer for a tiny language model? For 96×96 RGB on STL-10 the answer is 1.06 million parameters at 32.61 dB PSNR, and the 2,304 code tokens slot into a small language model's vocabulary as ordinary integer ids, with no vision tower required.

The training loop is deliberately ordinary. AdamW with OneCycleLR. bfloat16 autocast. L1 + 0.25×MSE + 0.5×Sobel-edge loss. 6,000 steps at batch 32 on the STL-10 train split, evaluated on the held-out test split. No EMA, no GAN, no perceptual loss, no discriminator, no augmentations beyond the natural STL-10 diversity. The only thing that makes this work is the FSQ codebook being large enough (15,360) to cover the STL-10 manifold and small enough that the gradient is informative end-to-end.

The 1m here is the first checkpoint. We are actively working to make this far more efficient: a smaller architecture, fewer codebook entries, less compute per forward pass, and lower token counts per image. The 1m is published because it works today, and because every smaller version we try will be measured against it.

Source	Path	Notes
Glint Research logo	`logo.png`	Text/edge content; expect blurry typography.
Futuristic cityscape	`cityscape.png`	Architectural structure; expect pastel impressionism.
Natural landscape	`landscape.png`	Outdoor texture; closest to the training distribution.

Glint Research logo. Out-of-distribution text and sharp edges. Reconstruction reflects the STL-10 prior.

Futuristic cityscape. Out-of-distribution architecture and high-frequency structure.

Natural landscape. In-distribution texture; expect the highest fidelity of the three. Yes, that's a Portal 2 reference.

Parameter	Value
Architecture	Quantizing Convolutional Autoencoder
Stem	3 → 104 channels, 3×3 conv, padding 1
Encoder Downsample	1× 4×4 stride-2 conv (96 → 48)
Encoder ResBlocks	2 × (GroupNorm → SiLU → 3×3 conv → GroupNorm → SiLU → 3×3 conv + residual), 104 channels
Encoder Head	GroupNorm → SiLU → 1×1 conv to 5 latent channels
Quantizer	Finite Scalar Quantization, levels (8, 8, 8, 6, 5)
Codebook Size	15,360
Decoder Input	3×3 conv 5 → 104
Decoder ResBlocks	2 × ResBlock(104)
Decoder Upsample	1× nearest-neighbour 2× → 3×3 conv 104 → 104 → SiLU
Decoder Head	GroupNorm → SiLU → 3×3 conv 104 → 3, tanh
Normalization	GroupNorm (8 groups, 104 channels)
Activation	SiLU
Output Range	tanh, no clamping outside the model
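
The parameter count can be cross-checked against the architecture table by hand. The sketch below makes a few assumptions that the table does not state outright: every convolution has a bias, the downsample convolution keeps 104 channels, every GroupNorm has a learnable scale and shift, and the FSQ layer has no trainable parameters. Under those assumptions the layers sum to the reported 1,063,304.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus bias for a 2-D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out

def group_norm_params(channels):
    """Per-channel scale and shift."""
    return 2 * channels

def res_block_params(channels):
    """Two (GroupNorm -> SiLU -> 3x3 conv) stages; the residual add is free."""
    return 2 * (group_norm_params(channels) + conv_params(channels, channels, 3))

WIDTH, LATENT = 104, 5

encoder = (
    conv_params(3, WIDTH, 3)                                     # stem
    + conv_params(WIDTH, WIDTH, 4)                               # 4x4 stride-2 downsample
    + 2 * res_block_params(WIDTH)                                # encoder ResBlocks
    + group_norm_params(WIDTH) + conv_params(WIDTH, LATENT, 1)   # encoder head
)
decoder = (
    conv_params(LATENT, WIDTH, 3)                                # decoder input
    + 2 * res_block_params(WIDTH)                                # decoder ResBlocks
    + conv_params(WIDTH, WIDTH, 3)                               # conv after 2x upsample
    + group_norm_params(WIDTH) + conv_params(WIDTH, 3, 3)        # decoder head
)
total = encoder + decoder
print(f"encoder={encoder:,} decoder={decoder:,} total={total:,}")
```

Most of the budget sits in the 104-channel 3×3 and 4×4 convolutions, so channel width is the main lever for any smaller variant.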

Parameter	Value
Dataset	STL-10 train, 96×96, 5,000 images
Test Set	STL-10 test, 96×96, 8,000 images
Augmentation	None (dataset is naturally diverse at 96×96)
Steps	6,000
Batch Size	32
Optimizer	AdamW, β=(0.9, 0.95), weight decay 1e-4
Learning Rate	3e-4 peak, OneCycleLR, 5% warmup, cosine decay
Loss	L1 + 0.25 × MSE + 0.5 × L1(∇Sobel(recon), ∇Sobel(target))
Precision	bfloat16 autocast, fp32 reductions
Gradient Clip	1.0
Hardware	NVIDIA RTX 5090, 32 GB
Wall Time	1.3 minutes (78 seconds)
Throughput	~77 steps/second
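
The loss row above combines three terms. As an illustration of how they fit together, here is a dependency-free sketch on a single grayscale channel stored as nested lists. The real training code presumably works on batched RGB tensors. The border handling (interior pixels only) and the equal averaging of the horizontal and vertical Sobel terms are assumptions, not details taken from this card.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel(img, kernel):
    """Apply a 3x3 kernel to the interior of a 2-D image (no padding)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
                for dy in range(3) for dx in range(3))
            for x in range(1, w - 1)
        ]
        for y in range(1, h - 1)
    ]

def mean_abs(a, b):
    """Mean absolute difference between two equally sized 2-D images."""
    diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def mean_sq(a, b):
    """Mean squared difference between two equally sized 2-D images."""
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def combined_loss(recon, target, w_mse=0.25, w_edge=0.5):
    """L1 + w_mse * MSE + w_edge * L1 between Sobel responses."""
    edge = 0.5 * (
        mean_abs(sobel(recon, SOBEL_X), sobel(target, SOBEL_X))
        + mean_abs(sobel(recon, SOBEL_Y), sobel(target, SOBEL_Y))
    )
    return mean_abs(recon, target) + w_mse * mean_sq(recon, target) + w_edge * edge
```

Because both Sobel kernels sum to zero, a uniform brightness shift leaves the edge term at zero and is penalized only by the L1 and MSE terms; the edge term reacts to blurred or displaced boundaries instead.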

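Loss was logged throughout training; selected values are reported below. The encoder, FSQ, and decoder train jointly from step 0, with no staged unfreezing and no separate warmup phase for any component (the 5% learning-rate warmup listed above is part of the OneCycleLR schedule).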

Step	Loss
0	0.3989
1,000	0.2211
2,000	0.1610
3,000	0.1248
4,000	0.1124
5,000	0.1033
6,000	0.0987

How did it reach a loss of 0.3989 at step 0? We have no clue.

import torch
from modeling import load_qkvae, reconstruction_psnr
from inference import load_image
from png_io import save_png

model = load_qkvae("model.safetensors")            # 1.06M params, ~4 MB

image = load_image("examples/cityscape.png")       # (3, 96, 96) in [-1, 1]

with torch.no_grad():
    indices = model.encode(image.unsqueeze(0))      # (1, 2304) int64, range 0..15359
    recon   = model.decode(indices)[0]              # (3, 96, 96) in [-1, 1]

save_png(recon, "out.png")
print(f"PSNR: {reconstruction_psnr(image, recon):.2f} dB")

python inference.py path/to/image.png model.safetensors out.png
# image=path/to/image.png size=96 tokens=2304 codebook=15360 psnr=29.41dB -> out.png
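
For reference, PSNR on images scaled to [−1, 1] can be computed as below. The implementation of `reconstruction_psnr` in `modeling.py` is not shown in this card, so this is a sketch of the standard definition, not a copy of it. The `data_range` default of 2.0 (max minus min for [−1, 1]) is an assumption, and the function takes flat sequences of pixel values for simplicity.

```python
import math

def psnr(reference, reconstruction, data_range=2.0):
    """PSNR in dB for two flat sequences of pixel values.

    data_range is the distance between the largest and smallest
    representable pixel value (2.0 for images in [-1, 1]).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of values")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Rescaling both images to [0, 1] or [0, 255] first gives the same value in dB as long as `data_range` is rescaled with them. Rounding the reconstruction to 8 bits before measuring can change the result slightly.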

The 2,304 code tokens are just integers in 0..15,359. To make a language model see images, append the 15,360 code ids plus 3 special ids (<img_start>, <img_end>, <img_newrow>) to its vocabulary, project each code's dequantized FSQ vector into a learnable embedding, and add a small amount of training data mixing image and text tokens. The result is a single model that handles both modalities. No vision tower, no cross-attention.
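
The id bookkeeping described above can be sketched in plain Python. Several details here are assumptions made for illustration and are not taken from `modeling.py`: the mixed-radix digit order (first FSQ dimension least significant), the even spacing of levels in [−1, 1], the placement of the three special ids directly after the code ids, and the choice to end every row, including the last, with `<img_newrow>`. The name `base_vocab_size` is hypothetical.

```python
LEVELS = (8, 8, 8, 6, 5)      # FSQ levels from the model card
GRID = 48                     # 48 x 48 tokens per image
CODEBOOK_SIZE = 15_360

def code_to_levels(code, levels=LEVELS):
    """Split a code id into one integer level per FSQ dimension (mixed radix).

    Assumption: the first dimension is the least significant digit.
    """
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"code {code} outside 0..{CODEBOOK_SIZE - 1}")
    digits = []
    for n in levels:
        code, digit = divmod(code, n)
        digits.append(digit)
    return digits

def levels_to_vector(digits, levels=LEVELS):
    """Map integer levels to evenly spaced values in [-1, 1] (assumed scaling)."""
    return [2.0 * d / (n - 1) - 1.0 for d, n in zip(digits, levels)]

def image_token_ids(codes, base_vocab_size):
    """Lay out one image as LM token ids: start, 48 rows each ending in newrow, end."""
    if len(codes) != GRID * GRID:
        raise ValueError("expected 2,304 codes")
    code_offset = base_vocab_size              # codes take the next 15,360 ids
    img_start = code_offset + CODEBOOK_SIZE    # assumed id for <img_start>
    img_end = img_start + 1                    # assumed id for <img_end>
    img_newrow = img_start + 2                 # assumed id for <img_newrow>
    ids = [img_start]
    for row in range(GRID):
        ids.extend(code_offset + c for c in codes[row * GRID:(row + 1) * GRID])
        ids.append(img_newrow)
    ids.append(img_end)
    return ids
```

Under this layout the vocabulary grows by 15,363 entries and each image occupies 2,354 positions in the context window. The 5-dimensional vector from `levels_to_vector` is what a learnable projection would map to the language model's embedding width.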

File	Description
`README.md`	This card
`config.json`	Architecture spec and reported metrics
`model.safetensors`	1.06M trainable parameters, fp32, ~4.1 MB
`modeling.py`	Self-contained `QKVAE`, `FSQ`, `load_qkvae`
`inference.py`	CLI + helpers for image reconstruction
`png_io.py`	stdlib PNG reader/writer (no Pillow or torchvision)
`requirements.txt`	`torch`, `safetensors`. Nothing else.
`examples/`	Sample images (user-supplied, see `examples/README.md`)
`sample_grid.png`	16×2 grid of STL-10 originals and 1m reconstructions

Fixed resolution. Native 96×96. Other resolutions require resizing; aspect ratio is not preserved.
Fixed vocabulary. 15,360 FSQ codes cover STL-10 at this scale. They are not enough to reconstruct ImageNet, faces, or text legibly.
Distribution. Trained on STL-10 train (5,000 images). Faces, text, architecture, and diagrams will come out looking like STL-10 subjects. This is the expected behaviour of a 1M autoencoder.
No perceptual loss. L1 + MSE + edge are honest, blunt metrics. Larger models trained with LPIPS or DISTS produce more pleasing results. We skipped those to keep a clean ablation of scale.
No FID / IS. Evaluation is PSNR + SSIM + visual inspection. The grid above is the only qualitative evidence.
Not for production. Research artifact.

@misc{qkvae1m2026,
  author    = {Glint Research},
  title     = {QKVAE-1m: A 1.06M-Parameter Quantizing Autoencoder for 96x96 Image Tokens},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Glint-Research/QKVAE-1m}
}

Built by Glint Research. Small models trying their best since 2026.
