ViT-GPT2 Image Captioning - ONNX

ONNX export of nlpconnect/vit-gpt2-image-captioning, a classic ViT encoder + GPT-2 decoder image captioner. ~240M parameters, trained on COCO captions.

Lightweight baseline captioner. Florence-2 is the better default for new projects (smaller, more capable, multi-task), but this one is useful when you need a vanilla "describe this image in one sentence" with minimal dependencies.

Converted artifact. Training credit: nlpconnect.

What this repo contains

config.json
generation_config.json
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json

encoder_model.onnx          # ViT image encoder
decoder_model.onnx          # GPT-2 autoregressive decoder

Total: ~1.1 GB at fp32. Load with optimum.onnxruntime.ORTModelForVision2Seq.
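
A minimal loading sketch, assuming a local copy of this repo's files and that optimum, transformers, torch, and Pillow are installed. `caption_image` and its parameters are hypothetical names used for illustration. Two assumptions are worth flagging: `use_cache=False` is passed because the listing above contains only decoder_model.onnx (no separate with-past decoder), and the image processor is loaded from the upstream nlpconnect repo because preprocessor_config.json does not appear in the listing.

```python
def caption_image(image_path, model_dir,
                  processor_source="nlpconnect/vit-gpt2-image-captioning",
                  max_new_tokens=None):
    """Return a one-sentence caption for the image at image_path.

    model_dir is a local directory (or Hub repo id) holding the ONNX files.
    """
    # Heavy imports stay local so this module can be imported without them.
    from PIL import Image
    from transformers import AutoImageProcessor, AutoTokenizer
    from optimum.onnxruntime import ORTModelForVision2Seq

    # Assumption: no decoder_with_past_model.onnx is present, so disable the KV cache.
    model = ORTModelForVision2Seq.from_pretrained(model_dir, use_cache=False)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Assumption: preprocessing settings are taken from the upstream checkpoint.
    processor = AutoImageProcessor.from_pretrained(processor_source)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # With max_new_tokens=None the defaults from generation_config.json apply.
    gen_kwargs = {} if max_new_tokens is None else {"max_new_tokens": max_new_tokens}
    output_ids = model.generate(pixel_values=pixel_values, **gen_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
```

Usage would look like `caption_image("photo.jpg", "./vit-gpt2-image-captioning-onnx")`, where both paths are placeholders.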

How it was produced

optimum-cli export onnx \
    --model nlpconnect/vit-gpt2-image-captioning \
    --task image-to-text \
    <output>

Conversion script: scripts/export-vit-gpt-image-captioning.ps1 in the DatumIngest repo.

Toolchain: optimum 1.24.0, transformers 4.45.2, torch 2.4.x.
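
After an export, a quick check that the output directory holds the files listed under "What this repo contains" can catch an incomplete conversion. The sketch below uses only the standard library; `missing_export_files` is a hypothetical helper, not part of the conversion script. The exporter may write additional files, so the check only reports missing ones.

```python
from pathlib import Path

# File names copied from the "What this repo contains" listing.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "encoder_model.onnx",
    "decoder_model.onnx",
]


def missing_export_files(export_dir):
    """Return the expected file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_export_files("<output>")` means every listed file is present.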

Inference notes

| Setting | Value |
|---|---|
| Input resolution | 224×224 (resized + center-cropped by preprocessor_config.json) |
| Output | English caption, ~16-token median length |
| Max tokens | 16 (default in generation_config.json) |
| Domain | COCO-style natural scenes |
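
To confirm the token limit that actually ships with a local copy, the default can be read straight from generation_config.json. This is a sketch under one assumption: the limit is stored under `max_new_tokens` or `max_length`, the two standard transformers generation fields. Which of the two this file uses is not stated here. `configured_token_limit` is a hypothetical helper.

```python
import json
from pathlib import Path


def configured_token_limit(model_dir):
    """Return (key, value) for the token limit in generation_config.json, or None."""
    config_path = Path(model_dir) / "generation_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the limit lives under one of these two standard keys.
    for key in ("max_new_tokens", "max_length"):
        if config.get(key) is not None:
            return key, config[key]
    return None
```

Note that the two keys count differently in transformers: `max_length` includes any prompt tokens, while `max_new_tokens` counts only generated ones.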

License

Apache-2.0, same as upstream. LICENSE file included.
