ViT-GPT2 Image Captioning (ONNX)
ONNX export of nlpconnect/vit-gpt2-image-captioning, a classic ViT encoder + GPT-2 decoder image captioner. ~240M parameters, trained on COCO captions.
Lightweight baseline captioner. Florence-2 is the better default for new projects (smaller, more capable, multi-task), but this model is useful when you need a plain "describe this image in one sentence" captioner with minimal dependencies.
Converted artifact. Training credit: nlpconnect.
What this repo contains
config.json
generation_config.json
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json
encoder_model.onnx # ViT image encoder
decoder_model.onnx # GPT-2 autoregressive decoder
Total: ~1.1 GB at fp32. Load with optimum.onnxruntime.ORTModelForVision2Seq.
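The loading step could look roughly like the sketch below. It assumes `optimum[onnxruntime]`, `transformers`, `torch`, and `Pillow` are installed. Two choices in it are assumptions rather than facts from this card: `use_cache=False` is passed because the file list shows only `decoder_model.onnx` (no `decoder_with_past_model.onnx`), and the image processor is fetched from the upstream checkpoint because `preprocessor_config.json` does not appear in the file list. The function name `caption_image` is hypothetical.

```python
REPO_ID = "Heliosoph/vit-gpt2-image-captioning-onnx"
# Assumption: preprocessing config is taken from the upstream checkpoint,
# since preprocessor_config.json is not in the file list above.
PROCESSOR_ID = "nlpconnect/vit-gpt2-image-captioning"


def caption_image(image_path: str, max_length: int = 16) -> str:
    """Return a one-sentence caption for the image stored at image_path."""
    # Heavy imports stay local so this module can be imported without them.
    from PIL import Image
    from transformers import AutoTokenizer, ViTImageProcessor
    from optimum.onnxruntime import ORTModelForVision2Seq

    processor = ViTImageProcessor.from_pretrained(PROCESSOR_ID)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    # Assumption: use_cache=False, because no decoder-with-past ONNX file
    # is listed for this repo.
    model = ORTModelForVision2Seq.from_pretrained(REPO_ID, use_cache=False)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
```

Calling `caption_image("example.jpg")` (a hypothetical local file) should return a short English sentence. Decoding without a key/value cache recomputes the full prefix at every step, which is tolerable at 16 tokens but would be slow for longer outputs.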
How it was produced
optimum-cli export onnx \
--model nlpconnect/vit-gpt2-image-captioning \
--task image-to-text \
<output>
Conversion script: scripts/export-vit-gpt-image-captioning.ps1 in the DatumIngest repo.
Toolchain: optimum 1.24.0, transformers 4.45.2, torch 2.4.x.
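To rerun the export from Python rather than a shell, the command above can be wrapped as in this minimal sketch. It is not the PowerShell script referenced above; `build_export_command` and `run_export` are hypothetical helper names, and `optimum-cli` is assumed to be on `PATH` with the toolchain versions listed above.

```python
import shlex
import subprocess

UPSTREAM_MODEL = "nlpconnect/vit-gpt2-image-captioning"


def build_export_command(output_dir: str) -> list[str]:
    """Assemble the optimum-cli invocation shown in this section."""
    return [
        "optimum-cli", "export", "onnx",
        "--model", UPSTREAM_MODEL,
        "--task", "image-to-text",
        output_dir,
    ]


def run_export(output_dir: str) -> None:
    """Run the export; raises CalledProcessError if optimum-cli fails."""
    cmd = build_export_command(output_dir)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the output directory contains spaces.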
Inference notes
| Setting | Value |
|---|---|
| Input resolution | 224×224 (resized + center-cropped by preprocessor_config.json) |
| Output | English caption, ~16-token median length |
| Max tokens | 16 (default in generation_config.json) |
| Domain | COCO-style natural scenes |
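The resize and center-crop geometry behind the 224×224 input can be illustrated with a small sketch. The actual policy is whatever `preprocessor_config.json` specifies; scaling the shorter side to 224 before cropping is an assumption made here for illustration, and `resize_and_crop_box` is a hypothetical helper.

```python
def resize_and_crop_box(width: int, height: int, size: int = 224):
    """Scale the shorter side to `size`, then center-crop a size x size square.

    Returns ((new_width, new_height), (left, top, right, bottom)).
    """
    scale = size / min(width, height)
    # max() guards against rounding the shorter side below `size`.
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)


# A 640x480 photo is resized to 299x224, then cropped horizontally.
print(resize_and_crop_box(640, 480))  # ((299, 224), (37, 0, 261, 224))
```

With Pillow, the result would be applied as `image.resize(new_size).crop(box)`. Content near the long edges of a wide or tall image is discarded by the crop, so subjects far off-center may not be captioned.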
License
Apache-2.0, same as upstream. LICENSE file included.