Florence-2 base-ft: ONNX (fp16)
ONNX export of microsoft/Florence-2-base-ft at fp16 precision. Florence-2 is Microsoft's unified vision-language model: a single checkpoint that handles captioning, OCR, object detection, region description, and grounded VQA via task-prompted decoding.
Converted artifact. Training credit: Microsoft Research.
What this repo contains
This export splits Florence-2 into four ONNX files, one per sub-model, alongside the usual config and tokenizer files. All four ONNX files are required at inference:
config.json
generation_config.json
preprocessor_config.json
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json
vision_encoder_fp16.onnx # DaViT image encoder
encoder_model_fp16.onnx # text encoder (BART-style)
decoder_model_fp16.onnx # autoregressive decoder
embed_tokens_fp16.onnx # token embedding lookup
Total: ~520 MB. Use with optimum.onnxruntime.ORTModelForVision2Seq or load the four sessions manually.
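For the manual route, a minimal sketch of opening the four sessions with ONNX Runtime might look like the following. The file names come from the listing above; the helper names (`missing_files`, `load_sessions`), the dictionary labels, and the default provider order are assumptions made for illustration, not part of this repo.

```python
from pathlib import Path

# File names as listed in this repo; the dict keys are labels chosen for this sketch.
ONNX_FILES = {
    "vision_encoder": "vision_encoder_fp16.onnx",
    "encoder": "encoder_model_fp16.onnx",
    "decoder": "decoder_model_fp16.onnx",
    "embed_tokens": "embed_tokens_fp16.onnx",
}


def missing_files(model_dir):
    """Return the required ONNX file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in ONNX_FILES.values() if not (root / name).is_file()]


def load_sessions(model_dir, providers=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Open one InferenceSession per sub-model (hypothetical helper)."""
    import onnxruntime as ort  # imported lazily so the file check works without it

    absent = missing_files(model_dir)
    if absent:
        raise FileNotFoundError(f"missing ONNX files in {model_dir}: {absent}")

    # Keep only the providers this onnxruntime build actually offers.
    available = ort.get_available_providers()
    chosen = [p for p in providers if p in available]

    root = Path(model_dir)
    return {
        label: ort.InferenceSession(str(root / name), providers=chosen)
        for label, name in ONNX_FILES.items()
    }
```

Calling `load_sessions("path/to/local/copy")` returns a dict of sessions keyed by sub-model. Wiring them into a generation loop (image features, prompt embeddings, decoder steps) is left out here because the exact input and output names depend on the export.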
How it was produced
optimum-cli export onnx \
--model microsoft/Florence-2-base-ft \
--task image-to-text \
--dtype fp16 \
--trust-remote-code \
<output>
Toolchain: optimum 1.24.0, transformers 4.45.2, torch 2.4.x. --trust-remote-code is required because Florence-2 ships custom modeling code (modeling_florence2.py) in the source repo.
Task prompts (selected)
| Task | Prompt |
|---|---|
| Caption | <CAPTION> |
| Detailed caption | <DETAILED_CAPTION> |
| More detailed caption | <MORE_DETAILED_CAPTION> |
| OCR | <OCR> |
| OCR with regions | <OCR_WITH_REGION> |
| Object detection | <OD> |
| Dense region caption | <DENSE_REGION_CAPTION> |
| Region proposal | <REGION_PROPOSAL> |
| Caption to phrase grounding | <CAPTION_TO_PHRASE_GROUNDING> |
| Referring expression segmentation | <REFERRING_EXPRESSION_SEGMENTATION> |
Full task-prompt list: see the upstream model card.
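The table above can be captured as a small lookup so that calling code refers to tasks by name. This is a sketch only: the dictionary keys and the `build_prompt` helper are hypothetical. It assumes, following the upstream sample code, that the grounding and referring-expression tasks take extra text appended directly after the prompt token, while the other listed tasks take the token alone.

```python
# Prompt tokens from the table above; the keys are names chosen for this sketch.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Assumption: these tasks expect user text right after the prompt token.
TASKS_WITH_TEXT_INPUT = {
    "caption_to_phrase_grounding",
    "referring_expression_segmentation",
}


def build_prompt(task, text_input=None):
    """Return the prompt string for a task, appending text_input where needed."""
    token = TASK_PROMPTS[task]  # raises KeyError for tasks not in the table
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"task {task!r} needs text_input")
        return token + text_input
    if text_input:
        raise ValueError(f"task {task!r} does not take text_input")
    return token
```

For example, `build_prompt("ocr")` yields `<OCR>`, and `build_prompt("caption_to_phrase_grounding", "a green car")` yields the token followed by the phrase to ground.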
When to pick fp16 vs quantized
This repo (fp16): GPU inference, maximum quality. ~520 MB.
Heliosoph/florence-2-base-ft-quantized-onnx: CPU / NPU / mobile, INT8 dynamic. ~270 MB, modestly degraded on text-heavy OCR.
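As a rough illustration of the choice above, the variant can be selected from the execution providers ONNX Runtime reports. The `pick_repo` helper and the set of providers treated as "GPU" are assumptions of this sketch; adjust them to the hardware you actually target.

```python
FP16_REPO = "Heliosoph/florence-2-base-ft-fp16-onnx"
INT8_REPO = "Heliosoph/florence-2-base-ft-quantized-onnx"

# Assumption: these ONNX Runtime providers indicate a GPU suited to fp16.
GPU_PROVIDERS = {"CUDAExecutionProvider", "TensorrtExecutionProvider"}


def pick_repo(available_providers):
    """Return the fp16 repo when a GPU provider is present, else the INT8 repo."""
    if GPU_PROVIDERS.intersection(available_providers):
        return FP16_REPO
    return INT8_REPO
```

In practice the argument would come from `onnxruntime.get_available_providers()`, so a CPU-only install falls through to the quantized repo.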
License
MIT, same as upstream. LICENSE file included.