Apple FastVLM Vision Encoder (1024)

ONNX vision encoders for FastVLM image preprocessing and embedding extraction, provided in three precision variants so you can trade off accuracy, speed, and file size for your deployment.

Model files

File	Precision	Size	Notes
`vision_encoder_fp32.onnx`	FP32	508 MB	Full precision, highest numerical fidelity
`vision_encoder_safe_fp16.onnx`	FP16	255 MB	Good balance of accuracy and size
`vision_encoder_int8.onnx`	INT8	129 MB	Smallest; typically the fastest on CPU, suited to constrained CPU deployment

Usage

Pick the variant that best fits your accuracy and performance needs, load it with ONNX Runtime, and run it on a preprocessed image tensor to produce image embeddings in a FastVLM-style CPU pipeline.

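import onnxruntime as ort

# Swap in whichever variant you need
session = ort.InferenceSession("vision_encoder_fp32.onnx", providers=["CPUExecutionProvider"])

# Look up the input name from the model rather than hard-coding it
input_name = session.get_inputs()[0].name
# Feed in your preprocessed image tensor (a NumPy array) and run inference:
# outputs = session.run(None, {input_name: pixel_values})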
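
The expected input name, shape, and dtype are not listed in this card, so one way to find them is to ask the session itself. The following is a minimal sketch, not the author's method: `resolve_shape`, `describe_io`, and `run_on_zeros` are hypothetical helper names, and the sketch assumes the model takes a single input and that NumPy is installed alongside ONNX Runtime.

```python
def resolve_shape(shape, default=1):
    """Replace symbolic or unknown dimensions (str or None) with a concrete size."""
    return [d if isinstance(d, int) else default for d in shape]


def describe_io(session):
    """Return (kind, name, shape, type) for every input and output of a session."""
    rows = [("input", a.name, a.shape, a.type) for a in session.get_inputs()]
    rows += [("output", a.name, a.shape, a.type) for a in session.get_outputs()]
    return rows


def run_on_zeros(model_path):
    """Run the encoder once on an all-zero tensor as a load-and-execute smoke test."""
    # Imported lazily so the helpers above work without these packages.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    first = session.get_inputs()[0]  # assumption: the model has a single input
    # Follow the dtype the model declares instead of guessing from the file name.
    dtype = np.float16 if first.type == "tensor(float16)" else np.float32
    # Assumption: a size of 1 is acceptable for any symbolic dimension. That
    # suits a batch axis but may be rejected for spatial axes.
    dummy = np.zeros(resolve_shape(first.shape), dtype=dtype)
    return session.run(None, {first.name: dummy})
```

For example, `for row in describe_io(session): print(row)` lists each tensor's declared name, shape, and element type, which tells you what your preprocessing step needs to produce.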

Choosing a variant

FP32: use when accuracy matters most and memory and file size are not constraints.
FP16: a middle ground. It is about half the size of FP32 with minimal accuracy loss, and is often faster on hardware with native FP16 support.
INT8: use for the smallest footprint and typically the fastest CPU inference, at some cost in numerical precision.
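
The guidance above can be turned into a small selection helper. This is only an illustrative sketch: `pick_variant` and `VARIANTS` are hypothetical names, the sizes are copied from the model files table, and the rule considers file size alone, not speed or hardware support.

```python
# (file name, precision, size in MB), ordered from highest to lowest precision.
VARIANTS = [
    ("vision_encoder_fp32.onnx", "FP32", 508),
    ("vision_encoder_safe_fp16.onnx", "FP16", 255),
    ("vision_encoder_int8.onnx", "INT8", 129),
]


def pick_variant(max_size_mb=None):
    """Return the highest-precision file that fits within the size budget.

    Because VARIANTS is ordered by precision, the first entry that fits
    is the most accurate option available for that budget.
    """
    for filename, _precision, size_mb in VARIANTS:
        if max_size_mb is None or size_mb <= max_size_mb:
            return filename
    raise ValueError(f"No variant fits within {max_size_mb} MB")


print(pick_variant())     # vision_encoder_fp32.onnx
print(pick_variant(300))  # vision_encoder_safe_fp16.onnx
print(pick_variant(150))  # vision_encoder_int8.onnx
```

In practice you would also benchmark the candidates on your target machine, since relative speed depends on the CPU and the ONNX Runtime build.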

Base model

Derived from apple/FastVLM-0.5B.
