EfficientSAM3 OpenVINO

Pre-exported OpenVINO IR models for EfficientSAM3, enabling fast zero-shot and few-shot object detection on Intel hardware. Two backbone variants are provided.

Backbones

| Backbone | Parameters | Notes |
|---|---|---|
| EfficientViT-B1 | ~50M | Default backbone, best overall accuracy |
| RepViT-M1.1 | ~30M | Lighter backbone, GPU-friendly (no NaN in FP16 vision encoder) |

Variants

| Variant | Description |
|---|---|
| openvino-fp16 | FP16 weights, highest accuracy baseline |
| openvino-int8_sym | INT8 symmetric weight-only compression (~47% smaller, ~10% faster) |
| openvino-int8_asym | INT8 asymmetric weight-only compression |
| openvino-int8_ptq_gpu | Full W8A8 PTQ with calibration (1.5-1.7x faster, GPU-targeted) |
| onnx | Original ONNX exports (auto-converted by OV runtime) |
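The difference between the symmetric and asymmetric INT8 variants can be illustrated with a small pure-Python sketch. This is not the code used to produce these models: it quantizes a single made-up weight list per tensor, whereas real weight compression tools typically work per channel. All function names here are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Signed integers in [-127, 127], one scale, zero point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize_symmetric(q, scale):
    return [v * scale for v in q]


def quantize_asymmetric(weights, bits=8):
    """Unsigned integers in [0, 255], one scale plus an integer zero point."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Clamp so rounding can never leave the representable range.
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up weights, deliberately skewed toward positive values.
weights = [-0.10, 0.05, 0.40, 0.90, 1.50]

q_sym, scale_sym = quantize_symmetric(weights)
q_asym, scale_asym, zp = quantize_asymmetric(weights)

err_sym = max_error(weights, dequantize_symmetric(q_sym, scale_sym))
err_asym = max_error(weights, dequantize_asymmetric(q_asym, scale_asym, zp))
print(f"symmetric:  scale={scale_sym:.5f} max_error={err_sym:.5f}")
print(f"asymmetric: scale={scale_asym:.5f} zero_point={zp} max_error={err_asym:.5f}")
```

When the weight range is lopsided, as in this toy input, the asymmetric scheme spends its 256 levels on the actual range and gets a finer scale, at the cost of storing and applying a zero point. For weights centered on zero the two schemes behave similarly.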

Benchmark Results (Classic Mode)

EfficientViT-B1

Intel B60 GPU (BMG-G31) + CPU decoder

| Dataset | FP16 | INT8_SYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|
| Potatoes (10 img) | 822 ms | 840 ms | 536 ms | 1.000 |
| Nuts (21 img) | 1531 ms | 1399 ms | 925 ms | 0.615 |
| Candies (12 img) | 829 ms | 765 ms | 543 ms | 0.994 |

Intel 12900K CPU

| Dataset | FP16 | INT8_SYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|
| Potatoes | 1095 ms | 1028 ms | 634 ms | 1.000 |
| Nuts | 1751 ms | 1610 ms | 1056 ms | 0.575 |
| Candies | 1083 ms | 1044 ms | 672 ms | 0.927 |

RepViT-M1.1

Intel B60 GPU (BMG-G31) + CPU decoder

| Dataset | FP16 | INT8_SYM | INT8_ASYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|---|
| Potatoes (10 img) | 852 ms | 801 ms | 766 ms | 533 ms | 1.000 |
| Nuts (21 img) | 1511 ms | 1449 ms | 1408 ms | 923 ms | 0.632 |
| Candies (12 img) | 839 ms | 770 ms | 768 ms | 518 ms | 0.994 |

Intel 12900K CPU

| Dataset | FP16 | INT8_SYM | INT8_ASYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|---|
| Potatoes | 1065 ms | 1048 ms | 1082 ms | 642 ms | 1.000 |
| Nuts | 1770 ms | 1671 ms | 1686 ms | 1056 ms | 0.575 |
| Candies | 1082 ms | 1035 ms | 1026 ms | 652 ms | 0.927 |

Note: GPU mode uses a hybrid configuration (vision/text/geometry encoders on GPU with f32 precision hint, prompt-decoder on CPU) to work around Intel GPU plugin numerical issues.
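The PTQ speedup implied by the benchmark figures above can be checked with a few lines of arithmetic. The sketch below copies the FP16 and INT8_PTQ latencies from the tables and divides them; the dictionary layout and names are only for illustration.

```python
# (backbone, device, dataset) -> (FP16 ms, INT8_PTQ ms), copied from the tables.
LATENCY_MS = {
    ("EfficientViT-B1", "B60 GPU", "Potatoes"): (822, 536),
    ("EfficientViT-B1", "B60 GPU", "Nuts"): (1531, 925),
    ("EfficientViT-B1", "B60 GPU", "Candies"): (829, 543),
    ("EfficientViT-B1", "12900K CPU", "Potatoes"): (1095, 634),
    ("EfficientViT-B1", "12900K CPU", "Nuts"): (1751, 1056),
    ("EfficientViT-B1", "12900K CPU", "Candies"): (1083, 672),
    ("RepViT-M1.1", "B60 GPU", "Potatoes"): (852, 533),
    ("RepViT-M1.1", "B60 GPU", "Nuts"): (1511, 923),
    ("RepViT-M1.1", "B60 GPU", "Candies"): (839, 518),
    ("RepViT-M1.1", "12900K CPU", "Potatoes"): (1065, 642),
    ("RepViT-M1.1", "12900K CPU", "Nuts"): (1770, 1056),
    ("RepViT-M1.1", "12900K CPU", "Candies"): (1082, 652),
}


def speedup(fp16_ms, ptq_ms):
    """How many times faster the PTQ variant is than FP16."""
    return fp16_ms / ptq_ms


speedups = {key: speedup(*pair) for key, pair in LATENCY_MS.items()}

for (backbone, device, dataset), value in speedups.items():
    print(f"{backbone:16s} {device:11s} {dataset:9s} {value:.2f}x")

print(f"range: {min(speedups.values()):.2f}x to {max(speedups.values()):.2f}x")
```

Every entry lands between roughly 1.5x and 1.7x, which matches the range quoted for the openvino-int8_ptq_gpu variant.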

Sub-models (5-model split)

Each backbone produces the same 5 sub-models:

| Sub-model | Purpose |
|---|---|
| vision-encoder | Backbone + FPN feature extraction |
| text-encoder | MobileCLIP-S1 + projection for text prompts |
| geometry-encoder | Classic box/point prompt encoding |
| geometry-encoder-exemplar | Visual exemplar prompt encoding |
| prompt-decoder | DETR encoder/decoder + box head + scoring |

Intel GPU Workarounds

When running on Intel GPUs (Arc/Xe/Battlemage), two workarounds are automatically applied:

  1. FP16 overflow: Some sub-models produce NaN/garbage in FP16 on GPU. Fixed by compiling with INFERENCE_PRECISION_HINT=f32.
  2. Decoder GPU numerical drift: Prompt-decoder logits drift vs CPU regardless of precision. Fixed by running decoder on CPU (~60 MB, minimal latency impact).

Both workarounds are transparent to users: the default settings select them whenever a GPU device is requested.
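For readers who compile the sub-models by hand, the two workarounds amount to a per-sub-model device and config choice. The following is a minimal sketch, not the loader shipped with these models. The `<name>.xml` file layout, the function names, applying the f32 hint to every GPU sub-model, and placing the exemplar encoder on the GPU are all assumptions made for illustration. `Core.compile_model` and the `INFERENCE_PRECISION_HINT` property are standard OpenVINO.

```python
from pathlib import Path

SUB_MODELS = [
    "vision-encoder",
    "text-encoder",
    "geometry-encoder",
    "geometry-encoder-exemplar",
    "prompt-decoder",
]


def device_plan(target):
    """Return {sub_model: (device, config)} for the requested target device."""
    plan = {}
    for name in SUB_MODELS:
        if target == "GPU" and name == "prompt-decoder":
            # Workaround 2: keep the decoder on CPU to avoid logit drift.
            plan[name] = ("CPU", {})
        elif target == "GPU":
            # Workaround 1: force f32 execution to avoid FP16 overflow.
            plan[name] = ("GPU", {"INFERENCE_PRECISION_HINT": "f32"})
        else:
            plan[name] = (target, {})
    return plan


def compile_all(model_dir, target="GPU"):
    """Compile every sub-model according to device_plan()."""
    import openvino as ov  # imported lazily so the plan can be inspected without it

    core = ov.Core()
    compiled = {}
    for name, (device, config) in device_plan(target).items():
        xml_path = Path(model_dir) / f"{name}.xml"  # hypothetical file layout
        compiled[name] = core.compile_model(str(xml_path), device, config)
    return compiled


for name, (device, config) in device_plan("GPU").items():
    print(f"{name:26s} -> {device} {config}")
```

Keeping the plan separate from the compile step makes it easy to print or override the placement of a single sub-model before any model is loaded.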

Requirements

  • OpenVINO >= 2025.3.0
  • NNCF >= 3.1.0 (for PTQ only)
  • transformers (for CLIP tokenizer)
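A quick way to confirm that an environment meets these minimums is to compare installed package versions against them. The sketch below is one possible check, assuming the PyPI distribution names `openvino` and `nncf`; the helper names are hypothetical.

```python
import re
from importlib import metadata

MINIMUMS = {"openvino": "2025.3.0", "nncf": "3.1.0"}


def parse_version(text):
    """Leading numeric components, padded to three: '2025.3' -> (2025, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed, minimum):
    """Numeric comparison, so 2025.10.0 correctly counts as newer than 2025.3.0."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return {package: True/False, or None when the package is not installed}."""
    report = {}
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None
            continue
        report[package] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    for package, ok in check_environment().items():
        status = {True: "ok", False: "too old", None: "not installed"}[ok]
        print(f"{package}: {status} (needs >= {MINIMUMS[package]})")
```

A missing `nncf` only matters when running PTQ; inference with the pre-exported models does not need it.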

License

Apache-2.0
