OpenWorldSAM: Zero-Shot Universal Image Segmentation

OpenWorldSAM extends SAM2 with language understanding to enable open-vocabulary instance, panoptic, semantic, and referring segmentation from arbitrary text prompts, without any task-specific fine-tuning.

This repository is an unofficial HuggingFace Hub mirror of the GinnyXiao/OpenWorldSAM checkpoint, uploaded to enable loading via transformers and the FiftyOne Model Zoo. All credit belongs to the original authors.

Attribution

Paper: "Extending SAM2 for Universal Image Segmentation with Language Prompts"
Fangxun Shu*, Yiwen Ye*, Jianhua Han, Jiwen Yu, Qize Yang, Xiao-Ping Zhang, Hang Xu, Bei Yu, Xiaodan Liang.
NeurIPS 2025 Spotlight · arXiv 2507.05427

Original code: GinnyXiao/OpenWorldSAM (Apache-2.0 License)

Checkpoint: ADE20K instance segmentation (openworld_sam_ade20k.pt) from the official Google Drive release

Usage

Via FiftyOne Model Zoo

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart", max_samples=5)
model = foz.load_zoo_model(
    "openworld-sam-ade20k-torch",
    class_names=["person", "car", "chair", "table", "sky", "tree"],
)
dataset.apply_model(model, label_field="owsam_pred")
session = fo.launch_app(dataset)

Standalone (trust_remote_code)

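import numpy as np
import torch
from transformers import AutoModel

# trust_remote_code=True runs the model code shipped in the repository.
model = AutoModel.from_pretrained(
    "neerajaabhyankar/openworld-sam", trust_remote_code=True
)
model.eval()

# your_pil_image is a PIL.Image you have already loaded.
arr = np.array(your_pil_image.convert("RGB"))  # HWC uint8 RGB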

batched_inputs = [{
    "image":             model.preprocess_image(arr),
    "evf_image":         model.preprocess_image_beit3(arr),
    "height":            arr.shape[0],
    "width":             arr.shape[1],
    "prompt":            ["person", "car", "tree"],
    "unique_categories": [0, 1, 2],
}]

with torch.no_grad():
    outputs = model(batched_inputs)

# outputs[0]["instances"]:
#   "masks"     : bool tensor [N, H, W]
#   "scores"    : float tensor [N]
#   "class_ids" : long tensor [N]
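
To turn the raw output into readable results, one option is to keep only confident instances and map each class id back to its prompt string. The sketch below is a minimal illustration, not part of the mirror's API: `decode_instances` and the 0.5 threshold are hypothetical, and it assumes each value in "class_ids" is one of the values in "unique_categories", which is aligned one-to-one with "prompt". It works on plain Python lists, so tensors would first be converted with `.tolist()`.

```python
def decode_instances(scores, class_ids, prompts, unique_categories, threshold=0.5):
    """Return (index, label, score) for instances scoring at or above threshold."""
    # Assumption: every class id is one of the values in unique_categories,
    # which is aligned one-to-one with the prompt list.
    id_to_label = dict(zip(unique_categories, prompts))
    kept = []
    for idx, (score, class_id) in enumerate(zip(scores, class_ids)):
        if score >= threshold:
            kept.append((idx, id_to_label[class_id], score))
    return kept


# Stand-in values. With real outputs, pass inst["scores"].tolist() and
# inst["class_ids"].tolist(), where inst = outputs[0]["instances"].
kept = decode_instances(
    scores=[0.91, 0.32, 0.77],
    class_ids=[0, 2, 1],
    prompts=["person", "car", "tree"],
    unique_categories=[0, 1, 2],
)
print(kept)
```

The returned index can be used to select the matching mask from "masks" along its first dimension.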

Architecture

| Component | Detail |
| --- | --- |
| Visual backbone | SAM2 Hiera-Large (frozen, 224M params) |
| Multimodal encoder | BEiT-3 Large (frozen, 675M params) |
| Trainable params | ~4.5M: projection MLP + positional tokens + 3-layer cross-attention |
| Total params | ~902M |
| Vocabulary | ADE20K-150 classes (default); any text at inference |
| SAM2 input | 1024×1024, normalised with ImageNet pixel stats |
| BEiT-3 input | 224×224, normalised to mean=0.5 std=0.5 |
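
The two input rows above describe different normalisations for the same image. The sketch below shows the per-channel arithmetic on a single pixel. It is illustrative only: `normalize_pixel` is a hypothetical helper, and the ImageNet constants are the commonly used values for pixels scaled to [0, 1], which this README does not confirm for the checkpoint. In practice `model.preprocess_image` and `model.preprocess_image_beit3` handle this step.

```python
# Assumed constants: the README only says "ImageNet pixel stats" and
# "mean=0.5 std=0.5". The ImageNet values below are the commonly used
# statistics for pixels scaled to [0, 1], not read from this checkpoint.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
BEIT3_MEAN = (0.5, 0.5, 0.5)
BEIT3_STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb_uint8, mean, std):
    """Scale one uint8 RGB pixel to [0, 1], then apply (x - mean) / std per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb_uint8, mean, std))


white = (255, 255, 255)
black = (0, 0, 0)

# With mean=0.5 and std=0.5 the [0, 1] range maps onto [-1, 1].
beit_white = normalize_pixel(white, BEIT3_MEAN, BEIT3_STD)
beit_black = normalize_pixel(black, BEIT3_MEAN, BEIT3_STD)

# ImageNet statistics give a different, channel-dependent range.
sam_white = normalize_pixel(white, IMAGENET_MEAN, IMAGENET_STD)
print(beit_white, beit_black, sam_white)
```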

Requirements

torch torchvision transformers safetensors timm einops

No detectron2 required: this mirror is self-contained.
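
Before loading the model, it can help to confirm that the packages listed above are importable. The following is a small standard-library sketch; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches the name listed here.

```python
import importlib.util

# Package names taken from the requirements line above.
REQUIRED = ["torch", "torchvision", "transformers", "safetensors", "timm", "einops"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```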

License

Apache-2.0 (same as original GinnyXiao/OpenWorldSAM)
