OpenWorldSAM — Zero-Shot Universal Image Segmentation

OpenWorldSAM extends SAM2 with language understanding to enable open-vocabulary instance, panoptic, semantic, and referring segmentation from arbitrary text prompts — without any task-specific fine-tuning.

This repository is an unofficial HuggingFace Hub mirror of the GinnyXiao/OpenWorldSAM checkpoint, uploaded to enable loading via transformers and the FiftyOne Model Zoo. All credit belongs to the original authors.

Attribution

Paper: "OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts"
Fangxun Shu*, Yiwen Ye*, Jianhua Han, Jiwen Yu, Qize Yang, Xiao-Ping Zhang, Hang Xu, Bei Yu, Xiaodan Liang.
NeurIPS 2025 Spotlight · arXiv 2507.05427

Original code: GinnyXiao/OpenWorldSAM — Apache-2.0 License

Checkpoint: ADE20K instance segmentation (openworld_sam_ade20k.pt) from the official Google Drive release

Usage

Via FiftyOne Model Zoo

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart", max_samples=5)
model = foz.load_zoo_model(
    "openworld-sam-ade20k-torch",
    class_names=["person", "car", "chair", "table", "sky", "tree"],
)
dataset.apply_model(model, label_field="owsam_pred")
session = fo.launch_app(dataset)

Standalone (`trust_remote_code`)

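import numpy as np
import torch
from PIL import Image
from transformers import AutoModel

# trust_remote_code=True runs the model code shipped in the Hub repository
model = AutoModel.from_pretrained(
    "neerajaabhyankar/openworld-sam", trust_remote_code=True
)
model.eval()

# Load any image as an HWC uint8 RGB array ("image.jpg" is a placeholder path)
arr = np.array(Image.open("image.jpg").convert("RGB"))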

batched_inputs = [{
    "image":             model.preprocess_image(arr),
    "evf_image":         model.preprocess_image_beit3(arr),
    "height":            arr.shape[0],
    "width":             arr.shape[1],
    "prompt":            ["person", "car", "tree"],
    "unique_categories": [0, 1, 2],
}]

with torch.no_grad():
    outputs = model(batched_inputs)

# outputs[0]["instances"]:
#   "masks"     — bool tensor [N, H, W]
#   "scores"    — float tensor [N]
#   "class_ids" — long tensor [N]
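
The output layout described above lends itself to simple post-processing. The following is a minimal sketch, not part of the original repository, that keeps instances above a score threshold and maps each `class_ids` entry back to its prompt string. It uses NumPy arrays as stand-ins for the tensors (a real tensor could be converted with `.cpu().numpy()`). It also assumes that `class_ids` index into the `prompt` list. The helper name `filter_instances` and the 0.5 threshold are illustrative choices, not values taken from the model.

```python
import numpy as np


def filter_instances(instances, prompts, score_thresh=0.5):
    """Keep instances scoring at least `score_thresh` (hypothetical helper).

    Assumes `instances` has the keys "masks" [N, H, W], "scores" [N] and
    "class_ids" [N], and that each class id indexes into `prompts`.
    """
    scores = np.asarray(instances["scores"])
    keep = scores >= score_thresh
    class_ids = np.asarray(instances["class_ids"])[keep]
    return {
        "masks": np.asarray(instances["masks"])[keep],
        "scores": scores[keep],
        "labels": [prompts[int(i)] for i in class_ids],
    }


# Synthetic stand-in for outputs[0]["instances"]: three empty 4x4 masks
demo = {
    "masks": np.zeros((3, 4, 4), dtype=bool),
    "scores": np.array([0.9, 0.2, 0.7]),
    "class_ids": np.array([0, 2, 1]),
}
kept = filter_instances(demo, ["person", "car", "tree"])
print(kept["labels"], kept["masks"].shape)
```

With the synthetic input, the low-scoring second instance is dropped and the remaining two are labelled from the prompt list.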

Architecture

Component	Detail
Visual backbone	SAM2 Hiera-Large (frozen, 224M params)
Multimodal encoder	BEiT-3 Large (frozen, 675M params)
Trainable params	~4.5M — projection MLP + positional tokens + 3-layer cross-attention
Total params	~902M
Vocabulary	ADE20K-150 classes (default); any text at inference
SAM2 input	1024×1024, normalised with ImageNet pixel stats
BEiT-3 input	224×224, normalised to mean=0.5 std=0.5
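
The two input rows above imply two separate normalisation paths for the same image. The sketch below illustrates the arithmetic in NumPy under stated assumptions: it uses the commonly published ImageNet channel mean and standard deviation on a 0-255 scale, and a crude nearest-neighbour resize as a stand-in for whatever interpolation `model.preprocess_image` and `model.preprocess_image_beit3` actually apply. The helper names are hypothetical. It is not a replacement for the model's own preprocessing.

```python
import numpy as np

# Widely used ImageNet channel statistics on a 0-255 scale (assumed here)
IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def resize_nearest(arr, size):
    """Nearest-neighbour resize of an HWC array to (size, size)."""
    h, w = arr.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return arr[rows][:, cols]


def sam2_input(arr):
    """HWC uint8 RGB -> CHW float32, 1024x1024, ImageNet-normalised."""
    x = resize_nearest(arr, 1024).astype(np.float32)
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return x.transpose(2, 0, 1)


def beit3_input(arr):
    """HWC uint8 RGB -> CHW float32, 224x224, scaled to [-1, 1]."""
    x = resize_nearest(arr, 224).astype(np.float32) / 255.0
    x = (x - 0.5) / 0.5
    return x.transpose(2, 0, 1)


# Random image standing in for a real photo
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
print(sam2_input(image).shape, beit3_input(image).shape)
```

Normalising with mean=0.5 and std=0.5 maps pixel values from [0, 1] to [-1, 1], whereas the ImageNet statistics centre each channel separately.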

Requirements

torch torchvision transformers safetensors timm einops

No detectron2 required — this mirror is self-contained.
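
As a quick sanity check before loading the model, the snippet below reports which of the listed packages cannot be found in the current environment. It is a small convenience sketch that uses only the standard library. The helper name `missing_packages` is illustrative and not part of this repository.

```python
import importlib.util

REQUIRED = ["torch", "torchvision", "transformers", "safetensors", "timm", "einops"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```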

License

Apache-2.0 (same as original GinnyXiao/OpenWorldSAM)

