OpenWorldSAM: Zero-Shot Universal Image Segmentation
OpenWorldSAM extends SAM2 with language understanding to enable open-vocabulary instance, panoptic, semantic, and referring segmentation from arbitrary text prompts, without any task-specific fine-tuning.
This repository is an unofficial Hugging Face Hub mirror of the GinnyXiao/OpenWorldSAM checkpoint, uploaded to enable loading via transformers and the FiftyOne Model Zoo.
All credit belongs to the original authors.
Attribution
Paper: "Extending SAM2 for Universal Image Segmentation with Language Prompts"
Fangxun Shu*, Yiwen Ye*, Jianhua Han, Jiwen Yu, Qize Yang, Xiao-Ping Zhang, Hang Xu, Bei Yu, Xiaodan Liang.
NeurIPS 2025 Spotlight · arXiv 2507.05427
Original code: GinnyXiao/OpenWorldSAM (Apache-2.0 License)
Checkpoint: ADE20K instance segmentation (openworld_sam_ade20k.pt) from the official Google Drive release
Usage
Via FiftyOne Model Zoo
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart", max_samples=5)
model = foz.load_zoo_model(
    "openworld-sam-ade20k-torch",
    class_names=["person", "car", "chair", "table", "sky", "tree"],
)
dataset.apply_model(model, label_field="owsam_pred")
session = fo.launch_app(dataset)
Standalone (trust_remote_code)
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "Voxel51/openworld-sam", trust_remote_code=True
)
model.eval()
import numpy as np
arr = np.array(your_pil_image)  # HWC uint8 RGB
batched_inputs = [{
    "image": model.preprocess_image(arr),
    "evf_image": model.preprocess_image_beit3(arr),
    "height": arr.shape[0],
    "width": arr.shape[1],
    "prompt": ["person", "car", "tree"],
    "unique_categories": [0, 1, 2],
}]
with torch.no_grad():
    outputs = model(batched_inputs)
# outputs[0]["instances"]:
#   "masks": bool tensor [N, H, W]
#   "scores": float tensor [N]
#   "class_ids": long tensor [N]
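The output structure above can be post-processed along the following lines. This is a minimal sketch, not part of the model's API: the helper name `summarise_instances` and the `score_threshold` default of 0.5 are hypothetical, and it assumes that each entry in `class_ids` indexes into the `prompt` list passed to the model. It works on NumPy arrays, so tensors on a GPU would need `.cpu()` first.

```python
import numpy as np

def summarise_instances(instances, prompts, score_threshold=0.5):
    """Keep instances at or above the threshold and attach prompt names."""
    masks = np.asarray(instances["masks"])          # [N, H, W] bool
    scores = np.asarray(instances["scores"])        # [N] float
    class_ids = np.asarray(instances["class_ids"])  # [N] int

    keep = scores >= score_threshold
    results = []
    for mask, score, class_id in zip(masks[keep], scores[keep], class_ids[keep]):
        results.append({
            # Assumption: class_ids index into the prompt list
            "label": prompts[int(class_id)],
            "score": float(score),
            "area": int(mask.sum()),  # mask area in pixels
        })
    # Highest-scoring instances first
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Synthetic stand-in for outputs[0]["instances"]: two 4x4 masks
demo = {
    "masks": np.array([np.eye(4, dtype=bool), np.ones((4, 4), dtype=bool)]),
    "scores": np.array([0.9, 0.2]),
    "class_ids": np.array([2, 0]),
}
print(summarise_instances(demo, ["person", "car", "tree"]))
```

With real model output, pass `outputs[0]["instances"]` converted to CPU arrays in place of `demo`.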
Architecture
| Component | Detail |
|---|---|
| Visual backbone | SAM2 Hiera-Large (frozen, 224M params) |
| Multimodal encoder | BEiT-3 Large (frozen, 675M params) |
| Trainable params | ~4.5M: projection MLP + positional tokens + 3-layer cross-attention |
| Total params | ~902M |
| Vocabulary | ADE20K-150 classes (default); any text at inference |
| SAM2 input | 1024×1024, normalised with ImageNet pixel stats |
| BEiT-3 input | 224×224, normalised to mean=0.5 std=0.5 |
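The two input rows of the table can be illustrated with a small NumPy sketch. It is not the model's own `preprocess_image` or `preprocess_image_beit3`: the helper names are hypothetical, nearest-neighbour resizing is a stand-in for whatever interpolation the model uses, and the ImageNet statistics are the standard 0-1 scale values, assumed to be what the table refers to.

```python
import numpy as np

# Standard ImageNet channel statistics on the 0-1 scale (assumed)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def resize_nearest(arr, size):
    """Nearest-neighbour resize of an HWC array to (size, size)."""
    h, w = arr.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return arr[rows][:, cols]

def to_normalised_chw(arr, size, mean, std):
    """Resize, scale to [0, 1], normalise per channel, return CHW float32."""
    resized = resize_nearest(arr, size).astype(np.float32) / 255.0
    normalised = (resized - mean) / std  # broadcasts over the channel axis
    return normalised.transpose(2, 0, 1)

arr = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB image
sam2_input = to_normalised_chw(arr, 1024, IMAGENET_MEAN, IMAGENET_STD)
beit3_input = to_normalised_chw(arr, 224, np.float32(0.5), np.float32(0.5))
print(sam2_input.shape, beit3_input.shape)
```

For actual inference, the model's own preprocessing methods should be preferred, since they define the exact resizing and scaling the checkpoint expects.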
Requirements
torch torchvision transformers safetensors timm einops
No detectron2 required; this mirror is self-contained.
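Before loading the model, it can help to confirm that the packages listed above are importable. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check only tests that each package can be found, not that its version is compatible.

```python
from importlib.util import find_spec

REQUIRED = ["torch", "torchvision", "transformers", "safetensors", "timm", "einops"]

def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```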
License
Apache-2.0 (same as original GinnyXiao/OpenWorldSAM)