FC-CLIP: Open-Vocabulary Panoptic Segmentation
FC-CLIP is an open-vocabulary panoptic segmentation model that pairs a frozen ConvNeXt-Large CLIP backbone with a lightweight Mask2Former decoder. It achieves strong zero-shot performance without requiring separate specialist models for things vs. stuff.
This repository hosts the COCO Panoptic checkpoint, uploaded to the Hugging Face Hub by Claude for testing with the FiftyOne Model Zoo.
Attribution
Paper: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP"
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen.
NeurIPS 2023 · arXiv 2308.02487
Original code: bytedance/fc-clip (MIT License)
Usage
Standalone (trust_remote_code)
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("neerajaabhyankar/fc-clip", trust_remote_code=True)
model.eval()
# Preprocess: RGB uint8 numpy/PIL -> normalised tensor
pixel_values = model.preprocess_image(your_pil_image)  # [1, 3, H, W]
with torch.no_grad():
    results = model(pixel_values)
panoptic_seg, segments_info = results[0]
# panoptic_seg: int32 tensor [H, W], pixel -> segment id
# segments_info: list[{"id", "category_id", "isthing"}]
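The panoptic output packs every segment into one id map, so a common next step is to split it into one boolean mask per segment. The sketch below assumes the output layout described in the comments above (an `[H, W]` id map plus a list of dicts with `id`, `category_id`, and `isthing`). The helper name `segment_masks` is hypothetical and not part of the model's API. It works on a NumPy array, so convert the tensor first, for example with `panoptic_seg.cpu().numpy()`.

```python
import numpy as np


def segment_masks(panoptic_seg, segments_info):
    """Split a panoptic id map into one boolean mask per segment.

    panoptic_seg: 2-D array-like of segment ids, shape [H, W].
    segments_info: list of dicts with "id", "category_id", "isthing".
    """
    seg = np.asarray(panoptic_seg)
    masks = []
    for info in segments_info:
        mask = seg == info["id"]  # True where the pixel belongs to this segment
        masks.append(
            {
                "category_id": info["category_id"],
                "isthing": info["isthing"],
                "mask": mask,
                "area": int(mask.sum()),  # number of pixels in the segment
            }
        )
    return masks
```

Pixels whose id does not appear in `segments_info` (for example an unlabeled background id, if the model emits one) are simply left out of every mask.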
Open-vocabulary (custom classes)
results = model(pixel_values, class_names=["cat", "dog", "sky", "grass"])
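With custom classes, the returned `category_id` values need to be mapped back to readable names. The sketch below assumes that `category_id` is a zero-based index into the `class_names` list passed to the model. That is an assumption about this checkpoint's output, not something the card states, so verify it on a known image. The helper `label_segments` is hypothetical.

```python
def label_segments(segments_info, class_names):
    """Attach a "label" to each segment, assuming category_id indexes class_names."""
    labelled = []
    for info in segments_info:
        idx = info["category_id"]
        # Assumption: zero-based index into class_names; out-of-range ids get None.
        name = class_names[idx] if 0 <= idx < len(class_names) else None
        labelled.append({**info, "label": name})
    return labelled
```

Ids outside the list are mapped to `None` rather than raising, so an unexpected indexing convention shows up in the output instead of crashing.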
Architecture
| Component | Detail |
|---|---|
| Backbone | OpenCLIP ConvNeXt-Large (convnext_large_d_320, laion2b_s29b_b131k_ft_soup), frozen |
| Pixel decoder | 6-layer Multi-Scale Deformable Attention encoder + 1-level FPN |
| Transformer decoder | 5-layer Mask2Former cross-attention decoder, 250 queries |
| Text classification | ViLD 14-template ensemble + geometric in-vocab/out-vocab blending |
| Classes | 133 COCO panoptic (80 things + 53 stuff) |
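The "geometric in-vocab/out-vocab blending" row can be illustrated with a small sketch. It assumes the usual weighted geometric mean, `p = p_in^(1 - w) * p_out^w`, where `p_in` comes from the trained (in-vocabulary) classifier and `p_out` from the frozen CLIP features, with one weight for classes seen in training and another for unseen classes. The function name and the default weights `alpha=0.4` and `beta=0.8` are assumptions for illustration and are not read from this checkpoint's code or config.

```python
def geometric_ensemble(p_in, p_out, seen, alpha=0.4, beta=0.8):
    """Blend two per-class probability lists with a weighted geometric mean.

    p_in:  probabilities from the trained in-vocabulary classifier.
    p_out: probabilities from the frozen CLIP (out-of-vocabulary) classifier.
    seen:  booleans, True where the class was part of the training vocabulary.
    alpha, beta: weight on p_out for seen / unseen classes (assumed values).
    """
    blended = []
    for pi, po, is_seen in zip(p_in, p_out, seen):
        w = alpha if is_seen else beta  # unseen classes lean more on frozen CLIP
        blended.append((pi ** (1.0 - w)) * (po ** w))
    return blended
```

The result is not renormalised here. If downstream code needs a probability distribution, divide each entry by the sum of the list.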
Requirements
torch torchvision transformers open_clip_torch safetensors
No detectron2 required: the model is self-contained.
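Before loading the model, it can help to confirm that the listed packages are importable. The sketch below uses only the standard library. The mapping from pip names to import names is an assumption for this list (notably `open_clip_torch` imports as `open_clip`), and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# pip package name -> import name (assumed mapping for the packages listed above)
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "open_clip_torch": "open_clip",
    "safetensors": "safetensors",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip for pip, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with `pip install` using the names printed.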
License
MIT (same as original bytedance/fc-clip)