FC-CLIP: Open-Vocabulary Panoptic Segmentation

FC-CLIP is an open-vocabulary panoptic segmentation model that pairs a frozen ConvNeXt-Large CLIP backbone with a lightweight Mask2Former decoder. It achieves strong zero-shot performance without requiring separate specialist models for things vs. stuff.

This repository hosts the COCO Panoptic checkpoint, uploaded to the Hugging Face Hub by Claude for testing with the FiftyOne Model Zoo.

Attribution

Paper: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP"
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen.
NeurIPS 2023 · arXiv 2308.02487

Original code: bytedance/fc-clip (MIT License)

Usage

Standalone (trust_remote_code)

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("neerajaabhyankar/fc-clip", trust_remote_code=True)
model.eval()

# Preprocess: RGB uint8 numpy/PIL → normalised tensor
pixel_values = model.preprocess_image(your_pil_image)  # [1, 3, H, W]

with torch.no_grad():
    results = model(pixel_values)

panoptic_seg, segments_info = results[0]
# panoptic_seg: int32 tensor [H, W], mapping each pixel → segment id
# segments_info: list[{"id", "category_id", "isthing"}]
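
As a minimal sketch of how the two outputs fit together, the helper below tallies the pixel area of each segment and attaches a class name. It assumes `panoptic_seg` has been converted to nested lists with `.tolist()`, that segment id 0 marks unlabelled pixels, and that `category_id` is a 0-based index into a class-name list. The name `summarize_segments` and these conventions are illustrative assumptions, not part of the model's API.

```python
from collections import Counter


def summarize_segments(panoptic_rows, segments_info, class_names):
    """Return one record per segment with its class name and pixel area.

    panoptic_rows: H x W nested lists of segment ids (e.g. panoptic_seg.tolist()).
    segments_info: list of dicts with "id", "category_id", "isthing".
    class_names:   list mapping category_id -> name (assumed 0-based).
    """
    # Count how many pixels carry each segment id.
    area = Counter(seg_id for row in panoptic_rows for seg_id in row)

    summary = []
    for seg in segments_info:
        summary.append({
            "id": seg["id"],
            "label": class_names[seg["category_id"]],
            "isthing": bool(seg["isthing"]),
            "area": area.get(seg["id"], 0),
        })
    # Largest segments first.
    summary.sort(key=lambda s: s["area"], reverse=True)
    return summary


# Toy 2 x 4 map; id 0 is assumed to mean "no segment".
toy_seg = [[1, 1, 2, 0],
           [1, 2, 2, 2]]
toy_info = [
    {"id": 1, "category_id": 0, "isthing": True},
    {"id": 2, "category_id": 1, "isthing": False},
]
toy_summary = summarize_segments(toy_seg, toy_info, ["cat", "grass"])
```

With real outputs, the call would look like `summarize_segments(panoptic_seg.tolist(), segments_info, class_names)`, where `class_names` is whichever list the model was run with.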

Open-vocabulary (custom classes)

results = model(pixel_values, class_names=["cat", "dog", "sky", "grass"])
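
Custom class names are classified through a prompt-template ensemble (see the Architecture table). The sketch below illustrates the general idea only: each name is inserted into several templates, the resulting embeddings are L2-normalised and averaged, and the mean is normalised again. The three templates and the `toy_embed` function are hypothetical stand-ins; the model's actual 14 templates and its CLIP text encoder are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ensemble_text_embedding(class_name, templates, embed_fn):
    """Average the normalised embeddings of all filled-in templates."""
    embs = [l2_normalize(embed_fn(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return l2_normalize(mean)


# Illustrative templates only (assumption): not the model's full template set.
TEMPLATES = [
    "a photo of a {}.",
    "This is a photo of a {}",
    "There is a {} in the scene",
]


def toy_embed(text):
    # Hypothetical stand-in for a text encoder: three simple string statistics.
    return [float(len(text)), float(text.count("a")), float(text.count(" "))]


cat_embedding = ensemble_text_embedding("cat", TEMPLATES, toy_embed)
```

Averaging over templates makes the class embedding less sensitive to the wording of any single prompt.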

Architecture

| Component | Detail |
| --- | --- |
| Backbone | OpenCLIP ConvNeXt-Large (convnext_large_d_320, laion2b_s29b_b131k_ft_soup), frozen |
| Pixel decoder | 6-layer Multi-Scale Deformable Attention encoder + 1-level FPN |
| Transformer decoder | 5-layer Mask2Former cross-attention decoder, 250 queries |
| Text classification | ViLD 14-template prompt ensemble + geometric blending of in-vocabulary and out-of-vocabulary scores |
| Classes | 133 COCO panoptic (80 things + 53 stuff) |
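
The geometric blending listed under Text classification can be sketched as a per-class weighted geometric mean of two probability vectors: one from the trained in-vocabulary classifier and one from the frozen CLIP out-of-vocabulary classifier. The function name, the `seen` flags, and the weights `alpha` and `beta` below are assumptions for illustration; the values used by this checkpoint are not stated here.

```python
def geometric_blend(p_in, p_out, seen, alpha=0.4, beta=0.8):
    """Blend two probability vectors with a per-class weighted geometric mean.

    p_in:  probabilities from the in-vocabulary (trained) classifier.
    p_out: probabilities from the out-of-vocabulary (frozen CLIP) classifier.
    seen:  True where the class was part of the training vocabulary.
    alpha, beta: weight on p_out for seen / unseen classes (illustrative values).
    """
    blended = []
    for pi, po, is_seen in zip(p_in, p_out, seen):
        w = alpha if is_seen else beta
        # Unseen classes lean more heavily on the frozen CLIP scores.
        blended.append((pi ** (1.0 - w)) * (po ** w))
    # Note: the blended scores are not renormalised to sum to 1.
    return blended


scores = geometric_blend(
    p_in=[0.7, 0.2, 0.1],
    p_out=[0.5, 0.1, 0.4],
    seen=[True, True, False],
)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Giving unseen classes a larger weight on the CLIP branch reflects the intuition that the trained classifier is less reliable on categories it never saw during training.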

Requirements

torch torchvision transformers open_clip_torch safetensors

No detectron2 required: the model is self-contained.

License

MIT (same as original bytedance/fc-clip)
