SpaceFormer — Open-Vocabulary 3D Instance Segmentation

SpaceFormer performs proposal-free, open-vocabulary 3D instance segmentation. A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs on top of the WarpConvNet SpaCeFormer sparse point backbone. A single forward pass over an RGB point cloud produces a fixed set of query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP feature against text embeddings of arbitrary class names (SigLIP2 text encoder, with prompt ensembling). The vocabulary is chosen at inference time — it is not baked into the weights — so the model can be queried with any label set.
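
The labeling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo repo's implementation: `embed_text` is a hypothetical stand-in for the SigLIP2 text encoder, the prompt templates are made up for the example, and plain cosine similarity followed by an argmax is assumed as the comparison.

```python
import numpy as np

# Hypothetical prompt templates; the released recipe's templates may differ.
TEMPLATES = ("a photo of a {}.", "a {} in a room.")


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def ensemble_text_embeddings(class_names, embed_text, templates=TEMPLATES):
    """Average the normalized embeddings of every prompt for each class.

    `embed_text` is any callable mapping a list of strings to an [M, D] array;
    in practice it would wrap the SigLIP2 text encoder (assumption).
    """
    per_class = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = l2_normalize(embed_text(prompts))   # [M, D], unit length
        per_class.append(emb.mean(axis=0))        # prompt ensembling
    return l2_normalize(np.stack(per_class))      # [C, D]


def label_queries(clip_feat, text_emb):
    """Assign each query the class whose text embedding is most similar."""
    sim = l2_normalize(clip_feat) @ text_emb.T    # [Q, C] cosine similarities
    return sim.argmax(axis=1), sim.max(axis=1)
```

Because the class names only enter through `text_emb`, switching vocabularies means recomputing the text embeddings; the per-query features from the forward pass can be reused.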

Project page: https://nvlabs.github.io/SpaCeFormer/

Model details

Task: open-vocabulary 3D instance segmentation on RGB point clouds.
Architecture: WarpConvNet SpaCeFormer backbone (mixed space/curve sparse attention U-Net, ssccc encoder) → proposal-free query decoder (hidden dim 512, 200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness + per-point mask + per-query CLIP heads. ~85.8M parameters.
CLIP/text embedding: google/siglip2-so400m-patch14-224 (1152-d), used only at inference to embed class names; not stored in this checkpoint.
Input: point coordinates in meters + RGB; voxelized internally at 2 cm.
Naming: spaceformer_512_siglip2_ssccc = hidden dim 512 · SigLIP2 embedding · ssccc encoder attention (space, space, curve, curve, curve).

Evaluation

Test-set mAP with the released recipe (prompt ensembling on, TTA off, default proposal-free post-processing):

Benchmark	mAP	mAP50	recall (class-agnostic)
ScanNet200	0.1265	0.210	0.756
ScanNet++	0.2217	—	—
Replica	0.2644	—	—

How to use

The model lives in WarpConvNet as warpconvnet.models.spaceformer (the backbone needs WarpConvNet's compiled CUDA extension — install a pre-built wheel or build from source). It returns raw predictions; open-vocab labeling + mask post-processing live in the demo repo / HuggingFace Space, not in WarpConvNet.

import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")

net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)          # 487 tensors, strict=False

# inputs are prepared by the caller: coord [N,3] float, in meters; feat [N,3] RGB scaled to [-1,1]; offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
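
For reference, here is a minimal sketch of preparing the three input arrays for a single scene from raw points and 8-bit colors. The helper name `prepare_scene`, the linear color mapping, and the int64 offset dtype are assumptions for illustration; the demo repo is the authority on the exact preprocessing.

```python
import numpy as np


def prepare_scene(xyz_m, rgb_uint8):
    """Build coord / feat / offset arrays for one scene (hypothetical helper).

    xyz_m:     [N, 3] point coordinates in meters.
    rgb_uint8: [N, 3] colors in 0..255.
    """
    coord = np.asarray(xyz_m, dtype=np.float32)
    rgb = np.asarray(rgb_uint8, dtype=np.float32)
    if coord.ndim != 2 or coord.shape[1] != 3 or rgb.shape != coord.shape:
        raise ValueError("expected matching [N, 3] coordinates and colors")
    feat = rgb / 127.5 - 1.0  # assumed linear map from 0..255 to [-1, 1]
    # Single scene: offset is [0, N]. The dtype is an assumption.
    offset = np.array([0, coord.shape[0]], dtype=np.int64)
    return coord, feat, offset
```

The arrays can then be converted with `torch.from_numpy(...)` and moved to the target device before being passed to `net`.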

To turn clip_feat into open-vocabulary labels (SigLIP2 text + prompt ensembling) and clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space (pipeline.py, clip_eval.py, text_encoder.py, postprocessing.py, labels.py) — e.g. its inference.py CLI or the Gradio app.py.
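
As a rough idea of what the mask cleanup involves, the sketch below binarizes the per-point masks, drops masks with too few points, and applies greedy mask NMS. It is not the demo repo's `postprocessing.py`: the function name, the thresholds, and the assumption that `mask` holds logits (so that 0 is the decision boundary) are all illustrative.

```python
import numpy as np


def postprocess_masks(mask_logits, scores, min_points=100, iou_thresh=0.5):
    """Binarize query masks, drop small ones, then apply greedy mask NMS.

    mask_logits: [N, Q] per-point mask logits for one scene (assumed logits).
    scores:      [Q] per-query confidence, e.g. an objectness probability.
    Returns the kept query indices (best first) and the [N, Q] boolean masks.
    """
    masks = np.asarray(mask_logits) > 0.0  # equivalent to sigmoid(logit) > 0.5
    sizes = masks.sum(axis=0)
    # Visit queries from highest to lowest score, skipping tiny masks.
    order = [q for q in np.argsort(-np.asarray(scores)) if sizes[q] >= min_points]
    kept = []
    for q in order:
        duplicate = False
        for k in kept:
            inter = np.logical_and(masks[:, q], masks[:, k]).sum()
            union = sizes[q] + sizes[k] - inter
            if union > 0 and inter / union > iou_thresh:
                duplicate = True  # overlaps a higher-scoring mask too much
                break
        if not duplicate:
            kept.append(int(q))
    return kept, masks
```

The kept indices can be used to select the matching rows of `clip_feat` before labeling.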

Intended use & limitations

Intended: research on open-vocabulary 3D scene understanding; segmenting indoor RGB point clouds (ScanNet-like) against custom class vocabularies.
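Open-vocabulary mAP is bottlenecked by semantics rather than by mask quality: rare or fine-grained classes score lower than head classes, and class-agnostic mask recall (0.756 on ScanNet200) is far higher than open-vocabulary mAP (0.1265 on the same benchmark).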
Domain: trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D) and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very different sensor domains are out of distribution.
Large scenes: very large point clouds can exceed memory during the evaluation forward pass; the inference code (single-process) skips such a scene rather than crashing.
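
The skip-on-out-of-memory behavior mentioned for large scenes can be approximated with a small wrapper like the one below. This is a sketch of the general pattern, not the demo repo's code; `run_scenes` and its arguments are hypothetical names.

```python
def run_scenes(scenes, forward, oom_errors=(MemoryError,)):
    """Run `forward` on each (name, inputs) pair, skipping scenes that run out of memory.

    With PyTorch on a GPU, pass oom_errors=(torch.cuda.OutOfMemoryError,).
    """
    results, skipped = {}, []
    for name, inputs in scenes:
        try:
            results[name] = forward(inputs)
        except oom_errors:
            skipped.append(name)  # record the scene and move on instead of aborting
    return results, skipped
```

Reporting the skipped scene names alongside the metrics makes it clear which scenes were left out of an evaluation run.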

Files

spaceformer_512_siglip2_ssccc.ckpt — weights-only Lightning state_dict (487 tensors; net.* decoder/backbone + caption_loss.logit_scale). Load via load_spaceformer_checkpoint (strips the net. prefix, strict=False).
spaceformer_512_siglip2_ssccc.ckpt.provenance.json — architecture, eval numbers, md5.
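
Since the provenance file records an md5, a downloaded checkpoint can be checked against it. The sketch below uses only the standard library; the JSON key name `"md5"` is an assumption, so inspect the file for the actual field name.

```python
import hashlib
import json


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_provenance(ckpt_path, provenance_path, key="md5"):
    """Compare the checkpoint's MD5 with the value stored under `key`.

    The key name is an assumption; check the JSON for the actual field.
    """
    with open(provenance_path) as f:
        expected = json.load(f)[key]
    return file_md5(ckpt_path) == str(expected).lower()
```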

License & usage

These weights are released for non-commercial research use only, under CC-BY-NC-4.0. Because they are derived from datasets governed by non-commercial research Terms of Use, they are not released under the permissive Apache-2.0 license that covers the code.

The model was trained on the following datasets, each of which restricts use to non-commercial research/education under its own terms — by using these weights you agree to comply with all of them:

ScanNet / ScanNet200 — ScanNet Terms of Use
ScanNet++ — ScanNet++ Terms of Use
ARKitScenes — Apple ARKitScenes license (non-commercial)
Matterport3D — Matterport3D Terms of Use (non-commercial academic)

Evaluation additionally used Replica (Replica Research Terms, non-commercial), zero-shot.

The accompanying code in WarpConvNet is licensed separately under Apache-2.0.

Note: this is not legal advice; for commercial use, consult the individual dataset licensors. Please also cite the datasets above and the SpaceFormer project.
