SpaceFormer: Open-Vocabulary 3D Instance Segmentation
SpaceFormer performs proposal-free, open-vocabulary 3D instance segmentation.
A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
on top of the WarpConvNet SpaCeFormer sparse
point backbone. A single forward pass over an RGB point cloud produces a fixed set of
query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
feature against text embeddings of arbitrary class names (SigLIP2 text encoder, with
prompt ensembling). The vocabulary is chosen at inference time, not baked into the
weights, so the model can be queried with any label set.
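The labeling step described above can be sketched as follows. This is a minimal illustration, not the demo repo's implementation: it assumes prompt ensembling means averaging the normalized embeddings of several prompt templates per class, and it uses random vectors in place of real SigLIP2 text embeddings. The helper names (`l2_normalize`, `ensemble_text_embeddings`, `label_queries`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_text_embeddings(prompt_embeds):
    """Average per-prompt embeddings into one unit vector per class.

    prompt_embeds: [C, P, D] array (C classes, P prompt templates, D dims).
    """
    per_prompt = l2_normalize(prompt_embeds)      # normalize each prompt embedding
    return l2_normalize(per_prompt.mean(axis=1))  # [C, D]

def label_queries(query_feats, class_embeds):
    """Return (best class index, cosine score) for each query feature."""
    sims = l2_normalize(query_feats) @ class_embeds.T  # [Q, C]
    return sims.argmax(axis=1), sims.max(axis=1)

# Placeholder data: random vectors stand in for SigLIP2 text embeddings (1152-d).
rng = np.random.default_rng(0)
class_names = ["chair", "table", "lamp"]
prompt_embeds = rng.normal(size=(3, 4, 1152))  # 3 classes x 4 prompt templates
class_embeds = ensemble_text_embeddings(prompt_embeds)

# Fake per-query features: noisy copies of classes 2, 0, 1.
query_feats = class_embeds[[2, 0, 1]] + 0.01 * rng.normal(size=(3, 1152))
labels, scores = label_queries(query_feats, class_embeds)
print([class_names[i] for i in labels])
```

Because the class embeddings are computed from text only, swapping `class_names` for a different label set changes the output labels without touching the model weights.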
Project page: https://nvlabs.github.io/SpaCeFormer/
Model details
- Task: open-vocabulary 3D instance segmentation on RGB point clouds.
- Architecture: WarpConvNet SpaCeFormer backbone (mixed space/curve sparse attention U-Net, ssccc encoder) → proposal-free query decoder (hidden dim 512, 200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness + per-point mask + per-query CLIP heads. ~85.8M parameters.
- CLIP/text embedding: google/siglip2-so400m-patch14-224 (1152-d), used only at inference to embed class names; not stored in this checkpoint.
- Input: point coordinates in meters + RGB; voxelized internally at 2 cm.
- Naming: spaceformer_512_siglip2_ssccc = hidden dim 512 · SigLIP2 embedding · ssccc encoder attention (space, space, curve, curve, curve).
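To give a feel for what 2 cm voxelization does to an input cloud, here is a minimal sketch. It is an assumption-laden illustration, not WarpConvNet's internal routine: it keeps the first point that falls into each voxel, whereas the real implementation may aggregate points differently. The function name `voxelize` is hypothetical.

```python
import numpy as np

def voxelize(coord, feat, voxel_size=0.02):
    """Keep one point per occupied voxel (the first one encountered).

    coord: [N, 3] positions in meters; feat: [N, C] per-point features.
    Returns the kept coordinates, features, and integer voxel indices.
    """
    grid = np.floor(coord / voxel_size).astype(np.int64)  # integer voxel index
    _, keep = np.unique(grid, axis=0, return_index=True)  # first point per voxel
    keep = np.sort(keep)                                   # preserve input order
    return coord[keep], feat[keep], grid[keep]

coord = np.array([
    [0.001, 0.001, 0.001],  # voxel (0, 0, 0)
    [0.015, 0.005, 0.019],  # same voxel, dropped
    [0.025, 0.000, 0.000],  # voxel (1, 0, 0)
    [1.005, 1.005, 1.005],  # voxel (50, 50, 50)
])
feat = np.zeros((4, 3))
vox_coord, vox_feat, vox_grid = voxelize(coord, feat)
print(len(coord), "points ->", len(vox_coord), "voxels")
```

Since the model voxelizes internally, this step is not something callers need to run; it only shows why points closer together than about 2 cm cannot be separated by the network.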
Evaluation
Test-set mAP with the released recipe (prompt ensembling on, TTA off, default proposal-free post-processing):
| Benchmark | mAP | mAP50 | recall (class-agnostic) |
|---|---|---|---|
| ScanNet200 | 0.1265 | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | - | - |
| Replica | 0.2644 | - | - |
How to use
The model lives in WarpConvNet as warpconvnet.models.spaceformer (the backbone needs
WarpConvNet's compiled CUDA extension: install a pre-built wheel or build from source).
It returns raw predictions; open-vocab labeling + mask post-processing live in the
demo repo / HuggingFace Space, not in WarpConvNet.
```python
import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")
net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)  # 487 tensors, strict=False

# coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
```
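The snippet above expects `coord`, `feat` and `offset` to exist already. One possible way to build them for a single scene is sketched below, assuming the colors are stored as 8-bit RGB (0 to 255) and that `offset` is an integer array; both are assumptions, as is the helper name `prepare_inputs`.

```python
import numpy as np

def prepare_inputs(xyz_m, rgb_uint8):
    """Build the coord / feat / offset arrays for a single scene.

    xyz_m:     [N, 3] point positions in meters.
    rgb_uint8: [N, 3] colors in 0..255 (assumed storage format).
    """
    coord = np.asarray(xyz_m, dtype=np.float32)                   # [N, 3], meters
    feat = np.asarray(rgb_uint8, dtype=np.float32) / 127.5 - 1.0  # [N, 3] in [-1, 1]
    offset = np.array([0, len(coord)], dtype=np.int64)            # [0, N]; dtype assumed
    return {"coord": coord, "feat": feat, "offset": offset}

batch = prepare_inputs(
    xyz_m=[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    rgb_uint8=[[0, 0, 0], [255, 255, 255]],
)
print(batch["feat"].min(), batch["feat"].max(), batch["offset"])
```

Each array would still need converting to a tensor on the model's device before the forward pass, for example with `torch.from_numpy(a).to(device)`.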
To turn clip_feat into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
(pipeline.py, clip_eval.py, text_encoder.py, postprocessing.py, labels.py),
e.g. its inference.py CLI or the Gradio app.py.
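For orientation, the mask clean-up mentioned above (min-points filtering plus NMS) could look roughly like the sketch below. It is not the code in postprocessing.py: it assumes masks have already been binarized and transposed to one row per query, uses a greedy IoU-based NMS, and the thresholds and the function name `filter_and_nms` are placeholders.

```python
import numpy as np

def filter_and_nms(masks, scores, min_points=100, iou_thresh=0.5):
    """Drop tiny masks, then greedily suppress near-duplicate masks.

    masks:  [Q, N] boolean array, one row per query.
    scores: [Q] confidence per query.
    Returns indices of kept queries, highest score first.
    """
    sizes = masks.sum(axis=1)
    # Visit queries from highest to lowest score, skipping masks that are too small.
    order = [i for i in np.argsort(-scores) if sizes[i] >= min_points]
    kept = []
    for i in order:
        duplicate = False
        for j in kept:
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = max(sizes[i] + sizes[j] - inter, 1)  # guard against empty masks
            if inter / union > iou_thresh:
                duplicate = True
                break
        if not duplicate:
            kept.append(int(i))
    return kept

# Toy scene with 10 points and 4 query masks.
masks = np.zeros((4, 10), dtype=bool)
masks[0, 0:6] = True   # 6 points
masks[1, 0:5] = True   # overlaps query 0 (IoU 5/6)
masks[2, 6:10] = True  # separate object
masks[3, 9] = True     # single point, below min_points
scores = np.array([0.9, 0.8, 0.7, 0.95])
kept = filter_and_nms(masks, scores, min_points=2, iou_thresh=0.5)
print(kept)
```

Query 3 has the highest score but is removed by the size filter, and query 1 is suppressed because it mostly duplicates query 0.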
Intended use & limitations
- Intended: research on open-vocabulary 3D scene understanding; segmenting indoor RGB point clouds (ScanNet-like) against custom class vocabularies.
- Open-vocab mAP is semantics-bottlenecked: rare/fine-grained classes are weaker than head classes, and class-agnostic mask recall is much higher than open-vocab mAP (0.756 vs. 0.1265 on ScanNet200).
- Domain: trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D) and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very different sensor domains are out of distribution.
- Large scenes: very large point clouds can exceed memory during the evaluation forward pass; the inference code skips such a scene (single-process) rather than crashing.
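The skip-on-out-of-memory behavior noted in the last bullet can be approximated in your own loop along these lines. This is a generic sketch rather than the demo repo's code; `run_scenes` and `fake_forward` are hypothetical names, and on a GPU you would likely pass `torch.cuda.OutOfMemoryError` in `oom_errors`.

```python
def run_scenes(scenes, forward, oom_errors=(MemoryError,)):
    """Run `forward` on each scene, skipping the ones that run out of memory.

    scenes:     dict mapping scene name -> input batch.
    forward:    callable that takes a batch and returns predictions.
    oom_errors: exception types treated as "out of memory".
    """
    results, skipped = {}, []
    for name, batch in scenes.items():
        try:
            results[name] = forward(batch)
        except oom_errors:
            skipped.append(name)  # record the scene and move on instead of crashing
    return results, skipped

def fake_forward(batch):
    """Stand-in for the model: fails on scenes above an arbitrary size."""
    if batch["n_points"] > 1000:
        raise MemoryError("scene too large")
    return batch["n_points"] * 2

scenes = {"small": {"n_points": 10}, "huge": {"n_points": 5000}}
results, skipped = run_scenes(scenes, fake_forward)
print(results, skipped)
```

Returning the skipped names makes it easy to report which scenes were excluded from an evaluation run.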
Files
- spaceformer_512_siglip2_ssccc.ckpt: weights-only Lightning state_dict (487 tensors; net.* decoder/backbone + caption_loss.logit_scale). Load via load_spaceformer_checkpoint (strips the net. prefix, strict=False).
- spaceformer_512_siglip2_ssccc.ckpt.provenance.json: architecture, eval numbers, md5.
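The prefix stripping done by the loader can be pictured with plain dictionaries, as in the sketch below. It is an illustration of the idea only, not the body of `load_spaceformer_checkpoint`; `strip_prefix` and the `net.*` key names are made up for the example, and integers stand in for tensors.

```python
def strip_prefix(state_dict, prefix="net."):
    """Return a copy of `state_dict` with `prefix` removed from matching keys.

    Keys without the prefix are kept unchanged; a non-strict load
    (strict=False) tolerates such extra keys.
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned

# Hypothetical key names; only caption_loss.logit_scale is named in this card.
ckpt_sd = {
    "net.backbone.w": 1,
    "net.decoder.w": 2,
    "caption_loss.logit_scale": 3,
}
clean = strip_prefix(ckpt_sd)
print(sorted(clean))
```

Lightning saves weights under the attribute name of the wrapped module, which is why the keys carry a `net.` prefix that a bare model does not expect.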
License & usage
These weights are released for non-commercial research use only, under CC-BY-NC-4.0. They are a derivative of datasets governed by non-commercial research Terms of Use, so they are not released under the permissive Apache-2.0 license that covers the code.
The model was trained on the following datasets, each of which restricts use to non-commercial research/education under its own terms. By using these weights you agree to comply with all of them:
- ScanNet / ScanNet200: ScanNet Terms of Use
- ScanNet++: ScanNet++ Terms of Use
- ARKitScenes: Apple ARKitScenes license (non-commercial)
- Matterport3D: Matterport3D Terms of Use (non-commercial academic)
Evaluation additionally used Replica (Replica Research Terms, non-commercial), zero-shot.
The accompanying code in WarpConvNet is licensed separately under Apache-2.0.
Note: this is not legal advice; for commercial use, consult the individual dataset licensors. Please also cite the datasets above and the SpaceFormer project.