PanoSAMic

PanoSAMic is a multi-modal semantic segmentation model for panoramic (360°) images. It integrates the frozen Segment Anything Model (SAM) encoder, modified to output multi-stage features, with a spatio-modal fusion module (MCBAM), a spherical-attention semantic decoder, and dual-view fusion to handle the distortion and edge discontinuity of equirectangular images.

What is in this repository

Only the trainable PanoSAMic components are hosted here:

  • Feature fusion blocks (MCBAM) — spatio-modal cross-attention applied to the branch features extracted by the frozen encoder
  • Semantic decoder — convolutional decoder with spherical attention and dual-view fusion head

The full model state dict has two parts: the trainable modules, which are hosted here, and the frozen SAM modules, which are not:

Module prefix        Trainable             In Hub checkpoint
feature_fuser.*      ✅ yes                 ✅ yes
semantic_decoder.*   ✅ yes                 ✅ yes
image_encoder.*      ❌ frozen (SAM ViT)    ❌ no
prompt_encoder.*     ❌ frozen (SAM)        ❌ no
mask_decoder.*       ❌ frozen (SAM)        ❌ no

The frozen SAM ViT backbone is NOT hosted here. It is downloaded separately from Meta's official release (Apache-2.0) and combined at load time. This keeps each checkpoint small and avoids redistributing the SAM weights.
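
The split described above can be pictured as a prefix filter over a state dict. The following is a minimal sketch that assumes only the module prefixes listed in the table; `split_state_dict` and the toy keys are hypothetical and are not part of the PanoSAMic code base.

```python
# Prefixes taken from the table above; everything else is illustrative.
TRAINABLE_PREFIXES = ("feature_fuser.", "semantic_decoder.")


def split_state_dict(state_dict):
    """Split a state dict into (trainable, frozen) parts by key prefix."""
    trainable, frozen = {}, {}
    for key, value in state_dict.items():
        target = trainable if key.startswith(TRAINABLE_PREFIXES) else frozen
        target[key] = value
    return trainable, frozen


# Toy state dict with placeholder values instead of tensors.
toy = {
    "feature_fuser.block0.weight": 0,
    "semantic_decoder.head.bias": 1,
    "image_encoder.patch_embed.weight": 2,
    "prompt_encoder.pe.weight": 3,
    "mask_decoder.out.weight": 4,
}
hub_part, sam_part = split_state_dict(toy)
print("hosted on the Hub:", sorted(hub_part))
print("loaded from SAM:  ", sorted(sam_part))
```

Only the first group corresponds to what a Hub checkpoint would contain; the second group comes from the separately downloaded SAM weights.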

Available checkpoints

Each variant lives in its own subfolder of dfki-av/PanoSAMic (e.g. stanford2d3ds-vith-rgbdn-fold1/model.safetensors). For Stanford2D3DS, one checkpoint is published per fold so that each can be evaluated on its own held-out split.

Checkpoint                      Backbone  Modalities  Dataset        Split
stanford2d3ds-vith-rgb-fold1    ViT-H     RGB         Stanford2D3DS  Fold 1
stanford2d3ds-vith-rgb-fold2    ViT-H     RGB         Stanford2D3DS  Fold 2
stanford2d3ds-vith-rgb-fold3    ViT-H     RGB         Stanford2D3DS  Fold 3
stanford2d3ds-vith-rgbd-fold1   ViT-H     RGB-D       Stanford2D3DS  Fold 1
stanford2d3ds-vith-rgbd-fold2   ViT-H     RGB-D       Stanford2D3DS  Fold 2
stanford2d3ds-vith-rgbd-fold3   ViT-H     RGB-D       Stanford2D3DS  Fold 3
stanford2d3ds-vith-rgbdn-fold1  ViT-H     RGB-D-N     Stanford2D3DS  Fold 1
stanford2d3ds-vith-rgbdn-fold2  ViT-H     RGB-D-N     Stanford2D3DS  Fold 2
stanford2d3ds-vith-rgbdn-fold3  ViT-H     RGB-D-N     Stanford2D3DS  Fold 3
stanford2d3ds-vitl-rgbdn-fold1  ViT-L     RGB-D-N     Stanford2D3DS  Fold 1
stanford2d3ds-vitl-rgbdn-fold2  ViT-L     RGB-D-N     Stanford2D3DS  Fold 2
stanford2d3ds-vitl-rgbdn-fold3  ViT-L     RGB-D-N     Stanford2D3DS  Fold 3
stanford2d3ds-vitb-rgbdn-fold1  ViT-B     RGB-D-N     Stanford2D3DS  Fold 1
stanford2d3ds-vitb-rgbdn-fold2  ViT-B     RGB-D-N     Stanford2D3DS  Fold 2
stanford2d3ds-vitb-rgbdn-fold3  ViT-B     RGB-D-N     Stanford2D3DS  Fold 3
matterport3d-vith-rgb           ViT-H     RGB         Matterport3D   BEV360
matterport3d-vith-rgbd          ViT-H     RGB-D       Matterport3D   BEV360
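
The subfolder names in the table appear to follow a dataset-backbone-modalities[-foldN] pattern. As a convenience, that pattern could be reproduced with a small helper like the one below; `checkpoint_subfolder` is a hypothetical name introduced here for illustration, and the pattern itself is inferred from the table rather than documented.

```python
def checkpoint_subfolder(dataset, backbone, modalities, fold=None):
    """Build a subfolder name such as 'stanford2d3ds-vith-rgbdn-fold1'.

    `fold` is omitted for checkpoints that are not split into folds
    (the Matterport3D rows in the table).
    """
    name = f"{dataset}-{backbone}-{modalities}"
    if fold is not None:
        name += f"-fold{fold}"
    return name


# All three folds of the ViT-H RGB-D-N Stanford2D3DS variant.
stanford_folds = [
    checkpoint_subfolder("stanford2d3ds", "vith", "rgbdn", fold)
    for fold in (1, 2, 3)
]
matterport = checkpoint_subfolder("matterport3d", "vith", "rgbd")
print(stanford_folds)
print(matterport)
```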

Reported results

Stanford2D3DS (3-fold cross-validation), main results:

Checkpoint                mIoU %  mAcc %  Trainable params (M)
stanford2d3ds-vith-rgb    59.62   74.11   178
stanford2d3ds-vith-rgbd   60.90   73.95   184
stanford2d3ds-vith-rgbdn  61.57   74.04   191

Encoder-size study (Stanford2D3DS, 3-fold, RGB-D-N):

Checkpoint                mIoU %  mAcc %
stanford2d3ds-vitb-rgbdn  56.68   70.49
stanford2d3ds-vitl-rgbdn  60.90   73.09
stanford2d3ds-vith-rgbdn  61.57   74.04

Matterport3D (BEV360 splits):

Checkpoint              mIoU %
matterport3d-vith-rgb   46.59
matterport3d-vith-rgbd  48.43
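
For readers unfamiliar with the metrics in the tables above, mIoU and mAcc are commonly computed from a per-class confusion matrix. The sketch below uses the standard definitions; it is not the repository's evaluation code, `miou_macc` is a hypothetical name, and skipping classes that are absent from the ground truth is an assumption about how such cases are handled.

```python
def miou_macc(confusion):
    """Return (mIoU %, mAcc %) from a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        gt = sum(confusion[c])                         # pixels labelled c
        pred = sum(confusion[r][c] for r in range(n))  # pixels predicted c
        if gt == 0:
            continue  # class absent from ground truth: skipped (assumption)
        ious.append(tp / (gt + pred - tp))
        accs.append(tp / gt)
    if not ious:
        raise ValueError("confusion matrix contains no labelled pixels")
    return 100 * sum(ious) / len(ious), 100 * sum(accs) / len(accs)


# Toy two-class example, not data from the paper.
miou, macc = miou_macc([[8, 2], [1, 9]])
print(f"mIoU = {miou:.2f} %, mAcc = {macc:.2f} %")
```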

How to reproduce

1. Environment

  • Python 3.11+
  • Install dependencies by running uv sync in a clone of the GitHub repo (pyproject.toml pins the dependencies)
  • 1× GPU with ≥16 GB VRAM for ViT-H inference (≥24 GB for training)

2. Get the frozen SAM backbone

Download the official SAM weights from Meta and place them in sam_weights/:

  • sam_vit_h_4b8939.pth
  • sam_vit_l_0b3195.pth
  • sam_vit_b_01ec64.pth

(See https://github.com/facebookresearch/segment-anything#model-checkpoints)
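
Before loading a model, it may help to check that the expected file is in place. The sketch below assumes the sam_weights/ layout and file names listed above; `find_sam_weights` is a hypothetical helper, and the `vit_l` and `vit_b` keys are assumed to follow the same pattern as the `vit_h` value used in the loading example.

```python
from pathlib import Path

# File names as listed above; the dictionary keys are an assumption.
SAM_FILES = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def find_sam_weights(vit_model, weights_dir="sam_weights"):
    """Return the path of the expected weight file, or None if it is missing."""
    path = Path(weights_dir) / SAM_FILES[vit_model]
    return path if path.is_file() else None


for variant in SAM_FILES:
    found = find_sam_weights(variant)
    print(variant, "->", found if found else "missing")
```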

3. Load a checkpoint

from panosamic.model import PanoSAMic

model = PanoSAMic.from_pretrained_panosamic(
    "dfki-av/PanoSAMic",
    subfolder="stanford2d3ds-vith-rgbdn-fold1",
    config_path="config/config_stanford2d3ds_dv.json",
    vit_model="vit_h",
    modalities=("image", "depth", "normals"),
    num_classes=13,
    sam_weights_path="./sam_weights",  # omit to auto-download from Meta's servers
)

from_pretrained_panosamic loads only the trainable weights from the Hub, initialises the frozen SAM backbone from the directory given by sam_weights_path (the weights are auto-downloaded if not present), and returns the model in eval() mode.

4. Run inference

import torch
from panosamic.model.instance_semantic_fusion import refine_semantic_with_instances

# batched_input: list of dicts, one per image.
# Each dict maps modality name → float tensor (3, H, W), values in [0, 255].
# Image must be equirectangular 2:1 (e.g. 512 × 1024).
batched_input = [{"image": image_tensor, "depth": depth_tensor, "normals": normals_tensor}]

with torch.no_grad():
    outputs = model(batched_input)

sem_preds = outputs[0]["sem_preds"]        # (num_classes, H, W) — logits
instance_masks = outputs[0]["instance_masks"]

# Instance-guided refinement: each SAM mask is assigned the majority
# semantic class within it, sharpening boundaries.
if instance_masks:
    sem_preds = refine_semantic_with_instances(sem_preds, instance_masks)

seg_map = sem_preds.argmax(dim=0)  # (H, W) — integer class indices
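
The input requirements stated in the comments above (3 channels per modality, 2:1 equirectangular aspect ratio) can be checked before calling the model. This is a minimal sketch: `check_input_shapes` is a hypothetical helper, and the requirement that all modalities share the same size is an assumption rather than something the model card states.

```python
def check_input_shapes(shapes, modalities=("image", "depth", "normals")):
    """Validate (C, H, W) shapes for one sample and return (H, W)."""
    missing = [m for m in modalities if m not in shapes]
    if missing:
        raise ValueError(f"missing modalities: {missing}")
    sizes = set()
    for name in modalities:
        channels, height, width = shapes[name]
        if channels != 3:
            raise ValueError(f"{name}: expected 3 channels, got {channels}")
        if width != 2 * height:
            raise ValueError(
                f"{name}: expected a 2:1 equirectangular image, got {height}x{width}"
            )
        sizes.add((height, width))
    if len(sizes) != 1:
        # Assumption: all modalities are pixel-aligned and share one size.
        raise ValueError(f"modalities disagree on size: {sorted(sizes)}")
    return sizes.pop()


example_shapes = {
    "image": (3, 512, 1024),
    "depth": (3, 512, 1024),
    "normals": (3, 512, 1024),
}
size = check_input_shapes(example_shapes)
print("input size (H, W):", size)
```

With real tensors, the shapes could be collected as `{name: tuple(t.shape) for name, t in batched_input[0].items()}`.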

5. Prepare the data

Use the exact splits reported in the paper: the three folds for Stanford2D3DS and the BEV360 splits for Matterport3D.

6. Run evaluation

From a released Hub checkpoint (trainable weights only, SAM loaded separately):

python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --checkpoint dfki-av/PanoSAMic \
    --subfolder stanford2d3ds-vith-rgbdn-fold1 \
    --sam_weights_path ./sam_weights \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1

From a local training run (full checkpoint including frozen backbone):

python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --experiments_path ./experiments \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1

Repeat for folds 1–3 and average the per-fold metrics to obtain the 3-fold numbers. For Matterport3D use config/config_matterport3d_dv.json, --dataset matterport3d, and the modalities listed for the chosen checkpoint.
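
The averaging step can be sketched as follows. The per-fold values are placeholders for illustration only, not results from the paper, and `average_folds` is a hypothetical helper; how evaluate.py reports its metrics is not specified here.

```python
from statistics import mean


def average_folds(per_fold):
    """Average each metric over a list of per-fold metric dictionaries."""
    metrics = per_fold[0].keys()
    return {m: mean(fold[m] for fold in per_fold) for m in metrics}


# Placeholder values, one dictionary per fold (not real results).
per_fold = [
    {"mIoU": 60.0, "mAcc": 72.0},
    {"mIoU": 62.0, "mAcc": 75.0},
    {"mIoU": 61.0, "mAcc": 75.0},
]
summary = average_folds(per_fold)
print({name: round(value, 2) for name, value in summary.items()})
```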

7. Key configuration (matches the paper)

  • Frozen SAM ViT-H, encoder depth 32, global attention at blocks [8, 16, 24, 32]
  • Batch size 8, 50 epochs, Ranger21 optimizer
  • Max LR 0.0005 (Stanford2D3DS) / 0.001 (Matterport3D)
  • Input resized to 512 × 1024
  • MCBAM window 8×8, stride 4; spherical attention kernel 7×7, stride 1
  • Dual-view shift s = W/2
  • Loss: Jaccard (Stanford2D3DS); alternating Cross-Entropy/Jaccard schedule (Matterport3D)
  • Depth preprocessed to pseudo-disparity (threshold = 99.5th percentile of train depths, rounded to nearest 10 cm), replicated to 3 channels
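
To illustrate the dual-view shift s = W/2 from the list above: if the second view is taken to be a circular horizontal shift of the panorama (an assumption; the exact operation is defined in the paper and code), the left and right image edges end up adjacent in the centre of the shifted view. The pure-Python sketch below shows this on a toy example; `roll_columns` is a hypothetical helper, and on tensors the same effect could be obtained with `torch.roll` along the width dimension.

```python
def roll_columns(rows, shift):
    """Circularly shift every row of a 2-D list to the right by `shift` columns."""
    width = len(rows[0])
    shift %= width
    if shift == 0:
        return [list(row) for row in rows]
    return [list(row[-shift:]) + list(row[:-shift]) for row in rows]


# Toy 2 x 8 "panorama"; each value is its own column index.
panorama = [list(range(8)), list(range(8))]
width = len(panorama[0])

second_view = roll_columns(panorama, width // 2)
print(second_view[0])  # columns 7 and 0 (the original seam) now sit side by side

# Shifting by W/2 a second time restores the original view.
restored = roll_columns(second_view, width // 2)
print(restored == panorama)
```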

Intended use and limitations

The model is intended for indoor panoramic semantic segmentation with RGB / RGB-D / RGB-D-N input. It was evaluated only on indoor datasets (Stanford2D3DS and Matterport3D); outdoor generalization is not guaranteed.

License and access terms

  • This model card and the released trainable weights: CC BY-NC-SA 4.0 (Attribution–NonCommercial–ShareAlike). Use is restricted to non-commercial purposes.
  • The frozen SAM backbone (downloaded separately) remains under its original Apache-2.0 license from Meta AI.

Citation

@article{chamseddine2026panosamic,
  title   = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
  author  = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
  journal = {arXiv preprint arXiv:2601.07447},
  year    = {2026}
}

Acknowledgement

Funded by the European Union as part of the projects HumanTech (Grant Agreement 101058236) and ShieldBOT (Grant Agreement 101235093).
