PanoSAMic

PanoSAMic is a multi-modal semantic segmentation model for panoramic (360°) images. It integrates the frozen Segment Anything Model (SAM) encoder, modified to output multi-stage features, with a spatio-modal fusion module (MCBAM), a spherical-attention semantic decoder, and dual-view fusion to handle the distortion and edge discontinuity of equirectangular images.

Paper: PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion (ICPR 2026)
Code: https://github.com/dfki-av/PanoSAMic
arXiv: https://arxiv.org/abs/2601.07447
Authors: Mahdi Chamseddine, Didier Stricker, Jason Rambach (DFKI / RPTU Kaiserslautern-Landau)

What is in this repository

Only the trainable PanoSAMic components are hosted here:

Feature fusion blocks (MCBAM) — spatio-modal cross-attention applied to the branch features extracted by the frozen encoder
Semantic decoder — convolutional decoder with spherical attention and dual-view fusion head

The full model state dict splits into two parts: the trainable modules, which are hosted here, and the frozen SAM modules, which are not:

Module prefix	Trainable	In Hub checkpoint
`feature_fuser.*`	✅ yes	✅ yes
`semantic_decoder.*`	✅ yes	✅ yes
`image_encoder.*`	❌ frozen (SAM ViT)	❌ no
`prompt_encoder.*`	❌ frozen (SAM)	❌ no
`mask_decoder.*`	❌ frozen (SAM)	❌ no

The frozen SAM ViT backbone is NOT hosted here. It is downloaded separately from Meta's official release (Apache-2.0) and combined at load time. This keeps each checkpoint small and avoids redistributing the SAM weights.
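
The prefix table above can be turned into a small filter that separates the trainable weights from the frozen SAM weights. This is a minimal sketch, not the repository's own export code: `split_state_dict` is a hypothetical helper, and the parameter names in the usage example are placeholders; only the two module prefixes come from the table.

```python
# Prefixes of the trainable modules, as listed in the table above.
TRAINABLE_PREFIXES = ("feature_fuser.", "semantic_decoder.")


def split_state_dict(state_dict):
    """Split a full state dict into (trainable, frozen) parts by key prefix.

    Hypothetical helper for illustration; works on any mapping of
    parameter name -> value (for example a PyTorch state dict).
    """
    trainable, frozen = {}, {}
    for name, value in state_dict.items():
        # str.startswith accepts a tuple and matches any of its entries.
        target = trainable if name.startswith(TRAINABLE_PREFIXES) else frozen
        target[name] = value
    return trainable, frozen


# Placeholder parameter names, used only to show the split.
full = {
    "feature_fuser.block0.weight": "w0",
    "semantic_decoder.head.bias": "b0",
    "image_encoder.patch_embed.weight": "w1",
    "mask_decoder.token.weight": "w2",
}
trainable, frozen = split_state_dict(full)
print(sorted(trainable))
print(sorted(frozen))
```

Only the `trainable` part would correspond to what a Hub checkpoint contains; the `frozen` part is what the separately downloaded SAM weights provide.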

Available checkpoints

Each variant lives in its own subfolder of dfki-av/PanoSAMic (e.g. stanford2d3ds-vith-rgbdn-fold1/model.safetensors). For the 3-fold Stanford2D3DS experiments, one checkpoint is published per fold so that each can be evaluated on its held-out split.

Checkpoint	Backbone	Modalities	Dataset	Split
`stanford2d3ds-vith-rgb-fold1`	ViT-H	RGB	Stanford2D3DS	Fold 1
`stanford2d3ds-vith-rgb-fold2`	ViT-H	RGB	Stanford2D3DS	Fold 2
`stanford2d3ds-vith-rgb-fold3`	ViT-H	RGB	Stanford2D3DS	Fold 3
`stanford2d3ds-vith-rgbd-fold1`	ViT-H	RGB-D	Stanford2D3DS	Fold 1
`stanford2d3ds-vith-rgbd-fold2`	ViT-H	RGB-D	Stanford2D3DS	Fold 2
`stanford2d3ds-vith-rgbd-fold3`	ViT-H	RGB-D	Stanford2D3DS	Fold 3
`stanford2d3ds-vith-rgbdn-fold1`	ViT-H	RGB-D-N	Stanford2D3DS	Fold 1
`stanford2d3ds-vith-rgbdn-fold2`	ViT-H	RGB-D-N	Stanford2D3DS	Fold 2
`stanford2d3ds-vith-rgbdn-fold3`	ViT-H	RGB-D-N	Stanford2D3DS	Fold 3
`stanford2d3ds-vitl-rgbdn-fold1`	ViT-L	RGB-D-N	Stanford2D3DS	Fold 1
`stanford2d3ds-vitl-rgbdn-fold2`	ViT-L	RGB-D-N	Stanford2D3DS	Fold 2
`stanford2d3ds-vitl-rgbdn-fold3`	ViT-L	RGB-D-N	Stanford2D3DS	Fold 3
`stanford2d3ds-vitb-rgbdn-fold1`	ViT-B	RGB-D-N	Stanford2D3DS	Fold 1
`stanford2d3ds-vitb-rgbdn-fold2`	ViT-B	RGB-D-N	Stanford2D3DS	Fold 2
`stanford2d3ds-vitb-rgbdn-fold3`	ViT-B	RGB-D-N	Stanford2D3DS	Fold 3
`matterport3d-vith-rgb`	ViT-H	RGB	Matterport3D	BEV360
`matterport3d-vith-rgbd`	ViT-H	RGB-D	Matterport3D	BEV360
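
The checkpoint names in the table follow a `dataset-backbone-modalities[-foldN]` pattern, so the loader arguments can be derived from the name. The sketch below is an illustration under assumptions: `parse_checkpoint_name` is a hypothetical helper, `vit_l` and `vit_b` are assumed by analogy with the documented `vit_h`, and the modality tuples for `rgb` and `rgbd` are inferred from the documented `("image", "depth", "normals")` tuple for `rgbdn`.

```python
# Assumed mapping from name fragments to loader arguments.
# Only "vit_h" and ("image", "depth", "normals") appear in this card;
# the other entries are inferred by analogy.
VIT_MODELS = {"vith": "vit_h", "vitl": "vit_l", "vitb": "vit_b"}
MODALITY_KEYS = {
    "rgb": ("image",),
    "rgbd": ("image", "depth"),
    "rgbdn": ("image", "depth", "normals"),
}


def parse_checkpoint_name(name):
    """Parse 'dataset-backbone-modalities[-foldN]' into a dict."""
    parts = name.split("-")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected checkpoint name: {name!r}")
    dataset, backbone, modalities = parts[:3]
    fold = None
    if len(parts) == 4:
        if not parts[3].startswith("fold"):
            raise ValueError(f"unexpected fold suffix in: {name!r}")
        fold = int(parts[3][len("fold"):])
    return {
        "dataset": dataset,
        "vit_model": VIT_MODELS[backbone],
        "modalities": MODALITY_KEYS[modalities],
        "fold": fold,  # None for checkpoints without a fold suffix
    }


print(parse_checkpoint_name("stanford2d3ds-vith-rgbdn-fold1"))
print(parse_checkpoint_name("matterport3d-vith-rgb"))
```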

Reported results

Stanford2D3DS, main results (3-fold cross-validation, averaged over folds; each row covers the three per-fold checkpoints with that prefix):

Checkpoint	mIoU %	mAcc %	Trainable params (M)
`stanford2d3ds-vith-rgb`	59.62	74.11	178
`stanford2d3ds-vith-rgbd`	60.90	73.95	184
`stanford2d3ds-vith-rgbdn`	61.57	74.04	191

Encoder-size study (Stanford2D3DS, 3-fold, RGB-D-N):

Checkpoint	mIoU %	mAcc %
`stanford2d3ds-vitb-rgbdn`	56.68	70.49
`stanford2d3ds-vitl-rgbdn`	60.90	73.09
`stanford2d3ds-vith-rgbdn`	61.57	74.04

Matterport3D (BEV360 splits):

Checkpoint	mIoU %
`matterport3d-vith-rgb`	46.59
`matterport3d-vith-rgbd`	48.43
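
For readers unfamiliar with the two metrics in these tables, the sketch below computes mIoU and mAcc from a confusion matrix using their standard definitions. It is an illustration only: `miou_and_macc` is a hypothetical helper, and the handling of classes that never occur, as well as how pixels are accumulated across images and folds, is an assumption. The repository's evaluation script remains the reference for the reported numbers.

```python
def miou_and_macc(confusion):
    """Return (mIoU %, mAcc %) from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes with no true and no predicted pixels are skipped (assumption).
    """
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labeled c
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accs.append(tp / (tp + fn))
    if not ious or not accs:
        raise ValueError("confusion matrix contains no pixels")
    return 100 * sum(ious) / len(ious), 100 * sum(accs) / len(accs)


# Toy two-class example with made-up counts, not data from the paper.
miou, macc = miou_and_macc([[3, 1], [1, 5]])
print(f"mIoU = {miou:.2f} %, mAcc = {macc:.2f} %")
```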

How to reproduce

1. Environment

Python 3.11+
Install the dependencies by running uv sync in a clone of the GitHub repo (pyproject.toml pins them)
1× GPU with ≥16 GB VRAM for ViT-H inference (≥24 GB for training)

2. Get the frozen SAM backbone

Download the official SAM weights from Meta and place them in sam_weights/:

sam_vit_h_4b8939.pth
sam_vit_l_0b3195.pth
sam_vit_b_01ec64.pth

(See https://github.com/facebookresearch/segment-anything#model-checkpoints)
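
Before loading a model, it can help to confirm that the expected SAM file is actually in place. The following is a minimal sketch: `find_sam_weights` is a hypothetical helper, the file names are the three listed above, and the `vit_l` and `vit_b` keys are assumed by analogy with the `vit_h` value used in the loading example.

```python
from pathlib import Path

# File names as listed above; the "vit_l" and "vit_b" keys are assumptions.
SAM_CHECKPOINT_FILES = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def find_sam_weights(vit_model, weights_dir="sam_weights"):
    """Return the path of the SAM checkpoint for vit_model, or None if absent."""
    path = Path(weights_dir) / SAM_CHECKPOINT_FILES[vit_model]
    return path if path.is_file() else None


found = find_sam_weights("vit_h")
if found is None:
    print("sam_vit_h_4b8939.pth not found in sam_weights/")
else:
    print(f"using {found}")
```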

3. Load a checkpoint

from panosamic.model import PanoSAMic

model = PanoSAMic.from_pretrained_panosamic(
    "dfki-av/PanoSAMic",
    subfolder="stanford2d3ds-vith-rgbdn-fold1",
    config_path="config/config_stanford2d3ds_dv.json",
    vit_model="vit_h",
    modalities=("image", "depth", "normals"),
    num_classes=13,
    sam_weights_path="./sam_weights",  # omit to auto-download from Meta's servers
)

from_pretrained_panosamic loads only the trainable weights from the Hub, initialises the frozen SAM backbone from the local sam_weights/ directory (auto-downloaded if not present), and returns the model in eval() mode.

4. Run inference

import torch
from panosamic.model.instance_semantic_fusion import refine_semantic_with_instances

# batched_input: list of dicts, one per image.
# Each dict maps modality name → float tensor (3, H, W), values in [0, 255].
# Image must be equirectangular 2:1 (e.g. H × W = 512 × 1024).
# image_tensor, depth_tensor and normals_tensor are inputs you have loaded beforehand.
batched_input = [{"image": image_tensor, "depth": depth_tensor, "normals": normals_tensor}]

with torch.no_grad():
    outputs = model(batched_input)

sem_preds = outputs[0]["sem_preds"]        # (num_classes, H, W) — logits
instance_masks = outputs[0]["instance_masks"]

# Instance-guided refinement: each SAM mask is assigned the majority
# semantic class within it, sharpening boundaries.
if instance_masks:
    sem_preds = refine_semantic_with_instances(sem_preds, instance_masks)

seg_map = sem_preds.argmax(dim=0)  # (H, W) — integer class indices
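
The input requirements stated in the comments above (one entry per modality, shape (3, H, W), 2:1 aspect ratio) can be checked before calling the model. This is a sketch under assumptions: `check_sample` is a hypothetical helper that is not part of the PanoSAMic package, and the requirement that all modalities share one resolution is assumed rather than documented here. It only relies on each value having a `.shape` attribute, as PyTorch tensors and NumPy arrays do.

```python
def check_sample(sample, modalities=("image", "depth", "normals")):
    """Validate one input dict and return its (H, W).

    Hypothetical pre-flight check; raises KeyError or ValueError on problems.
    """
    missing = [m for m in modalities if m not in sample]
    if missing:
        raise KeyError(f"missing modalities: {missing}")
    shapes = {m: tuple(sample[m].shape) for m in modalities}
    for name, shape in shapes.items():
        if len(shape) != 3 or shape[0] != 3:
            raise ValueError(f"{name}: expected shape (3, H, W), got {shape}")
        _, height, width = shape
        if width != 2 * height:
            raise ValueError(f"{name}: expected 2:1 equirectangular, got {height} x {width}")
    # Assumption: all modalities are pixel-aligned and share one resolution.
    if len(set(shapes.values())) != 1:
        raise ValueError(f"modalities differ in shape: {shapes}")
    return shapes[modalities[0]][1:]
```

A call such as `check_sample(batched_input[0])` would return `(512, 1024)` for inputs of the size used in the paper's configuration.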

5. Prepare the data

Use the exact splits reported in the paper:

Stanford2D3DS: the authors' 3-fold cross-validation splits. Source: https://github.com/alexsax/2D-3D-Semantics . Preprocess with panosamic/data_preparation/ into the processed structure documented in the repo README.
Matterport3D: the BEV360 pre-processed data and splits (20-class subset) for a fair comparison. Source: https://github.com/InSAI-Lab/360BEV .

6. Run evaluation

From a released Hub checkpoint (trainable weights only, SAM loaded separately):

python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --checkpoint dfki-av/PanoSAMic \
    --subfolder stanford2d3ds-vith-rgbdn-fold1 \
    --sam_weights_path ./sam_weights \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1

From a local training run (full checkpoint including frozen backbone):

python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --experiments_path ./experiments \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1

Repeat for folds 1–3 and average for the 3-fold numbers. For Matterport3D use config/config_matterport3d_dv.json, --dataset matterport3d, and the modalities for that row.
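
Repeating the Hub evaluation for all three folds can be scripted. The sketch below only assembles the command lines, using the flags shown in the first command above; `hub_eval_commands` is a hypothetical helper, and it assumes that the per-fold subfolder and the `--fold` value are the only arguments that change between runs.

```python
import shlex


def hub_eval_commands(dataset_path, folds=(1, 2, 3)):
    """Build one evaluate.py argument list per fold (RGB-D-N, ViT-H)."""
    commands = []
    for fold in folds:
        commands.append([
            "python", "panosamic/evaluation/evaluate.py",
            "--dataset_path", dataset_path,
            "--config_path", "config/config_stanford2d3ds_dv.json",
            "--checkpoint", "dfki-av/PanoSAMic",
            "--subfolder", f"stanford2d3ds-vith-rgbdn-fold{fold}",
            "--sam_weights_path", "./sam_weights",
            "--dataset", "stanford2d3ds",
            "--fold", str(fold),
            "--vit_model", "vit_h",
            "--modalities", "image,depth,normals",
            "--num_gpus", "1",
        ])
    return commands


# Print the commands; pass each list to subprocess.run to execute it.
for cmd in hub_eval_commands("/path/to/processed/dataset"):
    print(shlex.join(cmd))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)`; the per-fold metrics reported by the script are then averaged by hand or in a follow-up step.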

7. Key configuration (matches the paper)

Frozen SAM ViT-H, encoder depth 32, global attention at blocks [8, 16, 24, 32]
Batch size 8, 50 epochs, Ranger21 optimizer
Max LR 0.0005 (Stanford2D3DS) / 0.001 (Matterport3D)
Input resized to 512 × 1024
MCBAM window 8×8, stride 4; spherical attention kernel 7×7, stride 1
Dual-view shift s = W/2
Loss: Jaccard (Stanford2D3DS); alternating Cross-Entropy/Jaccard schedule (Matterport3D)
Depth preprocessed to pseudo-disparity (threshold = 99.5th percentile of train depths, rounded to nearest 10 cm), replicated to 3 channels
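
One way to read the dual-view shift s = W/2 in the list above is as a circular horizontal roll of the equirectangular image, so that the left/right seam of the first view lands in the middle of the second. The sketch below illustrates that reading on nested lists. It is an assumption about the shift only: `shift_view` is a hypothetical helper, and how the two views are fused inside the decoder is described in the paper and not reproduced here.

```python
def shift_view(rows, shift=None):
    """Circularly shift every row of an H x W grid to the right.

    shift defaults to W // 2, the dual-view shift listed above.
    """
    width = len(rows[0])
    if shift is None:
        shift = width // 2
    shift %= width
    if shift == 0:
        return [list(row) for row in rows]
    # Columns pushed past the right edge wrap around to the left edge.
    return [list(row[-shift:]) + list(row[:-shift]) for row in rows]


# Toy 2 x 4 "panorama": the seam between columns 3 and 0 moves to the center.
view_a = [[0, 1, 2, 3],
          [4, 5, 6, 7]]
view_b = shift_view(view_a)
print(view_b)
print(shift_view(view_b) == view_a)
```

With an even width, applying the W/2 shift twice restores the original view, which is what allows predictions from the shifted view to be rolled back and compared with the unshifted ones. On tensors, `torch.roll` along the width dimension performs the same operation.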

Intended use and limitations

PanoSAMic is intended for indoor panoramic semantic segmentation with RGB, RGB-D, or RGB-D-N input. It has been evaluated only on indoor datasets; outdoor generalization is not guaranteed.

License and access terms

This model card and the released trainable weights: CC BY-NC-SA 4.0 (Attribution–NonCommercial–ShareAlike). Use is restricted to non-commercial purposes.
The frozen SAM backbone (downloaded separately) remains under its original Apache-2.0 license from Meta AI.

Citation

@article{chamseddine2026panosamic,
  title   = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
  author  = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
  journal = {arXiv preprint arXiv:2601.07447},
  year    = {2026}
}

Acknowledgement

Funded by the European Union as part of the projects HumanTech (Grant Agreement 101058236) and ShieldBOT (Grant Agreement 101235093).

