GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain
Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara
Department of Computer Science and Engineering, Rajiv Gandhi University of Knowledge Technologies (RGUKT), Nuzvid, Andhra Pradesh, India
Paper • Code • Hugging Face • Challenge
GOOSE-M2F is a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation Challenge (ICRA 2026). The proposed framework addresses long-tailed semantic segmentation in unstructured outdoor environments through enhanced object query capacity, feature refinement, auxiliary supervision, class-balanced optimization, and robust multi-scale inference.
Official Challenge Performance: 70.08% Composite mIoU (63.55% Fine mIoU, 76.61% Coarse mIoU), achieving 3rd Place on the GOOSE 2D FGSS Challenge Leaderboard.
News
- [ICRA 2026] GOOSE-M2F achieved 3rd Place in the GOOSE 2D Fine-Grained Semantic Segmentation Challenge.
- [2026] Source code and trained models released.
- [2026] Technical report available on arXiv.
What is GOOSE-M2F?
The GOOSE dataset presents one of the most challenging real-world segmentation benchmarks: 64 fine-grained classes across diverse unstructured outdoor environments including forests, gravel paths, construction zones, and agricultural terrain, with a severely long-tailed class distribution.
GOOSE-M2F extends the baseline Mask2Former (Swin-Large backbone) with three key modifications engineered specifically for this challenge:
| Modification | Problem Solved | Impact |
|---|---|---|
| 200 Object Queries (vs 100) | Query saturation in 64-class scenes | +2-3% composite mIoU |
| Feature Refinement Module (FRM): ASPP-lite + CBAM | Over-segmentation of amorphous terrain classes | +3-4% on Vegetation/Terrain |
| Auxiliary Supervision Head at H/4 resolution | Vanishing gradients for tiny/thin classes | +5-8% on rare classes |
Architecture
Input Image [B, 3, H, W]
│
▼
Swin-Large Backbone (Hierarchical, 4 stages)
Stage 1-4: channels {192, 384, 768, 1536}, resolutions {H/4 → H/32}
│
▼
MSDeformAttn Pixel Decoder (6-layer FPN)
Output: mask_features [B, 256, H/4, W/4]
│
├──────────────────────────────────────┐
▼                                      ▼
[NEW] Feature Refinement Module        [NEW] Auxiliary Head
ASPP-lite: dilations {1, 3, 6, 12}     Conv(256→256→64)
+ Global Average Pooling               DB-weighted CE loss
+ CBAM Dual-Attention (Ch + Sp)        Supervised at H/4
│
▼
Transformer Decoder (9 layers)
[MOD] 200 Object Queries (was 100)
Masked Cross-Attention
│
▼
Class Head [B, 200, 65] × Mask Head [B, 200, H/4, W/4]
│
▼
Hungarian Matching → Semantic Prediction
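The last step of the diagram, turning per-query class scores and per-query masks into a per-pixel label map, can be illustrated with the standard Mask2Former semantic inference rule: softmax the class logits, drop the trailing "no object" column, and sum the class probabilities weighted by each query's sigmoid mask. Hungarian matching itself only assigns queries to ground-truth segments during training. The sketch below is pure Python on toy-sized lists; the function name `semantic_map` and the toy inputs are illustrative assumptions, not code from this repository.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_map(class_logits, mask_logits):
    """class_logits: [Q][C+1], last column is 'no object'.
    mask_logits:  [Q][H][W].
    Returns an [H][W] list of argmax class ids (hypothetical helper)."""
    num_classes = len(class_logits[0]) - 1
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    # Softmax over C+1 entries, then discard the 'no object' probability.
    class_probs = [softmax(row)[:num_classes] for row in class_logits]
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # score[c] = sum over queries of p_q(c) * sigmoid(mask_q[y, x])
            scores = [0.0] * num_classes
            for q, probs in enumerate(class_probs):
                m = sigmoid(mask_logits[q][y][x])
                for c in range(num_classes):
                    scores[c] += probs[c] * m
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels

# Toy example: 2 queries, 2 classes (+ no-object), a 1x2 "image".
class_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
mask_logits = [[[5.0, -5.0]], [[-5.0, 5.0]]]
print(semantic_map(class_logits, mask_logits))  # [[0, 1]]
```

In the real model the same contraction would run over 200 queries and 64 classes at H/4 resolution, typically as a single tensor einsum followed by upsampling.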
Training Strategy
| Technique | Description |
|---|---|
| Distribution-Balanced (DB) Loss | w_c = (1-β)/(1-β^(n_c)), β=0.9999. Amplifies gradients for rare classes. |
| Rare-Class Copy-Paste (RCCP) | Pre-extracted rare-class cutouts pasted onto training images at 85% probability. |
| Dynamic IoU-Aware Weights | Per-class loss weights updated every epoch from validation IoU (0% → 4x, 80%+ → 1x). |
| 10x LR Jump (V4) | Backbone 1e-5, Decoder 5e-5; broke the model out of a local minimum at ~55%. |
| EMA (decay=0.9995) | Shadow weights consistently +1.0-1.5% over raw model on validation. |
| Class-Aware Repeat Sampling | Oversamples images containing rare classes proportional to their rarity. |
| Polynomial LR Decay | Gradual decay after warmup, with annealing in final sessions. |
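The two weighting rules in the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, `class_counts` stands in for n_c (the source does not say whether it counts pixels or images), the mean-1 normalisation is an added convenience, and the linear ramp between the two stated endpoints (0% → 4x, 80%+ → 1x) is an assumed shape.

```python
def db_class_weights(class_counts, beta=0.9999):
    """w_c = (1 - beta) / (1 - beta ** n_c), rescaled to mean 1.
    The rescaling is an assumption made here for readability."""
    if any(n < 1 for n in class_counts):
        raise ValueError("every class needs at least one sample")
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def iou_aware_weight(val_iou, low_iou_weight=4.0, high_iou_weight=1.0,
                     saturation=0.80):
    """Map a validation IoU in [0, 1] to a loss multiplier.
    Endpoints follow the table; the linear interpolation is assumed."""
    clipped = min(max(val_iou, 0.0), saturation)
    t = clipped / saturation
    return low_iou_weight + (high_iou_weight - low_iou_weight) * t

# Rare classes receive much larger weights than frequent ones.
weights = db_class_weights([10, 1_000, 100_000])
print([round(w, 4) for w in weights])

print(iou_aware_weight(0.0))  # 4.0
print(iou_aware_weight(0.4))  # 2.5
print(iou_aware_weight(0.9))  # 1.0
```

One way to combine the two is to multiply them per class before passing the result to the cross-entropy loss, although the source does not state how the weights interact.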
Training Progression (V1 → V8)
| Session | Base LR | Backbone LR | Official Score |
|---|---|---|---|
| V1 (S3) | 5e-6 | 1e-6 | 50.68% |
| V2 (S4) | 5e-6 | 1e-6 | 54.62% |
| V3 (S5) | 5e-6 | 1e-6 | 55.64% |
| V4 (S6) | 5e-5 | 1e-5 | 56.38% (10x LR jump) |
| V5 (S7) | 5e-5 | 1e-5 | 57.59% |
| V6 (S8) | 5e-5 | 1e-5 | 58.58% |
| V7 (S9) | 5e-5 | 1e-5 | 59.23% |
| V8 (S10) | 2.5e-5 | 5e-6 | 59.51% (annealing) |
| Inference | - | - | 70.08% (+10.57% from TTA) |
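In every session the backbone learning rate is one fifth of the base (decoder) rate. A polynomial schedule with linear warmup that preserves this ratio could look like the sketch below. The warmup length, the decay power, and the function names are assumptions chosen for illustration; the source only states that decay is polynomial and follows a warmup.

```python
def poly_lr(step, total_steps, base_lr, warmup_steps=500, power=0.9,
            min_lr=0.0):
    """Linear warmup to base_lr, then polynomial decay towards min_lr.
    warmup_steps and power are illustrative defaults, not source values."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + (base_lr - min_lr) * (1.0 - progress) ** power

def group_lrs(step, total_steps, base_lr, backbone_ratio=0.2):
    """Per-group rates; 0.2 matches the 5e-5 / 1e-5 split in the table."""
    lr = poly_lr(step, total_steps, base_lr)
    return {"decoder": lr, "backbone": lr * backbone_ratio}

# Rates at a few points of a hypothetical 10,000-step session (V4 settings).
for step in (0, 499, 500, 5_000, 10_000):
    print(step, group_lrs(step, total_steps=10_000, base_lr=5e-5))
```

With an optimizer that supports parameter groups, the two values would be written into the decoder and backbone groups at each step.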
Inference Engine
The final performance leap from 59.51% (training) to 70.08% (submission) came entirely from the inference pipeline:
| Technique | Gain | Description |
|---|---|---|
| Dense Sliding Window | +4-5% | 896×896 crops, stride=384px (57% overlap) |
| 2D Gaussian Kernel Blending | Eliminates artifacts | Center pixels weighted higher, edges down-weighted |
| 4-Scale TTA | +3-4% | Scales: 0.5×, 0.75×, 1.0×, 1.5× |
| H-Flip TTA | +1-2% | 8 total views per image (4 scales × 2 flips) |
| EMA Weights | +1-1.5% | Shadow weights used instead of raw training weights |
| AuxHead Stripping | VRAM savings | Removed before inference; not needed for prediction |
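The sliding-window geometry and the Gaussian blending weights can be sketched without any deep-learning dependencies. The crop size and stride come from the table; the helper names, the border handling (shifting the last window back so it ends flush with the image), and the `sigma_ratio` value are assumptions for illustration.

```python
import math

def window_starts(length, crop=896, stride=384):
    """Start offsets of crops along one axis.
    The last window is shifted back to end at the image border (assumed
    policy). Images smaller than the crop get one window and need padding."""
    if length <= crop:
        return [0]
    starts = list(range(0, length - crop + 1, stride))
    if starts[-1] != length - crop:
        starts.append(length - crop)
    return starts

def gaussian_kernel(size=896, sigma_ratio=0.25):
    """Separable 2D Gaussian weight map with peak 1.0 at the crop centre.
    sigma_ratio is an illustrative choice, not a value from the source."""
    sigma = size * sigma_ratio
    center = (size - 1) / 2.0
    line = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(size)]
    return [[wy * wx for wx in line] for wy in line]

print(window_starts(2048))        # [0, 384, 768, 1152]
print(round(1 - 384 / 896, 3))    # 0.571, the 57% overlap in the table

kernel = gaussian_kernel(size=5)  # tiny kernel, just to inspect the shape
print(kernel[2][2])               # 1.0 at the centre
```

During blending, each crop's logits would be multiplied by the kernel and accumulated into a full-size buffer alongside a matching buffer of accumulated weights. Dividing the first by the second gives the blended prediction, so pixels near crop edges contribute less than pixels near crop centres.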
Project Structure
goose-m2f/
├── src/
│   ├── model.py              # GOOSEMask2Former (FRM + AuxHead + 200 queries)
│   ├── features.py           # Dataset, augmentations, EMA, metrics
│   ├── train.py              # Training engine (Trainer class)
│   └── inference.py          # Dense Gaussian patch-blending inference
├── configs/
│   ├── train_config.yaml     # All training hyperparameters
│   └── infer_config.yaml     # TTA and inference settings
├── data/raw/                 # Dataset (symlink or copy)
├── models/                   # Manually placed checkpoints
├── outputs/
│   ├── checkpoints/          # best_model.pth, latest.pth, charts
│   └── predictions/          # Output PNG predictions
├── tests/
│   └── test_model.py         # pytest unit tests
├── instructions/
│   └── instructions.md       # Full setup + usage guide
└── requirements.txt
Quick Start
1. Setup
git clone https://github.com/Aditya-Lingam-9000/GOOSE-2D-FGSS-Challenge
cd GOOSE-2D-FGSS-Challenge
conda create -n goose python=3.11 -y && conda activate goose
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
accelerate config # Configure for your GPU setup
2. Configure Paths
Edit configs/train_config.yaml:
data_dir: "/path/to/goose_dataset"
csv_path: "/path/to/goose_label_mapping.csv"
output_dir: "outputs/checkpoints/session_01"
3. Train
# Single GPU
python -m src.train --config configs/train_config.yaml
# Multi-GPU
accelerate launch --num_processes 2 -m src.train --config configs/train_config.yaml
4. Inference
Edit configs/infer_config.yaml with the checkpoint path and image directory, then:
python -m src.inference --config configs/infer_config.yaml
5. Tests
pytest tests/ -v
Results
Official Leaderboard Performance (Final Submission)
| Metric | Score |
|---|---|
| Fine mIoU | 63.55% |
| Coarse mIoU | 76.61% |
| Official Composite | 70.08% |
Coarse Category Breakdown
| Category | mIoU |
|---|---|
| Sky | 94.6% |
| Road | 91.0% |
| Vehicle | 89.8% |
| Vegetation | 89.8% |
| Construction | 75.5% |
| Terrain | 78.9% |
| Human | 62.8% |
| Sign | 62.4% |
| Water | 33.9% |
| Object | 51.3% |
| Animal | 0.0% |
Requirements
| Package | Version |
|---|---|
| torch | ≥ 2.1.0 |
| transformers | ≥ 4.38.0 |
| accelerate | ≥ 0.27.0 |
| albumentations | ≥ 1.3.1 |
| opencv-python | ≥ 4.9.0 |
| numpy | ≥ 1.24.0 |
See requirements.txt for the complete list.
Citation
If you use this work, please cite:
@techreport{---,
title = {GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain},
author = {---},
year = {2026},
institution = {---}
}
References
- Mask2Former: Cheng et al., Masked-Attention Mask Transformer for Universal Image Segmentation, CVPR 2022
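- Swin Transformer: Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
- CBAM: Woo et al., CBAM: Convolutional Block Attention Module, ECCV 2018
- DeepLab (ASPP): Chen et al., Rethinking Atrous Convolution for Semantic Image Segmentation, arXiv:1706.05587, 2017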