---
license: mit
library_name: pytorch
tags:
- image-segmentation
- scribble-supervised
- pascal-voc
- u-net
- ensemble
pipeline_tag: image-segmentation
datasets:
- pascal-voc
metrics:
- miou
model-index:
- name: scribble-segmentation
  results:
  - task:
      type: image-segmentation
      name: Scribble-supervised binary segmentation
    dataset:
      type: pascal-voc
      name: PASCAL VOC scribble subset (228 train, 226 test1, 228 test2)
    metrics:
    - type: miou
      value: 0.842
      name: Mean IoU (5-fold out-of-fold)
    - type: bg_iou
      value: 0.925
      name: Background IoU
    - type: fg_iou
      value: 0.760
      name: Foreground IoU
---

# Scribble Segmentation Ensemble

Binary foreground/background segmentation from sparse user scribbles. Honest cross-validated mean IoU of 0.842 on the PASCAL VOC scribble subset, trained from scratch with no pretrained weights. The pipeline averages two five-fold ensembles of small U-Nets with multi-scale test-time augmentation, then calibrates the result with a per-image threshold model.

## Results

| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| **CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release)** | **0.842** | **0.925** | **0.760** |

The progression reads top to bottom. Switching from per-image K-NN to a single globally trained U-Net is the largest jump, because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples (sketched below), which matters on a dataset this small. Averaging two seed twins removes some of the variance in any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images with predicted labels from the previous ensemble. The pseudo-labels are noisy (around 17% wrong on average), but the extra visual diversity wins by about 0.01 mIoU.
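
To make the recombination step concrete, here is a minimal CutMix sketch for segmentation pairs. It is not the repo's implementation; it only illustrates the key difference from classification CutMix, namely that the mask is cut and pasted together with the image rather than mixing labels:

```python
import torch

def cutmix(img_a, mask_a, img_b, mask_b, alpha=1.0):
    """Paste a random box from sample B into sample A, image and mask alike."""
    _, H, W = img_a.shape                      # (C, H, W) tensors
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut_h = int(H * (1 - lam) ** 0.5)          # box area proportional to 1 - lam
    cut_w = int(W * (1 - lam) ** 0.5)
    cy = torch.randint(H, (1,)).item()
    cx = torch.randint(W, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    out_img, out_mask = img_a.clone(), mask_a.clone()
    out_img[:, y0:y1, x0:x1] = img_b[:, y0:y1, x0:x1]
    out_mask[..., y0:y1, x0:x1] = mask_b[..., y0:y1, x0:x1]
    return out_img, out_mask
```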

For context: the original course leaderboard had 28 teams. This release would place in the top four. The winning team reached 0.868.

## Quick start

```bash
git clone https://github.com/enorenio/Challenge
cd Challenge
pip install torch numpy pillow opencv-python scipy

hf download Enorenio/scribble-segmentation --local-dir runs/

python predict_ensemble.py \
    --ckpt-dirs runs/runs_v4:64:44 runs/runs_v7_pseudo:64:47 \
    --gpu 0
```

The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions from every method on all 682 train and test images, plus an analysis of the five universally hardest cases.

## What is in this repo

Two sets of five fold checkpoints (one per cross-validation fold) plus a tiny threshold model:

| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |
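
For intuition, here is a minimal sketch of what a per-image threshold model of this shape could look like. The JSON layout (`"w"`, `"b"`) and the five features (mean, standard deviation, and three quantiles of the probability map) are illustrative assumptions; the repo's `threshold_predictor.json` defines its own features and format:

```python
import json

import numpy as np

def predict_threshold(prob, weights_path="threshold_predictor.json"):
    """Per-image binary cutoff from a linear model over probability statistics."""
    with open(weights_path) as f:
        params = json.load(f)                 # assumed {"w": [...], "b": ...}
    # Hypothetical five features summarizing the ensemble probability map.
    feats = np.array([prob.mean(), prob.std(),
                      *np.quantile(prob, [0.25, 0.5, 0.75])])
    t = float(np.dot(params["w"], feats) + params["b"])
    return float(np.clip(t, 0.05, 0.95))      # keep the cutoff in a sane range
```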

At inference time, all ten checkpoints predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
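
Below is a minimal PyTorch sketch of that loop. The `ensemble_predict` name and exact tensor shapes are mine, not the repo's; each model is assumed to map the 5-channel input to per-pixel logits, the per-image cutoff from the threshold model is passed as `threshold`, and the morphological cleanup step is omitted:

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, rgb, scribbles, threshold=0.5,
                     scales=(0.7, 1.0, 1.3)):
    """Average multi-scale, flipped predictions from all ten checkpoints.

    rgb:       (1, 3, H, W) float tensor in [0, 1]
    scribbles: (1, 2, H, W) one-hot background/foreground scribble channels
    """
    x = torch.cat([rgb, scribbles], dim=1)          # 5-channel input
    _, _, H, W = x.shape
    prob_sum, n = torch.zeros(1, 1, H, W), 0
    with torch.no_grad():
        for model in models:
            for s in scales:
                xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                                   align_corners=False)
                for flip in (False, True):
                    xi = torch.flip(xs, dims=[3]) if flip else xs
                    p = torch.sigmoid(model(xi))    # (1, 1, h, w) probabilities
                    if flip:
                        p = torch.flip(p, dims=[3])
                    prob_sum += F.interpolate(p, size=(H, W), mode="bilinear",
                                              align_corners=False)
                    n += 1
    mask = (prob_sum / n > threshold).float()
    # Hard-snap any scribbled pixel to its user-given label.
    mask[scribbles[:, 1:2] > 0] = 1.0               # foreground scribbles
    mask[scribbles[:, 0:1] > 0] = 0.0               # background scribbles
    return mask
```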

## Model details

The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs are five channels: three RGB and two one-hot scribble channels (one marks background scribbles, the other foreground scribbles). Output is per-pixel foreground probability after a sigmoid.
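
To make the input format concrete, a small sketch of packing RGB plus scribbles into the five channels. The scribble map encoding (0 = background scribble, 1 = foreground scribble, 255 = unscribbled) is an assumption for illustration; the repo may encode it differently:

```python
import numpy as np

def pack_input(rgb, scribble):
    """Assemble the 5-channel network input from an image and a scribble map.

    rgb:      (H, W, 3) uint8 image
    scribble: (H, W) int map, 0 = background, 1 = foreground, 255 = unscribbled
    """
    img = rgb.astype(np.float32) / 255.0
    bg = (scribble == 0).astype(np.float32)   # one-hot background channel
    fg = (scribble == 1).astype(np.float32)   # one-hot foreground channel
    x = np.concatenate([img.transpose(2, 0, 1), bg[None], fg[None]], axis=0)
    return x  # (5, H, W), ready for torch.from_numpy(x).unsqueeze(0)
```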

Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
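
A minimal sketch of that loss, assuming logits and targets of shape (B, 1, H, W) and a smoothing constant of 1 in the Dice term (the repo's exact smoothing is not documented here):

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1.0):
    """Equal-weight sum of BCE and soft Dice over a batch of binary masks."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + smooth) / (denom + smooth)
    return bce + dice.mean()
```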

The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
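
The pseudo-labeling pass itself reduces to running the earlier ensemble over the unlabeled images and saving the thresholded masks as training targets. A sketch, reusing the hypothetical `ensemble_predict` from above; the loader yielding (name, rgb, scribbles) triples is also assumed:

```python
from pathlib import Path

import numpy as np
from PIL import Image

def make_pseudo_labels(models, unlabeled_loader, out_dir="pseudo_labels"):
    """Save ensemble predictions on unlabeled images as 0/255 PNG masks."""
    Path(out_dir).mkdir(exist_ok=True)
    for name, rgb, scribbles in unlabeled_loader:   # assumed triples
        mask = ensemble_predict(models, rgb, scribbles)
        arr = (mask[0, 0].numpy() * 255).astype(np.uint8)
        Image.fromarray(arr).save(Path(out_dir) / f"{name}.png")
```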

## Strengths and weaknesses

Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.

Three kinds of cases break it. Low-contrast figure-ground, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends. Cluttered scenes where many objects resemble the target, like a bicycle frame surrounded by other metal parts in a junkyard. Thin or articulated structures where parts of one object look disconnected, like the spokes and frame segments of a bicycle. The "Hardest 5" tab in the demo walks through specific examples of each.

## Limitations

**Binary only.** This model predicts foreground vs. background, not multi-class semantic segmentation.

**Scribbles required.** Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.

**Trained from scratch.** The original course rules forbade pretrained encoders. With a pretrained backbone the same pipeline would likely add five to ten mIoU points.

**PASCAL VOC domain.** Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.

## Citation

```bibtex
@misc{morshnev2025scribbleseg,
  author = {Aleksey Morshnev},
  title  = {Scribble Segmentation Ensemble},
  year   = {2025},
  url    = {https://github.com/enorenio/Challenge}
}
```

## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.