---
license: mit
library_name: pytorch
tags:
- image-segmentation
- scribble-supervised
- pascal-voc
- u-net
- ensemble
pipeline_tag: image-segmentation
datasets:
- pascal-voc
metrics:
- miou
model-index:
- name: scribble-segmentation
  results:
  - task:
      type: image-segmentation
      name: Scribble-supervised binary segmentation
    dataset:
      type: pascal-voc
      name: PASCAL VOC scribble subset (228 train, 226 test1, 228 test2)
    metrics:
    - type: miou
      value: 0.842
      name: Mean IoU (5-fold out-of-fold)
    - type: bg_iou
      value: 0.925
      name: Background IoU
    - type: fg_iou
      value: 0.760
      name: Foreground IoU
---

# Scribble Segmentation Ensemble

Binary foreground/background segmentation from sparse user scribbles. Honest cross-validated mean IoU of 0.842 on the PASCAL VOC scribble subset, trained from scratch with no pretrained weights. The pipeline averages two five-fold ensembles of small U-Nets under multi-scale test-time augmentation, then calibrates the result with a per-image threshold model.

## Results

| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| **CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release)** | **0.842** | **0.925** | **0.760** |

The progression reads top to bottom. Switching from per-image K-NN to a single globally trained U-Net is the largest jump because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples, which matters on a dataset this small. Averaging two seed twins cancels some of the variance in any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images carrying predicted labels from the previous ensemble. The pseudo-labels are noisy (around 17% wrong on average), but the extra visual diversity wins by about 0.01 mIoU.

For context: the original course leaderboard had 28 teams. This release would place in the top four; the winning team reached 0.868.

## Quick start

```bash
git clone https://github.com/enorenio/Challenge
cd Challenge
pip install torch numpy pillow opencv-python scipy

hf download Enorenio/scribble-segmentation --local-dir runs/

python predict_ensemble.py \
    --ckpt-dirs runs/runs_v4:64:44 runs/runs_v7_pseudo:64:47 \
    --gpu 0
```

The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions for every method on all 682 train and test images, plus an analysis of the five universally hardest cases.

## What is in this repo

Two sets of five per-fold checkpoints plus a tiny threshold model:

| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |

At inference time the ten checkpoints all predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
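
The sketch below, in PyTorch, mirrors that description. It is illustrative only: the function names, the fixed fallback threshold, and the exact morphological operation are assumptions, not the repo's code (see `predict_ensemble.py` for the real entry point).

```python
# Hedged sketch of the inference pipeline described above.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

SCALES = (0.7, 1.0, 1.3)

@torch.no_grad()
def tta_probability(models, x):
    """Average sigmoid probabilities over all models, scales, and a flip.

    x: (1, 5, H, W) float tensor -- RGB plus the two scribble channels.
    """
    h, w = x.shape[-2:]
    probs = []
    for model in models:                      # the ten fold checkpoints
        for s in SCALES:
            xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                               align_corners=False)
            for flip in (False, True):
                xi = torch.flip(xs, dims=[-1]) if flip else xs
                p = torch.sigmoid(model(xi))
                if flip:
                    p = torch.flip(p, dims=[-1])
                # Resize back to input resolution before averaging.
                probs.append(F.interpolate(p, size=(h, w), mode="bilinear",
                                           align_corners=False))
    return torch.stack(probs).mean(dim=0)

def finalize(prob, bg_scribble, fg_scribble, threshold=0.5):
    """Binarize, clean up, and hard-snap scribbled pixels.

    `threshold` stands in for the per-image value produced by
    threshold_predictor.json; 0.5 is only a fallback. The opening
    operation is an assumed stand-in for the morphological cleanup.
    """
    mask = (prob.squeeze().cpu().numpy() > threshold).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask[bg_scribble > 0] = 0   # scribbled pixels keep their user label
    mask[fg_scribble > 0] = 1
    return mask
```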

## Model details

The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs are five channels: three RGB and two one-hot scribble channels (one marks background scribbles, the other marks foreground scribbles). Output is per-pixel foreground probability after a sigmoid.
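
For concreteness, here is one plausible way to assemble that five-channel input; the channel order and [0, 1] normalization are assumptions, not documented behavior.

```python
# Hypothetical input assembly for the 5-channel U-Net.
import numpy as np
import torch

def make_input(rgb, bg_scribble, fg_scribble):
    """rgb: (H, W, 3) uint8 image; scribbles: (H, W) binary masks."""
    rgb = rgb.astype(np.float32) / 255.0           # assumed normalization
    chans = np.concatenate(
        [
            rgb.transpose(2, 0, 1),                # 3 RGB channels
            bg_scribble[None].astype(np.float32),  # 1 where user marked background
            fg_scribble[None].astype(np.float32),  # 1 where user marked foreground
        ],
        axis=0,
    )
    return torch.from_numpy(chans)[None]           # (1, 5, H, W)
```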

Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
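
The stated loss is straightforward to write down; a minimal sketch follows, with the Dice smoothing constant `eps` as an assumption.

```python
# BCE plus soft Dice at equal weights, as described above.
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1.0):
    """logits, target: (N, 1, H, W); target is the float binary mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)       # soft Dice per image
    return bce + (1 - dice).mean()                 # equal weights
```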

The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
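
A pseudo-labeling round like the one described might look like the sketch below; the loader, file naming, and 0/255 PNG output format are all assumptions (`tta_probability` is from the inference sketch above).

```python
# Hypothetical pseudo-label generation for the unlabeled test images.
from pathlib import Path

import numpy as np
from PIL import Image

def write_pseudo_labels(models, unlabeled_images, out_dir="pseudo_labels"):
    """unlabeled_images: iterable of (name, (1, 5, H, W) input tensor)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for name, x in unlabeled_images:
        prob = tta_probability(models, x)          # CutMix-ensemble prediction
        mask = ((prob.squeeze().cpu().numpy() > 0.5) * 255).astype(np.uint8)
        Image.fromarray(mask).save(out / f"{name}.png")
```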

## Strengths and weaknesses

Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.

Three kinds of cases break it: low-contrast figure-ground, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends; cluttered scenes where many objects look like the target, like a bicycle frame surrounded by other metal parts in a junkyard; and thin or articulated structures where parts of one object look disconnected, like the spokes and frame segments of a bicycle. The "Hardest 5" tab in the demo walks through specific examples of each.

## Limitations

**Binary only.** This model predicts foreground vs background, not multi-class semantic segmentation.

**Scribbles required.** Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.

**Trained from scratch.** The original course rules forbade pretrained encoders. With a pretrained backbone the same pipeline would likely add five to ten mIoU points.

**PASCAL VOC domain.** Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.

## Citation

```bibtex
@misc{morshnev2025scribbleseg,
  author = {Aleksey Morshnev},
  title  = {Scribble Segmentation Ensemble},
  year   = {2025},
  url    = {https://github.com/enorenio/Challenge}
}
```

## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.