---
license: mit
library_name: pytorch
tags:
- image-segmentation
- scribble-supervised
- pascal-voc
- u-net
- ensemble
pipeline_tag: image-segmentation
datasets:
- pascal-voc
metrics:
- miou
model-index:
- name: scribble-segmentation
  results:
  - task:
      type: image-segmentation
      name: Scribble-supervised binary segmentation
    dataset:
      type: pascal-voc
      name: PASCAL VOC scribble subset (228 train, 226 test1, 228 test2)
    metrics:
    - type: miou
      value: 0.842
      name: Mean IoU (5-fold out-of-fold)
    - type: bg_iou
      value: 0.925
      name: Background IoU
    - type: fg_iou
      value: 0.760
      name: Foreground IoU
---

# Scribble Segmentation Ensemble

Binary foreground/background segmentation from sparse user scribbles. Honest cross-validated mean IoU of 0.842 on the PASCAL VOC scribble subset, trained from scratch with no pretrained weights. The pipeline averages two five-fold ensembles of small U-Nets with multi-scale test-time augmentation, then calibrates the result with a per-image threshold model.

## Results

| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| **CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release)** | **0.842** | **0.925** | **0.760** |

The progression reads top to bottom. Switching from per-image K-NN to a single globally trained U-Net is the largest jump, because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples (sketched below), which matters on a dataset this small. Averaging two seed twins removes some of the variance in any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images with predicted labels from the previous ensemble. The pseudo-labels are noisy (around 17% wrong on average), but the extra visual diversity wins by about 0.01 mIoU.
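
To make the recombination step concrete, here is a minimal CutMix sketch for segmentation pairs. It is not the repo's implementation; it only illustrates the key difference from classification CutMix, namely that the mask is cut and pasted together with the image rather than mixing labels:

```python
import torch

def cutmix(img_a, mask_a, img_b, mask_b, alpha=1.0):
    """Paste a random box from sample B into sample A, image and mask alike."""
    _, H, W = img_a.shape                      # (C, H, W) tensors
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut_h = int(H * (1 - lam) ** 0.5)          # box area proportional to 1 - lam
    cut_w = int(W * (1 - lam) ** 0.5)
    cy = torch.randint(H, (1,)).item()
    cx = torch.randint(W, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    out_img, out_mask = img_a.clone(), mask_a.clone()
    out_img[:, y0:y1, x0:x1] = img_b[:, y0:y1, x0:x1]
    out_mask[..., y0:y1, x0:x1] = mask_b[..., y0:y1, x0:x1]
    return out_img, out_mask
```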

For context: the original course leaderboard had 28 teams. This release would place in the top four. The winning team reached 0.868.

## Quick start

```bash
git clone https://github.com/enorenio/Challenge
cd Challenge
pip install torch numpy pillow opencv-python scipy

hf download Enorenio/scribble-segmentation --local-dir runs/

python predict_ensemble.py \
    --ckpt-dirs runs/runs_v4:64:44 runs/runs_v7_pseudo:64:47 \
    --gpu 0
```

The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions from every method on all 682 train and test images, plus an analysis of the five universally hardest cases.

## What is in this repo

Two sets of five fold checkpoints (one per cross-validation fold) plus a tiny threshold model:

| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |
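
For intuition, here is a minimal sketch of what a per-image threshold model of this shape could look like. The JSON layout (`"w"`, `"b"`) and the five features (mean, standard deviation, and three quantiles of the probability map) are illustrative assumptions; the repo's `threshold_predictor.json` defines its own features and format:

```python
import json

import numpy as np

def predict_threshold(prob, weights_path="threshold_predictor.json"):
    """Per-image binary cutoff from a linear model over probability statistics."""
    with open(weights_path) as f:
        params = json.load(f)                 # assumed {"w": [...], "b": ...}
    # Hypothetical five features summarizing the ensemble probability map.
    feats = np.array([prob.mean(), prob.std(),
                      *np.quantile(prob, [0.25, 0.5, 0.75])])
    t = float(np.dot(params["w"], feats) + params["b"])
    return float(np.clip(t, 0.05, 0.95))      # keep the cutoff in a sane range
```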

At inference time, all ten checkpoints predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
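
Below is a minimal PyTorch sketch of that loop. The `ensemble_predict` name and exact tensor shapes are mine, not the repo's; each model is assumed to map the 5-channel input to per-pixel logits, the per-image cutoff from the threshold model is passed as `threshold`, and the morphological cleanup step is omitted:

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, rgb, scribbles, threshold=0.5,
                     scales=(0.7, 1.0, 1.3)):
    """Average multi-scale, flipped predictions from all ten checkpoints.

    rgb:       (1, 3, H, W) float tensor in [0, 1]
    scribbles: (1, 2, H, W) one-hot background/foreground scribble channels
    """
    x = torch.cat([rgb, scribbles], dim=1)          # 5-channel input
    _, _, H, W = x.shape
    prob_sum, n = torch.zeros(1, 1, H, W), 0
    with torch.no_grad():
        for model in models:
            for s in scales:
                xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                                   align_corners=False)
                for flip in (False, True):
                    xi = torch.flip(xs, dims=[3]) if flip else xs
                    p = torch.sigmoid(model(xi))    # (1, 1, h, w) probabilities
                    if flip:
                        p = torch.flip(p, dims=[3])
                    prob_sum += F.interpolate(p, size=(H, W), mode="bilinear",
                                              align_corners=False)
                    n += 1
    mask = (prob_sum / n > threshold).float()
    # Hard-snap any scribbled pixel to its user-given label.
    mask[scribbles[:, 1:2] > 0] = 1.0               # foreground scribbles
    mask[scribbles[:, 0:1] > 0] = 0.0               # background scribbles
    return mask
```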

## Model details

The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs are five channels: three RGB and two one-hot scribble channels (one marks background scribbles, the other foreground scribbles). Output is per-pixel foreground probability after a sigmoid.
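
To make the input format concrete, a small sketch of packing RGB plus scribbles into the five channels. The scribble map encoding (0 = background scribble, 1 = foreground scribble, 255 = unscribbled) is an assumption for illustration; the repo may encode it differently:

```python
import numpy as np

def pack_input(rgb, scribble):
    """Assemble the 5-channel network input from an image and a scribble map.

    rgb:      (H, W, 3) uint8 image
    scribble: (H, W) int map, 0 = background, 1 = foreground, 255 = unscribbled
    """
    img = rgb.astype(np.float32) / 255.0
    bg = (scribble == 0).astype(np.float32)   # one-hot background channel
    fg = (scribble == 1).astype(np.float32)   # one-hot foreground channel
    x = np.concatenate([img.transpose(2, 0, 1), bg[None], fg[None]], axis=0)
    return x  # (5, H, W), ready for torch.from_numpy(x).unsqueeze(0)
```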

Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
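
A minimal sketch of that loss, assuming logits and targets of shape (B, 1, H, W) and a smoothing constant of 1 in the Dice term (the repo's exact smoothing is not documented here):

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1.0):
    """Equal-weight sum of BCE and soft Dice over a batch of binary masks."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + smooth) / (denom + smooth)
    return bce + dice.mean()
```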

The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
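
The pseudo-labeling pass itself reduces to running the earlier ensemble over the unlabeled images and saving the thresholded masks as training targets. A sketch, reusing the hypothetical `ensemble_predict` from above; the loader yielding (name, rgb, scribbles) triples is also assumed:

```python
from pathlib import Path

import numpy as np
from PIL import Image

def make_pseudo_labels(models, unlabeled_loader, out_dir="pseudo_labels"):
    """Save ensemble predictions on unlabeled images as 0/255 PNG masks."""
    Path(out_dir).mkdir(exist_ok=True)
    for name, rgb, scribbles in unlabeled_loader:   # assumed triples
        mask = ensemble_predict(models, rgb, scribbles)
        arr = (mask[0, 0].numpy() * 255).astype(np.uint8)
        Image.fromarray(arr).save(Path(out_dir) / f"{name}.png")
```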

## Strengths and weaknesses

Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.

Three kinds of cases break it. Low-contrast figure-ground, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends. Cluttered scenes where many objects resemble the target, like a bicycle frame surrounded by other metal parts in a junkyard. Thin or articulated structures where parts of one object look disconnected, like the spokes and frame segments of a bicycle. The "Hardest 5" tab in the demo walks through specific examples of each.

## Limitations

**Binary only.** This model predicts foreground vs. background, not multi-class semantic segmentation.

**Scribbles required.** Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.

**Trained from scratch.** The original course rules forbade pretrained encoders. With a pretrained backbone the same pipeline would likely add five to ten mIoU points.

**PASCAL VOC domain.** Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.

## Citation

```bibtex
@misc{morshnev2025scribbleseg,
  author = {Aleksey Morshnev},
  title  = {Scribble Segmentation Ensemble},
  year   = {2025},
  url    = {https://github.com/enorenio/Challenge}
}
```

## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.