REG2026 Interface-1 (Workflow Reasoning / Metric A): Submission Container
daipath/reg2026-submit · self-contained Docker algorithm for the MICCAI REG2026 (REG²)
challenge, Interface-1. Given one anonymized whole-slide image it produces the
chain-of-thought + final pathology report that the official Metric A scorer expects.
Honest held-out Metric A (val 1019, organ predicted, no ground truth used): 0.9037 (5-TransMIL ensemble). Single-model variant: 0.8950. GT-organ oracle upper bound: 0.9062.
0. What is in this repo
| File | Size | What |
|---|---|---|
| reg2026_submit.tar | ~2.7 GB | the complete, self-contained submission container (code + all weights + sample WSI) |
| README.md | - | this document |
tar -xf reg2026_submit.tar gives the folder reg2026_submit/:
reg2026_submit/
├── Dockerfile  requirements.txt    # pytorch:2.9.1-cuda12.6 base + numpy/Pillow/tifffile/imagecodecs/zarr/opencv/timm
├── inference.py  core.py           # official platform contract (unchanged); MODEL_PATH=/opt/ml/model
├── build_debug.sh                  # one-shot: build image + run on the sample WSI + validate output
├── batch_infer.py  dryrun.py       # offline (non-docker) batch / single-slide runners
├── src/interf1/model.py            # predict_chain_of_thought(wsi_path): the whole pipeline
├── src/reg/ {patching,uni2,mil,organ,cot,derive_heads,report_gen}.py
├── model/                          # copied to /opt/ml/model in the image
│   ├── uni2-h.bin                  # UNI2-h encoder weights (~2.7 GB)
│   ├── mil_transmil_s0..s4.pt      # 5-seed TransMIL ensemble (~23 MB each)
│   ├── organ_clf.npz               # organ classifier (numpy: scaler + logistic-regression)
│   ├── routing_smart.json          # CoT routing rules (2375 entries)
│   ├── label_space.json            # answer options per head
│   └── canonical_questions.json    # normalized -> canonical question strings
└── test/input/interf1/...          # the official debug WSI d021e460 + inputs.json
1. Can it be run directly? (Quickstart)
Requirements on the target machine: Docker daemon access + an NVIDIA GPU +
nvidia-container-toolkit; ~10 GB disk for the image; ≥12 GB GPU RAM.
# download (private repo -> needs your HF token)
hf download daipath/reg2026-submit reg2026_submit.tar --repo-type model --local-dir . --token <HF_TOKEN>
tar -xf reg2026_submit.tar && cd reg2026_submit
# build + debug on the sample slide in one shot
bash build_debug.sh
# -> builds image `reg2026_algorithm`, runs it on test/input/interf1, prints a validated summary
Manual run on any case folder (must contain images/whole-slide-image/<uid>.tiff):
docker run --rm --gpus all --platform=linux/amd64 \
-v "$PWD/test/input/interf1:/input:ro" -v "$PWD/test/output:/output" reg2026_algorithm
cat test/output/chain-of-thought.json
Expected on the debug slide: ~43-step CoT, organ=Breast, exactly one terminal step, a
final-report step reading Breast, core needle biopsy; 1. Invasive carcinoma NST grade I ....
No-Docker smoke test (e.g. on the training server, conda env with torch+timm+zarr):
REG_MODEL_PATH=./model python dryrun.py <some>.tiff, or batch a folder with
batch_infer.py --indir <dir> --outdir <dir> --ckpts model/mil_transmil_s0.pt,....
2. Model / pipeline
predict_chain_of_thought(wsi_path) in src/interf1/model.py:
WSI .tiff (single-level tiled JPEG, 20x)
├─ patch_wsi_path  : tissue segmentation (HSV-saturation Otsu) + 256-px grid @ tissue≥0.25,
│                    read straight from the tiled tiff via tifffile+zarr (memory-bounded; a
│                    multi-GB level-0 array would OOM under a full imread). Cap to ≤8192
│                    patches (RandomState(0)) before reading pixels.
├─ UNI2-h          : 1536-d feature per patch (timm vit_giant_patch14_224, SwiGLU, fp16)
├─ organ classifier: mean|max|std pool (4608-d) -> StandardScaler -> logistic regression
│                    -> coarse organ (7-way). [val acc 0.986]
├─ MIL ensemble    : 5× TransMIL multi-head, FiLM-conditioned on the predicted organ,
│                    softmax-averaged -> 85 head answers
├─ derive heads    : rule-based (Gleason->grade group, Nottingham->overall grade, ...)
├─ report_gen      : rule-based structured pathology report from the answered heads
└─ assemble_edges  : per-organ CoT routing (routing_smart.json) -> canonical Q/A/next steps,
                     final-report node carries the report, last next_question = ""
The organ-head fix (important)
The MIL was trained with the ground-truth organ fed via FiLM, so its "What is the organ?"
head learned to echo the FiLM input rather than read the tissue. Queried at inference with a
placeholder organ it returns the same class for every slide (measured: 1019/1019 "breast",
acc 0.215). Our offline 0.90+ numbers had silently relied on the GT organ for both FiLM and
routing. The container therefore predicts the organ with a separate logistic-regression
classifier on pooled UNI2 features (no FiLM, val acc 0.986) and feeds that prediction to
FiLM + routing + the organ answer. Net effect on Metric A vs the GT-organ oracle: only −0.0025.
Metric A (official scorer, semantic_backend=lexical)
Metric A = 0.05·BPV + 0.30·Edge-F1 + 0.25·MESS + 0.40·FinalReport.
| config (cap 8192) | organ | Metric A | BPV / Edge-F1 / MESS / Report |
|---|---|---|---|
| 5-TransMIL ensemble | GT (oracle) | 0.9062 | 0.779 / 0.980 / 0.930 / 0.852 |
| 5-TransMIL ensemble (default) | predicted | 0.9037 | 0.773 / 0.977 / 0.926 / 0.851 |
| single TransMIL | predicted | 0.8950 | 0.753 / 0.974 / 0.920 / 0.838 |
Measured on the held-out 10% val split (1019 slides, all with GT CoT + report). On the real Test Phase 1 set (350 slides, no GT) the pipeline runs clean: 350/350 outputs valid, organ distribution sensible (87 cervix, roughly matching the 70 uterus slides; remainder spread evenly), reports organ-appropriate.
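As a sanity check on the table, the weighted sum can be recomputed from the four component scores. This is only a sketch of the final aggregation step, not the official scorer (which also computes the components themselves); the function name `metric_a` and the dictionary keys are illustrative.

```python
# Weights copied from the Metric A definition above; they sum to 1.0.
WEIGHTS = {"bpv": 0.05, "edge_f1": 0.30, "mess": 0.25, "report": 0.40}


def metric_a(bpv, edge_f1, mess, report):
    """Weighted sum of the four Metric A components (each in [0, 1])."""
    parts = {"bpv": bpv, "edge_f1": edge_f1, "mess": mess, "report": report}
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Component scores as reported in the table (rounded to three decimals).
rows = {
    "5-TransMIL, GT organ": (0.779, 0.980, 0.930, 0.852),
    "5-TransMIL, predicted organ": (0.773, 0.977, 0.926, 0.851),
    "single TransMIL, predicted organ": (0.753, 0.974, 0.920, 0.838),
}

for name, components in rows.items():
    print(f"{name}: {metric_a(*components):.4f}")
```

Because the components in the table are rounded, the recomputed values agree with the reported Metric A only to about three decimal places.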
3. How it was trained
Data. REG2026 train_CoT.json = 11220 annotated cases, each with a full chain-of-thought
(median 16 steps) and a final pathology report (99.9%). Split (split3.json):
train 8176 / val 1019 / test 1019. Organs are 7 coarse classes
(breast, colon, stomach, prostate, bladder, lung, cervix).
1) Patching (wsi_patching.py). 256-px tiles at 20x; tissue mask via HSV-saturation
Otsu + morphology + min-area; keep tiles with tissue fraction ≥ 0.25. Training read tiles
lazily with OpenSlide/pyvips read_region (memory-bounded). The container reproduces the
identical patch set via tifffile+zarr (verified: same coordinates and pixels).
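The grid-selection step can be illustrated with a few lines of NumPy. This is a minimal sketch, not the code in wsi_patching.py: it assumes a boolean tissue mask already exists at the same resolution as the tiles (the real pipeline may compute the mask on a downsampled image), it drops partial tiles at the right and bottom edges, and the name `select_tiles` is hypothetical.

```python
import numpy as np


def select_tiles(mask, tile=256, min_tissue=0.25):
    """Return (x, y) top-left coordinates of grid tiles with enough tissue.

    mask: 2-D boolean array, True where tissue was detected.
    Assumption: partial tiles at the right/bottom border are ignored.
    """
    height, width = mask.shape
    rows, cols = height // tile, width // tile
    # Crop to whole tiles, then view as (rows, tile, cols, tile).
    grid = mask[: rows * tile, : cols * tile].reshape(rows, tile, cols, tile)
    # Tissue fraction per tile = mean of the boolean mask inside the tile.
    fraction = grid.mean(axis=(1, 3))
    ys, xs = np.nonzero(fraction >= min_tissue)
    return np.stack([xs * tile, ys * tile], axis=1)


# Toy mask: one fully covered tile, one at exactly 25%, one at 12.5%.
mask = np.zeros((512, 768), dtype=bool)
mask[0:256, 0:256] = True        # fraction 1.0   -> kept
mask[256:512, 256:320] = True    # fraction 0.25  -> kept
mask[0:256, 512:544] = True      # fraction 0.125 -> dropped
coords = select_tiles(mask)
print(coords.tolist())
```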
2) Feature extraction (feature_extract.py). Each patch → UNI2-h (timm
vit_giant_patch14_224, SwiGLUPacked, SiLU, 8 register tokens, embed-dim 1536), pipeline
/255 → resize 224 bilinear → ImageNet norm → fp16. Per slide the patches are capped to 8192
with np.random.RandomState(0).choice (pack_cap.py), stored as capped_8192_uni2.h5
(per-slide [N≤8192, 1536]) plus pooled_8192_uni2.npz (mean|max|std → 4608-d).
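The capping and pooling steps are simple enough to sketch directly. The text only states that `np.random.RandomState(0).choice` is used; sampling without replacement and sorting the kept indices are assumptions made here, and the function names are illustrative rather than those in pack_cap.py.

```python
import numpy as np


def cap_patches(n_patches, cap=8192, seed=0):
    """Indices of patches to keep; all of them if the slide has <= cap."""
    if n_patches <= cap:
        return np.arange(n_patches)
    rng = np.random.RandomState(seed)
    # Assumption: sample without replacement and keep the original order.
    return np.sort(rng.choice(n_patches, size=cap, replace=False))


def pool_features(feats):
    """Concatenate mean | max | std over the patch axis: [N, D] -> [3 * D]."""
    return np.concatenate([feats.mean(axis=0), feats.max(axis=0), feats.std(axis=0)])


# Toy example with random stand-ins for the 1536-d UNI2-h features.
keep = cap_patches(20000)
feats = np.random.RandomState(1).rand(100, 1536).astype(np.float32)
pooled = pool_features(feats)
print(len(keep), pooled.shape)
```

A fixed seed makes the kept subset a deterministic function of the patch count, which is what allows the container to reproduce the training-time patch set.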
3) Multi-head MIL (train_mil.py, MultiHeadMIL). proj Linear(1536→512)+ReLU+Dropout(0.25)
→ aggregator → emb → FiLM organ conditioning → 85 per-question linear heads.
- Aggregator: TransMIL (transformer pooling with internal patch cap agg_cap=8192). ABMIL (gated attention) / CLAM / MambaMIL are also implemented.
- FiLM: Embedding(7 organs, 512) → γ, β → emb·(1+γ)+β. Conditioned on the GT organ during training (helps the 84 non-organ heads share an organ-specific vocabulary).
- Loss: focal cross-entropy with per-class weights, summed only over the heads applicable to that slide (the questions the CoT actually asked). CLAM adds a 0.3 instance-clustering term.
- Optim: AdamW lr=1e-4, weight_decay=1e-4, CosineAnnealingLR, 40 epochs, early-stop patience 8, best checkpoint by weighted val accuracy.
- 5 seeds (s0..s4) → softmax-averaged ensemble at inference.
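Two of the pieces above, the FiLM modulation and the softmax-averaged ensemble, reduce to a few array operations. The sketch below uses NumPy rather than the PyTorch modules in train_mil.py, and the helper names are illustrative; it shows the arithmetic only, not the trained model.

```python
import numpy as np


def film(emb, gamma, beta):
    """FiLM modulation in the form given above: emb * (1 + gamma) + beta."""
    return emb * (1.0 + gamma) + beta


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensemble_predict(per_seed_logits):
    """Average per-seed class probabilities for one head, then take argmax."""
    probs = np.mean([softmax(np.asarray(l, dtype=float)) for l in per_seed_logits], axis=0)
    return int(probs.argmax()), probs


# With gamma = beta = 0 the slide embedding passes through unchanged.
emb = np.ones(512)
unchanged = film(emb, np.zeros(512), np.zeros(512))

# Three hypothetical seeds voting on a 3-class head.
label, probs = ensemble_predict([[2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
print(label, probs.round(3))
```

Averaging probabilities rather than logits keeps a single over-confident seed from dominating the vote, which is one common reason to prefer it for small ensembles.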
4) Organ classifier (organ.py / organ_clf.npz). The deployed organ predictor is a
multinomial logistic regression on the 4608-d pooled features (StandardScaler + LR). It was
trained on the train split with val held out (0.986 val accuracy); the shipped artifact is refit on train+val.
Exported as pure numpy (mean, scale, coef, intercept, classes), so no sklearn is needed at inference.
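Inference with the exported arrays needs only a standardisation, one matrix-vector product and a softmax. The sketch below assumes the arrays have the shapes scikit-learn would produce for a multinomial model (`coef` of shape [n_classes, n_features]); the function name and the toy arrays are illustrative, and the key names inside organ_clf.npz are not confirmed here.

```python
import numpy as np


def predict_organ(pooled, mean, scale, coef, intercept, classes):
    """Standardise, apply multinomial logistic regression, return (label, probs)."""
    x = (pooled - mean) / scale            # StandardScaler.transform
    logits = coef @ x + intercept          # shape [n_classes]
    logits = logits - logits.max()         # stabilise the softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    return classes[int(probs.argmax())], probs


# Toy 4-feature, 3-class model standing in for the real 4608-d, 7-class one.
mean = np.zeros(4)
scale = np.ones(4)
coef = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
intercept = np.zeros(3)
classes = np.array(["breast", "colon", "lung"])

label, probs = predict_organ(np.array([0.0, 5.0, 0.0, 0.0]), mean, scale, coef, intercept, classes)
print(label, probs.round(3))
```

With the real artifact, the arrays would be read via `np.load("model/organ_clf.npz")`, assuming the keys match the names listed above.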
5) Rule-based stages (no learning).
- derive_heads.py: derives 6 heads from predicted answers (Gleason pattern → grade group, Nottingham sub-scores → overall grade, differentiation, number-of-diagnoses, etc.).
- report_gen.py: assembles a structured pathology report from the answered heads.
- assemble_edges + routing_smart.json: per-organ deterministic CoT graph (2375 rules = base keys + on-demand context keys + ambiguous keys); pure-rule path validity is 1.0.
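Two of the derived heads follow standard grading conventions, so their rules can be sketched without reference to the code. The functions below implement the usual Gleason-to-grade-group and Nottingham mappings; they return integers, whereas the answer strings in derive_heads.py and label_space.json are not shown in this document, so the names and return types here are assumptions.

```python
def gleason_grade_group(primary, secondary):
    """Map primary + secondary Gleason patterns to grade group 1..5."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # 3+4 is grade group 2, 4+3 is grade group 3.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # Gleason score 9 or 10


def nottingham_grade(tubule, pleomorphism, mitoses):
    """Map three Nottingham sub-scores (each 1..3) to overall grade 1..3."""
    total = tubule + pleomorphism + mitoses  # ranges from 3 to 9
    if total <= 5:
        return 1
    if total <= 7:
        return 2
    return 3


print(gleason_grade_group(3, 4), gleason_grade_group(4, 3))
print(nottingham_grade(1, 2, 2), nottingham_grade(3, 3, 2))
```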
4. Variants & notes
- Default = 5-TransMIL ensemble (MIL_CKPTS in src/interf1/model.py). For a lighter, faster single-model build, set MIL_CKPTS = ["mil_transmil_s0.pt"] (Metric A 0.8950).
- Large slides: the biggest Test-Phase-1 slide decompresses to ~104 GB; patching is fully memory-bounded and multi-threaded (peak RSS ~2.3 GB on a 12.9 GB slide), so the container is not expected to OOM on large slides.
- The platform passes one slide per run at /input/images/whole-slide-image/<uid>.tiff and expects /output/chain-of-thought.json (a bare JSON array of {question, answer, next_question}, last next_question = "", canonical question strings, real newlines).
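The structural part of the output contract described above can be checked locally before submitting. The validator below is a minimal sketch: it checks only the array shape, the three required keys and the single trailing terminal step. It does not check that questions are canonical, and it is not the validation performed by build_debug.sh or the platform. The function name `validate_cot` is hypothetical.

```python
import json

REQUIRED_KEYS = {"question", "answer", "next_question"}


def validate_cot(steps):
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(steps, list) or not steps:
        return ["top level must be a non-empty JSON array"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(step)
        if missing:
            problems.append(f"step {i}: missing keys {sorted(missing)}")
    # Exactly one step may have an empty next_question, and it must be the last.
    terminal = [i for i, s in enumerate(steps)
                if isinstance(s, dict) and s.get("next_question") == ""]
    if terminal != [len(steps) - 1]:
        problems.append("expected exactly one terminal step, in last position")
    return problems


# Toy two-step chain; question and answer strings are placeholders.
good = json.loads('[{"question": "Q1", "answer": "A1", "next_question": "Q2"},'
                  ' {"question": "Q2", "answer": "A2", "next_question": ""}]')
bad = [{"question": "Q1", "answer": "A1", "next_question": ""},
       {"question": "Q2", "next_question": ""}]
print(validate_cot(good), validate_cot(bad))
```

To check a real run, load test/output/chain-of-thought.json with `json.load` and pass the result to `validate_cot`.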
5. Submitting to grand-challenge
Test Phase 1 ground truth is not public, so Metric A on those 350 slides is obtained only by
submitting this container; the held-out-val 0.9037 above is the realistic estimate. To submit,
build the image, then bash do_save.sh to produce the upload archive (or push the image per
the challenge instructions).
License of underlying data: CC-BY-NC-SA (REG2026). UNI2-h weights are governed by their own MahmoodLab license.