InstinctSAM — ViT-B vision encoder for compressed SAM3 (concept / box / point)
Distilled ViT-B/16 vision encoder that recovers SAM3's three promptable-segmentation paths (text/concept, box, point) at ≥80% of the teacher's score on each path, using a ~99M-parameter vision encoder that is ~4.7× smaller than SAM3's ~463M-parameter PE-ViT.
Code & full write-up: https://github.com/william-Dic/InstinctSAM (see docs/CONCEPT_DISTILL.md).
Results (held-out COCO val2017; teacher = official facebook/sam3)
| Vision encoder | Params | Stage-A cos | Concept IoU (n=500) | Box mIoU | Point mIoU |
|---|---|---|---|---|---|
| SAM3 teacher | ~463M | 1.00 | 0.798 (100%) | 0.940 (100%) | 0.676 (100%) |
| TinyViT-11M | ~28M | 0.66 | 0.513 (64%) | 0.743 (79%) | 0.521 (77%) |
| TinyViT-21M | ~31M | 0.66 | 0.584 (73%) | 0.743 (79%) | 0.518 (77%) |
| ViT-B/16 (this) | ~99M | 0.75 | 0.751 (94%) | 0.805 (86%) | 0.545 (81%) |
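The percentages in parentheses and the ~4.7× size ratio can be re-derived from the table itself. The short check below copies the table values by hand; the dictionary layout and function names are illustrative only and are not part of the repository.

```python
# Values copied from the results table above (params in millions).
TEACHER = {"params_m": 463, "concept": 0.798, "box": 0.940, "point": 0.676}
STUDENTS = {
    "TinyViT-11M": {"params_m": 28, "concept": 0.513, "box": 0.743, "point": 0.521},
    "TinyViT-21M": {"params_m": 31, "concept": 0.584, "box": 0.743, "point": 0.518},
    "ViT-B/16": {"params_m": 99, "concept": 0.751, "box": 0.805, "point": 0.545},
}
METRICS = ("concept", "box", "point")


def percent_of_teacher(student, teacher=TEACHER):
    """Each student metric as a rounded percentage of the teacher's metric."""
    return {m: round(100 * student[m] / teacher[m]) for m in METRICS}


def size_ratio(student, teacher=TEACHER):
    """How many times smaller the student vision encoder is than the teacher's."""
    return teacher["params_m"] / student["params_m"]


for name, row in STUDENTS.items():
    print(name, percent_of_teacher(row), f"{size_ratio(row):.1f}x smaller")
```

For the ViT-B/16 row this gives 94 / 86 / 81 percent and a 4.7× ratio, matching the figures quoted in the table and the summary.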
Key finding: SAM3 concept distillation is bound by vision-encoder capacity, not by text, data, or objective. TinyViT saturates at a cosine similarity of 0.66 to the teacher's features, and both output-KD and region-text alignment plateau at ~0.58 concept IoU on TinyViT-21M. A higher-fidelity ViT-B (Stage-A cosine 0.75) raises concept IoU to 0.751 (94% of teacher).
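This README does not spell out how the Stage-A cosine is computed. One common definition, assumed here, is the cosine similarity between student and teacher channel vectors at each spatial position, averaged over all positions. The sketch below uses NumPy and a toy feature map in place of the real (1024, 72, 72) trunk feature; `mean_channel_cosine` is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def mean_channel_cosine(student, teacher, eps=1e-8):
    """Mean cosine similarity over spatial positions for (C, H, W) feature maps.

    Assumption: similarity is taken along the channel axis at each (h, w)
    location and then averaged. The repo's exact metric may differ.
    """
    if student.shape != teacher.shape:
        raise ValueError(f"shape mismatch: {student.shape} vs {teacher.shape}")
    channels = student.shape[0]
    s = student.reshape(channels, -1)  # (C, H*W)
    t = teacher.reshape(channels, -1)
    dot = (s * t).sum(axis=0)
    norms = np.linalg.norm(s, axis=0) * np.linalg.norm(t, axis=0)
    return float((dot / np.maximum(norms, eps)).mean())


rng = np.random.default_rng(0)
teacher = rng.standard_normal((16, 4, 4))  # toy stand-in for a (1024, 72, 72) feature
student = teacher + 0.5 * rng.standard_normal(teacher.shape)  # noisy copy

print(mean_channel_cosine(teacher, teacher))  # identical features, so close to 1.0
print(mean_channel_cosine(student, teacher))  # lower, since the student is perturbed
```

Under this definition, a value of 0.66 versus 0.75 means the student's per-location feature directions agree less or more closely with the teacher's, independent of feature magnitude.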
Files (our original weights only — NO Meta-gated SAM3 weights)
- vit_base_stageA.pt: ViT-B trunk after Stage-A feature distillation to the teacher's (1024, 72, 72) trunk feature (cosine 0.75).
- concept_vitb_trunk_step6000.pt: ViT-B trunk after end-to-end concept self-distillation (the final model's vision encoder). This is the trunk that produces the results above.
Both checkpoints are dicts of the form {'trunk': state_dict, 'backbone': 'vit_base', 'model_name': 'base'}.
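A minimal sketch of reading one of these checkpoints, assuming the files were written with `torch.save` and follow the dict layout described above. `check_checkpoint` and `load_trunk` are hypothetical helpers, not repository functions; the PyTorch import is kept inside `load_trunk` so the structural check can run without PyTorch installed.

```python
EXPECTED_KEYS = {"trunk", "backbone", "model_name"}


def check_checkpoint(ckpt):
    """Validate the documented dict layout and return the trunk state_dict."""
    missing = EXPECTED_KEYS - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    if ckpt["backbone"] != "vit_base" or ckpt["model_name"] != "base":
        raise ValueError(
            f"unexpected backbone/model_name: {ckpt['backbone']!r}/{ckpt['model_name']!r}"
        )
    return ckpt["trunk"]


def load_trunk(path):
    """Load a checkpoint from disk (assumes it was saved with torch.save)."""
    import torch  # imported lazily; only needed when reading a real file

    ckpt = torch.load(path, map_location="cpu")
    return check_checkpoint(ckpt)


# Structural check on a stand-in dict (no real weights involved).
demo = {"trunk": {"patch_embed.weight": None}, "backbone": "vit_base", "model_name": "base"}
print(sorted(check_checkpoint(demo)))
```

With a downloaded file, `load_trunk("concept_vitb_trunk_step6000.pt")` would return the trunk `state_dict`; turning that into a runnable model still requires the repository's model code and the gated SAM3 heads described below.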
License / how to assemble the full model
These files contain only the ViT-B vision-encoder weights we trained — they do not include SAM3's detector / mask-decoder / scoring / presence heads, which are Meta-gated under the SAM License.
To run the full model, obtain SAM3 from Meta (gated) and assemble per the repo:
# 1. get SAM3 (Meta-gated) -> checkpoints/official/sam3.pt
# 2. build merged init (SAM3 heads + this ViT-B trunk + distilled MobileCLIP-S1 text)
python src/build_merged_stage3.py --backbone vit_base --model_name base \
--trunk vit_base_stageA.pt --text_type MobileCLIP-S1 --out merged_vitb.pt
# 3. (optional) reproduce concept training end-to-end
bash scripts/run_vitb_full.sh
# eval:
python src/eval_concept_es3.py --ckpt <merged-or-concept.pt> --backbone vit_base --model_name base \
--text_type MobileCLIP-S1 --ann annotations/instances_val2017_eval500.json --n_images 500
Method (brief)
End-to-end self-distillation of SAM3's concept/DETR path (teacher↔student Hungarian detection matching + soft KD + CLoCKDistill encoder-memory KD + region-text alignment + EMA), on top of a ViT-B trunk feature-distilled to the teacher. The teacher supplies all targets; no ground-truth labels are used.
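Two of the ingredients listed above, soft KD and EMA, have widely used generic forms, sketched here in NumPy. This is an illustration under stated assumptions and not the repository's loss: it uses the standard temperature-softened KL divergence over softmax logits, whereas the SAM3 concept path may score queries differently, and the temperature and decay values are placeholders. Hungarian matching, CLoCKDistill memory KD, and region-text alignment are not covered.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def soft_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation; temperature=2.0 is a placeholder value.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean() * temperature**2)


def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of parameters; decay=0.999 is a placeholder."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in ema_params}


teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student_logits = np.array([[1.0, 1.0, -0.5], [0.0, 0.5, 2.0]])
print(soft_kd_loss(student_logits, teacher_logits))  # positive: student differs
print(soft_kd_loss(teacher_logits, teacher_logits))  # ~0: distributions match

ema = ema_update({"w": np.array([1.0])}, {"w": np.array([0.0])}, decay=0.9)
print(ema["w"])  # one EMA step moves 10% of the way toward the new value
```

Because the teacher supplies every target, a loss of this kind needs no ground-truth labels, which is consistent with the label-free setup described above.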