
Tokenize Anything via Prompting

Ting Pan1,2*,  Lulu Tang2*,  Xinlong Wang2¶,  Shiguang Shan1

1ICT-CAS,   2BAAI
* Equal Contribution, ¶ Project Lead

[Paper] [🤗 Demo]

We present Tokenize Anything via Prompting (TAP), a unified, promptable model that simultaneously segments, recognizes, and captions arbitrary regions given flexible visual prompts (point, box, and sketch). The model is trained on exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained, 5-billion-parameter EVA-CLIP.
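
As a rough orientation for the prompting interface described above, the sketch below builds a model, attaches a concept vocabulary, and points to where prompting happens. The package layout, registry key, and helper names are assumptions taken from the GitHub quick start and may differ between releases; treat this as an outline rather than the confirmed API.

```python
# Hypothetical quick-start sketch -- the package layout, registry key, and helper
# names are assumptions and may differ between releases; consult the GitHub page
# for the full, version-matched inference loop.
from tokenize_anything import model_registry  # assumed package layout

# Build the ViT-L variant and load local weights (paths are placeholders).
model = model_registry["tap_vit_l"](checkpoint="models/tap_vit_l_v1_1.pkl")

# Attach a concept vocabulary so predicted regions can be named (assumed helpers).
model.concept_projector.reset_weights("concepts/merged_2560.pkl")
model.text_decoder.reset_cache(max_batch_size=8)

# From here, the repository's demo scripts preprocess an image, encode a point,
# box, or sketch prompt, and decode a mask, a concept label, and a caption for
# each prompted region.
```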

Installation

See the GitHub page.

Models

Model weights

The model is released in two versions (v1.0 and v1.1), each with a ViT-L or ViT-B image encoder.

| Model | Description | Schedule | MD5 | Weights |
|-------|-------------|----------|-----|---------|
| tap_vit_l | ViT-L TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | c1d41f | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 707f80 | 🤗 HF link |
| tap_vit_l | ViT-L TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | 03f8ec | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | b45cbf | 🤗 HF link |
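
If you prefer to fetch a checkpoint programmatically rather than through the links above, the snippet below uses `huggingface_hub` (a standard utility, not part of this repository). The repo ID and filename are placeholders; substitute the values behind the 🤗 HF links in the table.

```python
# Download a TAP checkpoint with huggingface_hub; repo_id and filename are
# placeholders -- copy the real values from the HF links in the table above.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="BAAI/tokenize-anything",      # placeholder repo ID
    filename="models/tap_vit_l_v1_1.pkl",  # placeholder filename
)
print(checkpoint_path)  # local cache path of the downloaded weights
```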

V1.1 Release Notes

  • Use a longer pre-training and fine-tuning schedule (improves segmentation and captioning performance).
  • Apply weight decay to all bias parameters to avoid FP16 overflow in the QK matmul (see the sketch after this list).
  • Sample point prompts from the predicted mask instead of the GT box during VG training.
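
The weight-decay change amounts to not exempting bias parameters from decay when building the optimizer. The sketch below shows that choice with standard PyTorch; the module and hyperparameters are placeholders, not the training configuration used for TAP.

```python
# Sketch of the v1.1 weight-decay change: biases are *not* exempted from decay.
# Standard PyTorch; the module and hyperparameters are placeholders.
import torch

model = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8)  # stand-in block

# A common recipe splits parameters so that biases (and norms) receive no decay;
# v1.1 instead applies decay to every parameter, which keeps bias magnitudes
# bounded and helps avoid FP16 overflow in the QK matmul.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, weight_decay=0.1  # placeholder hyperparameters
)
```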

Concept weights

Note: you can generate these weights by following the Concept Guide.

| Concept | Description | Weights |
|---------|-------------|---------|
| Merged-2560 | Merged concepts | 🤗 HF link |
| LVIS-1203 | LVIS concepts | 🤗 HF link |
| COCO-80 | COCO concepts | 🤗 HF link |
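
Conceptually, a concept weight file stores one text embedding per category name, and a region is recognized by matching its predicted semantic embedding against those vectors. The sketch below illustrates that matching step with plain NumPy; the array names, dimensions, and file layout are illustrative assumptions, not the format produced by the Concept Guide.

```python
# Illustrative open-vocabulary matching: score one region embedding against a
# bank of concept embeddings by cosine similarity. Shapes and names are
# assumptions for demonstration, not the actual concept-weight file format.
import numpy as np

concept_names = ["person", "dog", "bicycle"]                 # e.g. COCO-80 entries
concept_embeds = np.random.randn(len(concept_names), 1024)   # one vector per concept
region_embed = np.random.randn(1024)                         # predicted semantic token

# L2-normalize so the dot product equals cosine similarity.
concept_embeds /= np.linalg.norm(concept_embeds, axis=1, keepdims=True)
region_embed /= np.linalg.norm(region_embed)

scores = concept_embeds @ region_embed
print(concept_names[int(np.argmax(scores))])  # best-matching concept for the region
```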

License

Apache License 2.0

Citation

@article{pan2023tap,
  title={Tokenize Anything via Prompting},
  author={Pan, Ting and Tang, Lulu and Wang, Xinlong and Shan, Shiguang},
  journal={arXiv preprint arXiv:2312.09128},
  year={2023}
}

Acknowledgement

We thank the following repositories: SAM, EVA, LLaMA, FlashAttention, Gradio, Detectron2, and CodeWithGPU.
