<div align="center">

<h1>Tokenize Anything via Prompting</h1>

[Ting Pan](https://github.com/PhyscalX/)<sup>1,2*</sup>, &nbsp; [Lulu Tang]()<sup>2*</sup>, &nbsp; [Xinlong Wang](https://www.xloong.wang/)<sup>†</sup>, &nbsp; [Shiguang Shan](https://scholar.google.com/citations?user=Vkzd7MIAAAAJ&hl=en)<sup>1</sup>

<sup>1</sup>[ICT-CAS](http://english.ict.cas.cn/), &nbsp; <sup>2</sup>[BAAI](https://www.baai.ac.cn/english.html)<br>
<sup>*</sup> Equal Contribution, <sup>†</sup>Project Lead

</div>

We present **T**okenize **A**nything via **P**rompting, a unified, promptable model that can simultaneously segment, recognize, and caption objects within arbitrary regions, relying only on visual prompts (point, box, and sketch). The model is trained on exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained EVA-CLIP with 5 billion parameters.

## Installation

See the [GitHub page](https://github.com/baaivision/tokenize-anything) for installation instructions.

## Models

### Model weights

Two versions of the model are available with different image encoders.

| Model | Description | Weights |
| ----- | ------------| ------ |
| **tap_vit_l** | ViT-L TAP model | [🤗 HF link](https://huggingface.co/BAAI/tokenize-anything/blob/main/models/tap_vit_l_03f8ec.pkl) |
| **tap_vit_b** | ViT-B TAP model | [🤗 HF link](https://huggingface.co/BAAI/tokenize-anything/blob/main/models/tap_vit_b_b45cbf.pkl) |
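
The links in the table open the file viewer (`blob`) pages. To fetch a checkpoint programmatically, a minimal sketch is shown below. It assumes the `huggingface_hub` package is installed for the download step; the helper names `resolve_url` and `download_weights` are hypothetical and not part of this project. The file paths are copied from the table above.

```python
from urllib.parse import quote

REPO_ID = "BAAI/tokenize-anything"

# File paths copied from the model table above.
MODEL_FILES = {
    "tap_vit_l": "models/tap_vit_l_03f8ec.pkl",
    "tap_vit_b": "models/tap_vit_b_b45cbf.pkl",
}


def resolve_url(filename: str, revision: str = "main") -> str:
    """Build the direct-download (`resolve`) URL for a file in the repository."""
    return f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{quote(filename)}"


def download_weights(model_name: str) -> str:
    """Download one checkpoint into the local Hugging Face cache; return its path."""
    # Imported lazily so that resolve_url works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILES[model_name])
```

Calling `download_weights("tap_vit_b")` should return the local path of the cached ViT-B checkpoint, while `resolve_url(MODEL_FILES["tap_vit_b"])` gives a URL suitable for tools such as `wget` or `curl`.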

### Concept weights

***Note***: You can generate these weights yourself by following the [Concept Guide](https://github.com/baaivision/tokenize-anything/blob/main/notebooks/concept.ipynb).

| Concept | Description | Weights |
| ------- | ------------| ------ |
| **Merged-2560** | Merged concepts | [🤗 HF link](https://huggingface.co/BAAI/tokenize-anything/blob/main/concepts/merged_2560.pkl) |
| **LVIS-1203** | LVIS concepts | [🤗 HF link](https://huggingface.co/BAAI/tokenize-anything/blob/main/concepts/lvis_1203.pkl) |
| **COCO-80** | COCO concepts | [🤗 HF link](https://huggingface.co/BAAI/tokenize-anything/blob/main/concepts/coco_80.pkl) |

## License
[Apache License 2.0](LICENSE)

## Citation

```
@article{pan2023tap,
  title={Tokenize Anything via Prompting},
  author={Pan, Ting and Tang, Lulu and Wang, Xinlong and Shan, Shiguang},
  journal={arXiv preprint arXiv:2312.yyyyy},
  year={2023}
}
```

## Acknowledgement

We thank the following repositories: [SAM](https://github.com/facebookresearch/segment-anything), [EVA](https://github.com/baaivision/EVA), [LLaMA](https://github.com/facebookresearch/llama), [FlashAttention](https://github.com/Dao-AILab/flash-attention), [Gradio](https://github.com/gradio-app/gradio), [Detectron2](https://github.com/facebookresearch/detectron2), and [CodeWithGPU](https://github.com/seetacloud/codewithgpu).