TongkunGuan/TokenFD


TongkunGuan committed on Feb 21

Commit f8af4fc · verified · 1 Parent(s): 33a506b

Update README.md

Files changed (1)

README.md +44 -4


@@ -52,12 +52,52 @@ The comparisons with other visual foundation models:
 ## TokenOCR
-In the following table, we provide an overview of the InternViT 2.5 series.
-|        Model Name         |                                HF Link                                |
 | :-----------------------: | :-------------------------------------------------------------------: |
-| InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
-|  InternViT-6B-448px-V2_5  |  [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)  |
 ## TokenVL

 ## TokenOCR
+### Model Architecture
+The figure below gives an overview of TokenOCR, in which token-level image features and token-level language
+features are aligned within the same semantic space. This “image-as-text” alignment supports user-interactive
+applications, including text segmentation, retrieval, and visual question answering.
+<div align="center">
+  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/QTsvWxFJFTnISdhvbfZhD.png">
+</div>
+### Model Cards
+In the following table, we provide all models [🤗 link] of the TokenOCR series.
+|        Model Name         |                                Description                                |
 | :-----------------------: | :-------------------------------------------------------------------: |
+| TokenOCR-4096-English |  |
+|  TokenOCR-4096-Chinese  |    |
+|  TokenOCR-2048-Bilingual  |    |
+| TokenOCR-4096-English-seg |  |
+### Quick Start
+> \[!Warning\]
+> 🚨 Note: In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks.
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, CLIPImageProcessor
+model = AutoModel.from_pretrained(
+    'OpenGVLab/InternViT-300M-448px-V2_5',
+    torch_dtype=torch.bfloat16,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True).cuda().eval()
+image = Image.open('./examples/image1.jpg').convert('RGB')
+image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px-V2_5')
+pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
+pixel_values = pixel_values.to(torch.bfloat16).cuda()
+outputs = model(pixel_values)
+```
 ## TokenVL
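
Reviewer note on the Model Architecture text added in this commit: the “image-as-text” alignment it describes can be illustrated with a toy example. The sketch below is not TokenOCR's code or API. It only assumes that each image location and each text token is represented by a vector in a shared space, so that cosine similarity between them yields a similarity map that can be thresholded into a rough text segmentation mask. The function names, the toy features, and the 0.5 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(image_tokens, text_token):
    """Compare every image-token feature in an H x W grid with one text-token embedding."""
    return [[cosine(feat, text_token) for feat in row] for row in image_tokens]


def segment(sim_map, threshold=0.5):
    """Binarize a similarity map; the threshold is an illustrative assumption."""
    return [[1 if s > threshold else 0 for s in row] for row in sim_map]


# Hypothetical 2 x 2 grid of 2-d image-token features and one text-token embedding.
image_tokens = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
text_token = [1.0, 0.0]

sim = similarity_map(image_tokens, text_token)
mask = segment(sim)  # 1 where an image token is close to the query text token
```

In a real model the features would be tensors produced by the image and language encoders; plain lists are used here only to keep the sketch dependency-free.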