TongkunGuan committed on
Commit f8af4fc · verified · 1 Parent(s): 33a506b

Update README.md

Files changed (1): README.md (+44 −4)
README.md CHANGED
@@ -52,12 +52,52 @@ The comparisons with other visual foundation models:
 
 ## TokenOCR
 
-In the following table, we provide an overview of the InternViT 2.5 series.
+### Model Architecture
 
-| Model Name                | HF Link                                                                |
+An overview of the proposed TokenOCR, where the token-level image features and token-level language
+features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
+applications, including text segmentation, retrieval, and visual question answering.
+
+<div align="center">
+  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/QTsvWxFJFTnISdhvbfZhD.png">
+</div>
+
+### Model Cards
+
+In the following table, we provide all models [🤗 link] of the TokenOCR series.
+
+| Model Name                | Description                                                            |
 | :-----------------------: | :-------------------------------------------------------------------: |
-| InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
-| InternViT-6B-448px-V2_5   | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)   |
+| TokenOCR-4096-English     |                                                                        |
+| TokenOCR-4096-Chinese     |                                                                        |
+| TokenOCR-2048-Bilingual   |                                                                        |
+| TokenOCR-4096-English-seg |                                                                        |
+
+### Quick Start
+
+> \[!Warning\]
+> 🚨 Note: In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks.
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, CLIPImageProcessor
+
+model = AutoModel.from_pretrained(
+    'OpenGVLab/InternViT-300M-448px-V2_5',
+    torch_dtype=torch.bfloat16,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True).cuda().eval()
+
+image = Image.open('./examples/image1.jpg').convert('RGB')
+
+image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px-V2_5')
+
+pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
+pixel_values = pixel_values.to(torch.bfloat16).cuda()
+
+outputs = model(pixel_values)
+```
 
 ## TokenVL
103