---
license: apache-2.0
datasets:
- unsloth/LaTeX_OCR
- linxy/LaTeX_OCR
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Math
- OCR
- Latex
- VLM
- Plain_Text
- ITT
---
# Qwen2-VL-OCR-2B-Instruct [ VL / OCR ]

![aaaaaaaaaaa.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/s42kASSQCoJAyYMJkoEuD.png)

The **Qwen2-VL-OCR-2B-Instruct** model is a fine-tuned version of **Qwen/Qwen2-VL-2B-Instruct**, tailored for tasks that involve **Optical Character Recognition (OCR)**, **image-to-text conversion**, and **math problem solving with LaTeX formatting**. This model integrates a conversational approach with visual and textual understanding to handle multi-modal tasks effectively.

#### Key Enhancements:

* **SoTA understanding of images of various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

* **Agent that can operate mobile phones, robots, etc.**: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.

* **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

| **File Name**             | **Size**   | **Description**                                 | **Upload Status** |
|---------------------------|------------|------------------------------------------------|-------------------|
| `.gitattributes`          | 1.52 kB   | Configures LFS tracking for specific model files. | Initial commit    |
| `README.md`               | 203 Bytes | Minimal details about the uploaded model.       | Updated           |
| `added_tokens.json`       | 408 Bytes | Additional tokens used by the model tokenizer.  | Uploaded          |
| `chat_template.json`      | 1.05 kB   | Template for chat-based model input/output.     | Uploaded          |
| `config.json`             | 1.24 kB   | Model configuration metadata.                   | Uploaded          |
| `generation_config.json`  | 252 Bytes | Configuration for text generation settings.     | Uploaded          |
| `merges.txt`              | 1.82 MB   | BPE merge rules for tokenization.               | Uploaded          |
| `model.safetensors`       | 4.42 GB   | Serialized model weights in a secure format.    | Uploaded (LFS)    |
| `preprocessor_config.json`| 596 Bytes | Preprocessing configuration for input data.     | Uploaded          |
| `vocab.json`              | 2.78 MB   | Vocabulary file for tokenization.               | Uploaded          |

---
### How to Use

```python
import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen2-VL-OCR-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Qwen2-VL-OCR-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-OCR-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-OCR-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model (GPU or CPU)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
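
The `min_pixels` and `max_pixels` comments in the snippet above express a visual-token range as a pixel budget, at 28×28 pixels per token. A minimal sketch of that arithmetic follows; `pixels_for_tokens` is a hypothetical helper for illustration, not part of `transformers`.

```python
# Each visual token is budgeted 28x28 pixels, as in the
# `256*28*28` / `1280*28*28` comments of the snippet above.
PATCH_SIDE = 28

def pixels_for_tokens(num_tokens: int) -> int:
    """Pixel budget for a given number of visual tokens (hypothetical helper)."""
    return num_tokens * PATCH_SIDE * PATCH_SIDE

min_pixels = pixels_for_tokens(256)
max_pixels = pixels_for_tokens(1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

These values could then be passed as `min_pixels=` and `max_pixels=` to `AutoProcessor.from_pretrained`, as in the commented-out line of the snippet.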
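
### Streaming Output Buffer

```python
def stream_output(streamer):
    """Yield the accumulated text as new chunks arrive from `streamer`."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Remove <|im_end|> or similar tokens from the output
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer
```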
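
To see what the buffering loop above does without loading the model, the sketch below runs the same logic over a stand-in list of chunks. The name `stream_clean` and the sample chunks are assumptions for illustration; in practice the chunks would come from whatever streamer drives generation.

```python
from typing import Iterable, Iterator

def stream_clean(chunks: Iterable[str], end_token: str = "<|im_end|>") -> Iterator[str]:
    """Yield the accumulated text after each chunk, with `end_token` removed."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        # Filter the whole buffer, not just the new chunk, so an end token
        # that arrives split across two chunks is still removed.
        buffer = buffer.replace(end_token, "")
        yield buffer

# Stand-in for a real streamer: the end token arrives split in two pieces.
snapshots = list(stream_clean(["E = ", "mc^2", "<|im_", "end|>"]))
print(snapshots[-1])  # E = mc^2
```

Note that an intermediate snapshot can still show a partial end token (here `<|im_`) until the rest of it arrives; only the final text is guaranteed to be clean.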
### **Key Features**

1. **Vision-Language Integration:**  
   - Combines **image understanding** with **natural language processing** to convert images into text.  

2. **Optical Character Recognition (OCR):**  
   - Extracts and processes textual information from images with high accuracy.

3. **Math and LaTeX Support:**  
   - Solves math problems and outputs equations in **LaTeX format**.

4. **Conversational Capabilities:**  
   - Designed to handle **multi-turn interactions**, providing context-aware responses.

5. **Image-Text-to-Text Generation:**  
   - Inputs can include **images, text, or a combination**, and the model generates descriptive or problem-solving text.

6. **Secure Weight Format:**  
   - Uses **Safetensors** for faster and more secure model weight loading.
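
As a rough illustration of the multi-turn and image-plus-text inputs described above, the sketch below builds a conversation in the same `messages` format used in the How to Use snippet. The helper names, the image path `equation.png`, and the assistant reply are hypothetical; representing assistant turns as text-only content entries is an assumption about the chat template, not something this card states.

```python
def user_turn(text, image=None):
    """Build one user message; `image` is an optional path or URL."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

def assistant_turn(text):
    """Build one assistant message (text-only; assumed format)."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}

messages = [
    user_turn("Transcribe the equation in this image as LaTeX.", image="equation.png"),
    assistant_turn(r"\int_0^1 x^2 \, dx"),  # placeholder reply, not real model output
    user_turn("Now solve it step by step."),
]
print([m["role"] for m in messages])  # ['user', 'assistant', 'user']
```

A list built this way can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as the single-turn `messages` list is in the How to Use snippet.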

---

### **Training Details**

- **Base Model:** [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)  
- **Model Size:**  
   - 2.21 billion parameters  
   - Weights stored in the **BF16** tensor type for efficient inference.

- **Specializations:**  
   - OCR tasks in images containing text.
   - Mathematical reasoning and LaTeX output for equations.
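
The parameter count and BF16 tensor type listed above give a quick estimate of the weight footprint. The sketch below is back-of-the-envelope arithmetic only: it assumes 2 bytes per BF16 parameter and decimal gigabytes, and it ignores file metadata and runtime overhead such as activations and the KV cache.

```python
NUM_PARAMS = 2.21e9      # parameter count listed above
BYTES_PER_PARAM = 2      # BF16 stores each value in 16 bits

weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9  # decimal GB
print(f"{weights_gb:.2f} GB")  # 4.42 GB
```

This agrees with the 4.42 GB listed for `model.safetensors` in the file table.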

---