JackChew committed
Commit 582bcd5 · verified · 1 Parent(s): 7288e09

Update README.md

Files changed (1): README.md +136 -4
README.md CHANGED
@@ -1,10 +1,10 @@
  ---
  base_model: unsloth/Qwen2-VL-2B-Instruct
  tags:
- - text-generation-inference
  - transformers
- - unsloth
- - qwen2_vl
  license: apache-2.0
  language:
  - en
@@ -14,8 +14,140 @@ language:

  - **Developed by:** JackChew
  - **License:** apache-2.0
- - **Finetuned from model :** unsloth/Qwen2-VL-2B-Instruct

  This qwen2_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
  base_model: unsloth/Qwen2-VL-2B-Instruct
  tags:
+ - text-generation-inference
+ - text-extraction
  - transformers
+ - unsloth/Qwen2-VL-2B-Instruct-16Bit
+ # Base model: unsloth/Qwen2-VL-2B-Instruct-16Bit
  license: apache-2.0
  language:
  - en

  - **Developed by:** JackChew
  - **License:** apache-2.0
+ - **Finetuned from model:** unsloth/Qwen2-VL-2B-Instruct-16Bit

  This qwen2_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+
+
+ ## Model Description
+ **Qwen2-VL-2B-OCR** (built on 通义千问 / Qwen) is a fine-tuned variant of unsloth/Qwen2-VL-2B-Instruct, optimized specifically for Optical Character Recognition (OCR). It is designed to extract the complete text from images of documents, tables, and payslips, ensuring that no information is missed.
+
+ The model is trained to provide accurate text extraction with minimal loss of information, and it handles OCR tasks on complex documents with structured layouts, such as invoices and tables.
+
+ ## Intended Use
+ The primary purpose of the model is to extract data from images or documents, especially payslips and tables, without missing any critical details. It can be applied in domains such as payroll systems, finance, and legal document analysis, and in any field where document extraction is required.
+
+ Prompt example:
+ - **text**: The model works best with a prompt like `"Extract all text from image/payslip without miss anything"`.
+
+
+ ## Model Benchmark
+
+ After fine-tuning, the model has significantly improved at extracting all relevant sections from the payslip, including the previously missing **Deductions** section.
+
+ ### Example Output Comparison
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/676ed40d25c39d8bd5d6f759/KOAZouqb1qH7toZO6YZsO.png)
+
+ #### Fine-tuned Model:
+ Here is the extracted data from the payslip:
+
+ **Employee Information:**
+ - Date of Joining: 2018-06-23
+ - Pay Period: August 2021
+ - Employee Name: Sally Harley
+ - Designation: Marketing Executive
+ - Department: Marketing
+
+ **Earnings:**
+ | Earnings | Amount | Deductions | Amount |
+ |------------------|--------|-------------------|--------|
+ | Basic | 10000 | Provident Fund | 1200 |
+ | Incentive | 1000 | Professional Tax | 500 |
+ | House Rent | 400 | Loan | 400 |
+ | Meal Allowance | 200 | | 9500 |
+
+ **Total Earnings:** $11,600
+ **Total Deductions:** $2,100
+ **Net Pay:** $9,500
+
+ **Employer Signature**
+ **Employee Signature**
+
+ ---
+ #### Original Model:
+ The original model extracted the following data but missed the itemized **Deductions** section:
+
+ - **Date of Joining**: 2018-06-23
+ - **Pay Period**: August 2021
+ - **Employee Name**: Sally Harley
+ - **Designation**: Marketing Executive
+ - **Department**: Marketing
+ - **Earnings**:
+   - Basic: $10,000
+   - Incentive Pay: $1,000
+   - House Rent Allowance: $400
+   - Meal Allowance: $200
+ - **Total Earnings**: $11,600
+ - **Total Deductions**: $2,100
+ - **Net Pay**: $9,500
+ - **Employer Signature**: [Signature]
+ - **Employee Signature**: [Signature]
+ - **This is system-generated payslip**
+
+
+ ## Quick Start
+ Here’s an example code snippet to get started with this model:
+
+ ### Loading the Model and Processor
+ ```python
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ processor = AutoProcessor.from_pretrained("JackChew/Qwen2-VL-2B-OCR")
+ model = AutoModelForImageTextToText.from_pretrained("JackChew/Qwen2-VL-2B-OCR")
+ ```
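+
+ If GPU memory is limited, the weights can also be loaded in half precision. This is a minimal sketch using standard `transformers` options, not a configuration the model card itself prescribes:
+ ```python
+ import torch
+ from transformers import AutoModelForImageTextToText
+
+ # Load in float16 and let the layers be placed automatically (requires the `accelerate` package)
+ model = AutoModelForImageTextToText.from_pretrained(
+     "JackChew/Qwen2-VL-2B-OCR",
+     torch_dtype=torch.float16,
+     device_map="auto",
+ )
+ ```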
+ ### Loading an Image
+ ```python
+ from PIL import Image
+
+ # Load your image from a local path
+ image_path = "xxxxx"  # Replace with your image path
+ image = Image.open(image_path)
+ ```
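+
+ If the image lives at a URL rather than on disk, one common pattern (shown with a placeholder URL for illustration, not part of the original card) is to fetch it with `requests`:
+ ```python
+ import requests
+ from PIL import Image
+
+ # Hypothetical URL, for illustration only
+ url = "https://example.com/payslip.png"
+ image = Image.open(requests.get(url, stream=True).raw)
+ ```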
+ ### Preparing the Model, Preprocessing Inputs, and Performing Inference
+ ```python
+ # Move the model to the GPU
+ model = model.to("cuda")
+
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "extract all data from this payslip without miss anything"},
+         ],
+     }
+ ]
+
+ # Preprocess the inputs
+ text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+ # Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>extract all data from this payslip without miss anything<|im_end|>\n<|im_start|>assistant\n'
+
+ inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
+ inputs = inputs.to("cuda")
+
+ # Inference: generate the output, then strip the prompt tokens from each sequence before decoding
+ output_ids = model.generate(**inputs, max_new_tokens=2048)
+ generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
+ output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+ print(output_text)
+ ```
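+
+ Large, high-resolution scans can exhaust GPU memory, because Qwen2-VL maps image resolution to a variable number of visual tokens. Following the upstream Qwen2-VL processor options, the token budget can be bounded when loading the processor; the limits below are illustrative assumptions, not values recommended by this card:
+ ```python
+ from transformers import AutoProcessor
+
+ # Each 28x28-pixel patch corresponds to one visual token; these bounds are illustrative
+ min_pixels = 256 * 28 * 28
+ max_pixels = 1280 * 28 * 28
+ processor = AutoProcessor.from_pretrained(
+     "JackChew/Qwen2-VL-2B-OCR",
+     min_pixels=min_pixels,
+     max_pixels=max_pixels,
+ )
+ ```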
+
+ ## Model Fine-Tuning Details
+ The model was fine-tuned using the Unsloth framework, which accelerated training by 2x, together with Huggingface's TRL (Transformer Reinforcement Learning) library. LoRA (Low-Rank Adaptation) was applied so that only a small subset of the parameters needed to be fine-tuned, which significantly reduces training time and computational resources. Fine-tuning covered both vision and language layers, ensuring that the model can handle complex OCR tasks efficiently.
+
+ Total Trainable Parameters: 57,901,056
+
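+ The exact LoRA configuration is not published in this card. For orientation only, a generic sketch of such a setup with the `peft` library might look like the following; every hyperparameter here is an illustrative assumption:
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForImageTextToText
+
+ base = AutoModelForImageTextToText.from_pretrained("unsloth/Qwen2-VL-2B-Instruct")
+
+ # Illustrative LoRA hyperparameters -- the values actually used for this model are not published
+ config = LoraConfig(
+     r=16,
+     lora_alpha=16,
+     lora_dropout=0.0,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
+ )
+ model = get_peft_model(base, config)
+ model.print_trainable_parameters()  # compare against the reported 57,901,056
+ ```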