---
language:
- en
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
---
### exl2 quant (measurement.json in main branch)
---
### check revisions for quants
---

<img alt="olmOCR Logo" src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/olmocr/olmocr.png" width="242px" style="margin-left: auto; margin-right: auto; display: block;">

# olmOCR-7B-0225-preview

This is a preview release of the olmOCR model, fine-tuned from Qwen2-VL-7B-Instruct using the
[olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) dataset.

Quick links:
- 📃 [Paper](https://olmocr.allenai.org/papers/olmocr.pdf)
- 🤗 [Dataset](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
- 🛠️ [Code](https://github.com/allenai/olmocr)
- 🎮 [Demo](https://olmocr.allenai.org/)

The best way to use this model is via the [olmOCR toolkit](https://github.com/allenai/olmocr).
The toolkit comes with an efficient inference setup via sglang that can handle millions of documents
at scale.

## Usage

This model expects as input a single document image, rendered so that its longest dimension is 1024 pixels.

The prompt must also contain additional metadata extracted from the document; the easiest way to generate this
is with the methods provided by the [olmOCR toolkit](https://github.com/allenai/olmocr).
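
If you are starting from a page image rather than a PDF, a minimal sketch of that resizing step (using Pillow; the `page.png` filename is just a placeholder) could look like the following. When starting from a PDF, the toolkit's `render_pdf_to_base64png` used in the full example below handles this for you.

```python
from PIL import Image

# Hypothetical input file; substitute your own rendered page image.
image = Image.open("page.png")

# Scale so that the longest side is 1024 pixels, as the model expects.
scale = 1024 / max(image.size)
new_size = (round(image.width * scale), round(image.height * scale))
resized = image.resize(new_size, Image.LANCZOS)
resized.save("page_1024.png")
```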

## Manual Prompting

If you want to prompt this model manually instead of using the [olmOCR toolkit](https://github.com/allenai/olmocr), please see the code below.

In normal usage, the olmOCR toolkit builds the prompt by rendering the PDF page and
extracting relevant text blocks and image metadata. To duplicate that, you will need to

```bash
pip install olmocr
```

and then run the following sample code.

```python
import torch
import base64
import urllib.request

from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")

# Render page 1 to an image
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Build the prompt, using document metadata
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,
)

# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)

print(text_output)
# ['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
```
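
With `max_new_tokens=50` the generation is cut off mid-document, as in the sample output above; raise it (e.g. to a few thousand tokens) to transcribe a full page. The model responds with a JSON object whose `natural_text` field holds the transcribed text, so a small post-processing sketch (assuming the generation completed and is valid JSON) might look like this:

```python
import json

# text_output comes from the decoding step above; take the first returned sequence.
raw = text_output[0]

# Parse the model's JSON response and pull out the page text.
# Field names follow the sample output shown above.
page = json.loads(raw)
print(page["natural_text"])
```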

## License and use

olmOCR is licensed under the Apache 2.0 license.
olmOCR is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).