---
language:
- en
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
---
### exl2 quant (measurement.json in main branch)
---
### check revisions for quants
---

<img alt="olmOCR Logo" src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/olmocr/olmocr.png" width="242px" style="margin-left: auto; margin-right: auto; display: block;">

# olmOCR-7B-0225-preview

This is a preview release of the olmOCR model, fine-tuned from Qwen2-VL-7B-Instruct using the
[olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) dataset.

Quick links:
- 📃 [Paper](https://olmocr.allenai.org/papers/olmocr.pdf)
- 🤗 [Dataset](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
- 🛠️ [Code](https://github.com/allenai/olmocr)
- 🎮 [Demo](https://olmocr.allenai.org/)

The best way to use this model is via the [olmOCR toolkit](https://github.com/allenai/olmocr).
The toolkit comes with an efficient inference setup via sglang that can handle millions of documents
at scale.

## Usage

This model expects as input a single document image, rendered so that its longest dimension is 1024 pixels.

The prompt must also contain additional metadata extracted from the document; the easiest way to generate this
is with the methods provided by the [olmOCR toolkit](https://github.com/allenai/olmocr).
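
If you are starting from a page image rather than a PDF, a minimal sketch of that resizing step (using Pillow; the `page.png` filename is just a placeholder) could look like the following. When starting from a PDF, the toolkit's `render_pdf_to_base64png` used in the full example below handles this for you.

```python
from PIL import Image

# Hypothetical input file; substitute your own rendered page image.
image = Image.open("page.png")

# Scale so that the longest side is 1024 pixels, as the model expects.
scale = 1024 / max(image.size)
new_size = (round(image.width * scale), round(image.height * scale))
resized = image.resize(new_size, Image.LANCZOS)
resized.save("page_1024.png")
```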

## Manual Prompting

If you want to prompt this model manually instead of using the [olmOCR toolkit](https://github.com/allenai/olmocr), please see the code below.

In normal usage, the olmOCR toolkit builds the prompt by rendering the PDF page and
extracting relevant text blocks and image metadata. To duplicate that, you will need to

```bash
pip install olmocr
```

and then run the following sample code.

```python
import torch
import base64
import urllib.request

from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")

# Render page 1 to an image
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Build the prompt, using document metadata
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,
)

# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)

print(text_output)
# ['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
```
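
With `max_new_tokens=50` the generation is cut off mid-document, as in the sample output above; raise it (e.g. to a few thousand tokens) to transcribe a full page. The model responds with a JSON object whose `natural_text` field holds the transcribed text, so a small post-processing sketch (assuming the generation completed and is valid JSON) might look like this:

```python
import json

# text_output comes from the decoding step above; take the first returned sequence.
raw = text_output[0]

# Parse the model's JSON response and pull out the page text.
# Field names follow the sample output shown above.
page = json.loads(raw)
print(page["natural_text"])
```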

## License and use

olmOCR is licensed under the Apache 2.0 license.
olmOCR is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).