olmOCR-2-7B-1025-NVFP4
NVFP4 (W4A4) quantized version of allenai/olmOCR-2-7B-1025 — a vision-language model for high-quality document OCR.
Weights and activations are quantized to FP4 using the NVFP4 scheme, with the vision encoder and lm_head kept at full precision. The model is stored in compressed-tensors format for native vLLM inference.
Benchmark Results
Evaluated on the full OlmOCR Bench suite (120 tests):
| Model | Aggregate | baseline | table | absent | order | math | present |
|---|---|---|---|---|---|---|---|
| Original (BF16) | 90.9% | 100% | 85% | 83.3% | 76.9% | 75% | 73.3% |
| This model (NVFP4) | 93.6% | 100% | 90% | 87.5% | 76.9% | 75% | 73.3% |
Model size: 6.8 GB (vs ~15 GB for the BF16 original).
How It Was Quantized
Quantized with llm-compressor using post-training quantization (PTQ) with the QuantizationModifier and the NVFP4 scheme.
Recipe (recipe.yaml):
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: [Linear]
      scheme: NVFP4
      ignore: [lm_head, 're:model.visual.*']
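To make the effect of the ignore list concrete, here is a minimal sketch of how its two entries select modules. It assumes that entries prefixed with re: are treated as regular expressions matched from the start of the module name, and that other entries must equal the module name exactly. The helper is_ignored and the example module names are hypothetical and are not part of llm-compressor.

```python
import re

# Entries copied from the recipe above.
IGNORE = ["lm_head", "re:model.visual.*"]


def is_ignored(module_name: str, ignore: list[str]) -> bool:
    """Return True if module_name matches any entry of the ignore list.

    Assumption: "re:" entries are regexes anchored at the start of the name,
    all other entries are exact module names.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
example_names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.self_attn.q_proj",
]

for name in example_names:
    status = "kept at full precision" if is_ignored(name, IGNORE) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions, only Linear layers outside the vision encoder and the output head are quantized, which matches the preserved layers listed under Details.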
Calibration data: 1000 text samples from allenai/olmOCR-mix-1025 (config 00_documents), using the natural_text field wrapped in the model's chat template.
Quantization script:
import torch
from datasets import Dataset, load_dataset
from llmcompressor import oneshot
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-2-7B-1025",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("allenai/olmOCR-2-7B-1025")

# Prepare calibration data from olmOCR-mix-1025
raw_dataset = load_dataset(
    "allenai/olmOCR-mix-1025", "00_documents", split="train", streaming=True
)
samples = list(raw_dataset.shuffle(seed=42).take(1000))

dataset_items = []
for item in samples:
    text = item["natural_text"]
    # Skip documents with no usable text
    if not text or not text.strip():
        continue
    messages = [{"role": "user", "content": [
        {"type": "text", "text": f"Convert this document to markdown:\n\n{text}"}
    ]}]
    rendered = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[rendered], padding=False, max_length=8192, truncation=True)
    dataset_items.append(inputs)

# Turn the list of per-sample dicts into a column-oriented dataset
batch_dict = {key: [item[key] for item in dataset_items] for key in dataset_items[0].keys()}
calibration_dataset = Dataset.from_dict(batch_dict)

oneshot(
    model=model,
    dataset=calibration_dataset,
    recipe="recipe.yaml",
    # One sample per batch: convert each field of that sample to a tensor
    data_collator=lambda batch: {k: torch.tensor(v) for k, v in batch[0].items()},
    num_calibration_samples=len(calibration_dataset),
    max_seq_length=8192,
)

model.save_pretrained("olmOCR-2-7B-1025-NVFP4")
processor.save_pretrained("olmOCR-2-7B-1025-NVFP4")
Serving with vLLM
Requires vLLM v0.18.0+ with NVFP4 support.
vllm serve Charitarth/olmOCR-2-7B-1025-NVFP4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9
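Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on the default http://localhost:8000. The helper names build_request and send, the file name page.png, and the prompt text are illustrative assumptions; the olmocr pipeline builds its own prompt, so use the pipeline for production OCR.

```python
import base64
import json
import urllib.request

MODEL = "Charitarth/olmOCR-2-7B-1025-NVFP4"


def build_request(image_bytes: bytes, prompt: str, max_tokens: int = 4096) -> dict:
    """Build a chat completions payload with one PNG image and one text prompt."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }


def send(payload: dict, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server and a rendered page image):
#
#     with open("page.png", "rb") as f:
#         payload = build_request(f.read(), "Convert this page to markdown.")
#     print(send(payload))
```

Sending the image as a base64 data URL avoids having to host the page image somewhere the server can reach.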
Using with olmOCR Pipeline
Once the vLLM server is running, point the olmocr pipeline at it:
# Install olmocr
pip install olmocr
# Option 1: Point at your running vLLM server
olmocr ./workspace \
    --server http://localhost:8000/v1 \
    --model Charitarth/olmOCR-2-7B-1025-NVFP4 \
    --markdown \
    --pdfs your_document.pdf
# Option 2: Let olmocr launch and manage the vLLM server itself (omit --server)
olmocr ./workspace \
    --model Charitarth/olmOCR-2-7B-1025-NVFP4 \
    --markdown \
    --pdfs your_document.pdf
Python API
import asyncio
from olmocr.pipeline import main as olmocr_main
# olmocr's pipeline accepts the same arguments programmatically
asyncio.run(olmocr_main([
    "./workspace",
    "--server", "http://localhost:8000/v1",
    "--model", "Charitarth/olmOCR-2-7B-1025-NVFP4",
    "--markdown",
    "--pdfs", "your_document.pdf",
]))
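If you would rather not depend on the pipeline's Python entry point, an alternative is to call the command-line tool from a script. The sketch below assembles the same arguments shown in the shell examples and runs them with subprocess. The helpers build_olmocr_command and run_olmocr are hypothetical, and passing several paths after --pdfs is an assumption about the CLI, so check olmocr's help output for your installed version.

```python
import subprocess
from typing import Optional, Sequence


def build_olmocr_command(
    workspace: str,
    pdfs: Sequence[str],
    model: str,
    server: Optional[str] = None,
) -> list[str]:
    """Assemble the olmocr argument list used in the shell examples."""
    cmd = ["olmocr", workspace]
    if server is not None:
        # Option 1: reuse an already running vLLM server.
        cmd += ["--server", server]
    # Without --server, olmocr is expected to manage vLLM itself (Option 2).
    cmd += ["--model", model, "--markdown", "--pdfs", *pdfs]
    return cmd


def run_olmocr(cmd: Sequence[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(list(cmd), check=True)


# Example usage (requires olmocr to be installed):
#
#     run_olmocr(build_olmocr_command(
#         "./workspace",
#         ["your_document.pdf"],
#         "Charitarth/olmOCR-2-7B-1025-NVFP4",
#         server="http://localhost:8000/v1",
#     ))
```

Running the CLI in a subprocess keeps its argument parsing and event loop isolated from the calling program.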
Details
- Base model: allenai/olmOCR-2-7B-1025 (Qwen2.5-VL 7B architecture)
- Quantization: NVFP4 (W4A4) — 4-bit weights, 4-bit activations, FP4 numeric type
- Format: compressed-tensors (nvfp4-pack-quantized)
- Preserved layers: Vision encoder (model.visual.*), language model head (lm_head)
- Group size: 16 (both weights and activations)
- Calibration: 1000 samples, text-only, from olmOCR-mix-1025
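To illustrate what 4-bit FP4 values with a group size of 16 mean numerically, here is a simplified fake-quantization sketch in pure Python. It snaps each group of 16 values to the FP4 E2M1 grid after scaling the group's largest magnitude to 6.0. This is not the llm-compressor or vLLM implementation: the group scale is kept as an ordinary Python float, whereas NVFP4 stores it in FP8 alongside a per-tensor scale, and ties are resolved toward the smaller grid value for simplicity. The function names are hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16


def quantize_group(values: list[float]) -> list[float]:
    """Fake-quantize one group: scale, snap to the E2M1 grid, scale back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    # Map the largest magnitude in the group onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        target = abs(v) / scale
        magnitude = min(E2M1_GRID, key=lambda g: abs(target - g))
        out.append(magnitude * scale if v >= 0 else -magnitude * scale)
    return out


def fake_quantize(values: list[float], group_size: int = GROUP_SIZE) -> list[float]:
    """Apply quantize_group independently to consecutive groups of values."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of the group size")
    out: list[float] = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size]))
    return out


# Sixteen evenly spaced values, i.e. exactly one group.
weights = [i / 16 for i in range(-8, 8)]
for original, quantized in zip(weights, fake_quantize(weights)):
    print(f"{original:+.4f} -> {quantized:+.4f}")
```

Because every group gets its own scale, one outlier only degrades the precision of the 16 values that share its group. If the group scale is stored in 8 bits, the overhead is roughly half a bit per quantized value on top of the 4-bit payload.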