
# SmolDocling-256M-preview
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
## Features

- DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- OCR (Optical Character Recognition) – Extracts text accurately from images.
- Layout and Localization – Preserves document structure and document element bounding boxes.
- Code Recognition – Detects and formats code blocks, including indentation.
- Formula Recognition – Identifies and processes mathematical expressions.
- Chart Recognition – Extracts and interprets chart data.
- Table Recognition – Supports column and row headers for structured table extraction.
- Figure Classification – Differentiates figures and graphical elements.
- Caption Correspondence – Links captions to relevant images and figures.
- List Grouping – Organizes and structures list elements correctly.
- Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- OCR with Bounding Boxes – Performs OCR within a user-specified bounding-box region.
- General Document Processing – Trained on both scientific and non-scientific documents.
- Seamless Docling Integration – Imports into Docling and exports to multiple formats.
- Fast Inference Using vLLM – Averages 0.35 seconds per page on an A100 GPU.
## Coming soon!

- Better chart recognition
- One-shot multi-page inference
- Chemical recognition
- Datasets
## Get started (code examples)
You can use transformers or vLLM to perform inference, and Docling to convert the results to a variety of output formats (md, html, etc.):
### Single-page image inference using Transformers
```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a Docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export as any format, e.g. HTML:
# doc.save_as_html(output_file)
# or Markdown:
print(doc.export_to_markdown())
```
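Note that `skip_special_tokens=False` is deliberate: much of the DocTags markup is tokenized as special tokens, so decoding with special tokens stripped would discard the document structure. Depending on the tokenizer, the decoded string may also end with an `<end_of_utterance>` marker, which can safely be removed before parsing.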
### Fast batch inference using vLLM
```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# Place the page images you want to convert into an "img/" directory

import os
import time

from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
)
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted(
    f for f in os.listdir(IMAGE_DIR)
    if f.lower().endswith((".png", ".jpg", ".jpeg"))
)

start_time = time.time()

for img_file in image_files:
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]
    doctags = output.outputs[0].text

    # Save the raw DocTags output
    img_fn = os.path.splitext(img_file)[0]
    output_path = os.path.join(OUTPUT_DIR, img_fn + ".dt")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(doctags)

    # To convert to Docling Document, MD, HTML, etc.:
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)

    # Export as any format, e.g. HTML:
    # doc.save_as_html(output_file)
    # or Markdown:
    output_path_md = os.path.join(OUTPUT_DIR, img_fn + ".md")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
## DocTags

### Supported Instructions
| Description | Instruction | Comment |
|---|---|---|
| Full conversion | Convert this page to docling. | DocTags representation |
| Chart | Convert chart to table. | (e.g., `<chart>`) |
| Formula | Convert formula to LaTeX. | (e.g., `<formula>`) |
| Code | Convert code to text. | (e.g., `<code>`) |
| Table | Convert table to OTSL. | (e.g., `<otsl>`) OTSL: Lysak et al., 2023 |
| Actions and Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` | |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` | |
| | Find all 'text' elements on the page, retrieve all section headers. | |
| | Detect footer elements on the page. | |
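Any instruction from this table can replace the prompt text in the examples above. As a sketch, reusing the `processor`, `model`, `DEVICE`, and `load_image` from the Transformers example, and assuming `formula.png` is a hypothetical cropped image containing a single formula:

```python
# Hypothetical input: a cropped image of one formula
image = load_image("formula.png")

# Same chat structure as before, with the formula instruction from the table
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert formula to LaTeX."},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
latex = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0]
print(latex)
```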
## Model Summary
- Developed by: Docling Team, IBM Research
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Based on Idefics3 (see technical summary)
- Repository: Docling
- Paper: [Coming soon]
- Demo: [Coming soon]