Ibad ur Rehman

feat: add figure transcription to docling parser

dfb4c77 about 4 hours ago

metadata

title: Docling Parser API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: t4-small

Docling Parser API

A FastAPI service that turns PDFs and Excel workbooks into LLM-ready markdown using a hybrid parser:

Pass 1: Docling parses the document (layout + TableFormer, OCR disabled)
Pass 2: Gemini 3 Flash reparses table-heavy and weak-text pages
Pass 2.5 (opt-in): Gemini summarises qualifying charts / diagrams / figures
Post: Artifact removal, deduplication, table cleanup

Features

Docling core parser — layout analysis, TableFormer, cross-page understanding
Gemini page enhancement — higher-fidelity reparse for table or weak-text pages
Gemini figure transcription — optional short visual summaries for charts and diagrams (default off)
Excel support — .xlsx / .xlsm workbooks rendered as HTML <table> markdown
Optional image ZIP — return all extracted pictures as a base64 ZIP
URL parsing with SSRF protection — blocks private / loopback / cloud-metadata hosts
T4-friendly — fits comfortably in 16 GB VRAM
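The SSRF protection listed above can be sketched as a pre-flight check on the target URL. This is an illustration only, not the service's actual code: the function name `is_url_allowed` and the exact set of blocked address classes are assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    """Return True only if every address the host resolves to is public.

    Hypothetical helper: the real service may apply different rules.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(
            parsed.hostname, parsed.port or 443, proto=socket.IPPROTO_TCP
        )
    except (socket.gaierror, UnicodeError):
        return False
    for info in infos:
        # Drop any IPv6 zone suffix ("fe80::1%eth0") before parsing.
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        if (
            addr.is_private
            or addr.is_loopback
            or addr.is_link_local  # covers 169.254.169.254 (cloud metadata)
            or addr.is_reserved
            or addr.is_multicast
            or addr.is_unspecified
        ):
            return False
    return True
```

A check like this only validates the addresses seen at resolution time. A production fetcher would also need to connect to the validated address and re-check redirects.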

Architecture

PDF / Excel
  -> Pass 1: Docling (layout + TableFormer, no OCR)
  -> Pass 2: Gemini 3 Flash on table pages and weak-text pages
  -> Pass 2.5 (opt-in): Gemini describes qualifying PictureItems
  -> Post-processing: artifact removal, dedup, table cleanup
  -> Final markdown response
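Pass 2 sends several pages to Gemini, and the Configuration section caps that fan-out with `GEMINI_CONCURRENCY` (default 8). One common way to bound concurrency in Python is an `asyncio.Semaphore`. The sketch below assumes a hypothetical async callable `reparse_one(page)` and is not taken from the service's code.

```python
import asyncio


async def reparse_pages(pages, reparse_one, limit=8):
    """Run reparse_one(page) for every page with at most `limit` in flight.

    `reparse_one` is a hypothetical coroutine function standing in for
    the per-page Gemini call.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(page):
        async with semaphore:
            return await reparse_one(page)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(page) for page in pages))
```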

API Endpoints

Endpoint	Method	Description
`/`	GET	Health check with model and Gemini status
`/parse`	POST	Parse an uploaded file (`multipart/form-data`)
`/parse/url`	POST	Parse a document from a URL (JSON body)
`/docs`	GET	OpenAPI documentation (Swagger UI)

Authentication

All parse endpoints require a bearer token:

Authorization: Bearer YOUR_API_TOKEN

Set API_TOKEN as a secret in the Hugging Face Space settings.
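For illustration, a bearer-token check of this kind can be written with the standard library alone. The helper name `is_authorized` is hypothetical, and how the service actually validates the header is not shown in this README.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header value.

    Hypothetical helper. Uses a constant-time comparison so the check
    does not leak how many leading characters matched.
    """
    if not authorization_header or not expected_token:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```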

Supported file types

.pdf
.xlsx
.xlsm

Other types (images, Word, etc.) are rejected with HTTP 400 (Unsupported file type).
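A client can run the same check before uploading to avoid a round trip. This minimal sketch assumes the service matches on the file extension and ignores case. Neither detail is confirmed by this README.

```python
from pathlib import PurePosixPath

# The three extensions listed above.
SUPPORTED_SUFFIXES = {".pdf", ".xlsx", ".xlsm"}


def is_supported(filename: str) -> bool:
    """Return True if the filename ends in a supported extension.

    Case-insensitive matching is an assumption.
    """
    return PurePosixPath(filename).suffix.lower() in SUPPORTED_SUFFIXES
```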

Quick Start

cURL: Upload a File

curl -X POST "https://outcomelabs-docling-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf"

cURL: With figure transcription

curl -X POST "https://outcomelabs-docling-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "transcribe_images=true"

cURL: Parse from URL

curl -X POST "https://outcomelabs-docling-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "transcribe_images": true}'

Python

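import requests

API_URL = "https://outcomelabs-docling-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Upload the PDF as multipart/form-data and opt in to figure transcription.
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"transcribe_images": "true"},
        timeout=600,  # illustrative value; large documents can take a while
    )

result = response.json()
if response.ok and result.get("success"):
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(
        f"Figures detected={result['images_detected']}, "
        f"considered={result['images_considered']}, "
        f"transcribed={result['images_transcribed']}"
    )
    print(result["markdown"])
else:
    # Parse failures carry "error"; other failures (for example a rejected
    # token) may use a different body shape, so fall back to the whole body.
    print(f"Request failed ({response.status_code}): {result.get('error', result)}")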

Request Parameters

`/parse`

Parameter	Type	Required	Default	Description
`file`	File	Yes	-	`.pdf`, `.xlsx`, or `.xlsm`
`output_format`	string	No	`markdown`	Only `markdown` is currently supported
`images_scale`	float	No	`2.0`	Accepted for compatibility
`start_page`	int	No	`0`	Starting page, zero-indexed (PDF only)
`end_page`	int	No	`null`	Ending page, or all pages if omitted (PDF only)
`include_images`	bool	No	`false`	Include extracted images as a base64 ZIP payload
`transcribe_images`	bool	No	`null`	Transcribe qualifying charts/diagrams inline. `null` = use server `TRANSCRIBE_IMAGES` default

`/parse/url`

Parameter	Type	Required	Default	Description
`url`	string	Yes	-	Source document URL
`output_format`	string	No	`markdown`	Only `markdown` is currently supported
`images_scale`	float	No	`2.0`	Accepted for compatibility
`start_page`	int	No	`0`	Starting page, zero-indexed (PDF only)
`end_page`	int	No	`null`	Ending page, or all pages if omitted (PDF only)
`include_images`	bool	No	`false`	Include extracted images as a base64 ZIP
`transcribe_images`	bool	No	`null`	Transcribe qualifying charts/diagrams inline. `null` = use server `TRANSCRIBE_IMAGES` default
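Because `/parse/url` takes a JSON body, optional fields can simply be left out so the server defaults apply. The helper below is a sketch of building that body from the parameters in the table above. The function name is hypothetical.

```python
import json


def build_parse_url_body(
    url,
    *,
    start_page=None,
    end_page=None,
    include_images=None,
    transcribe_images=None,
):
    """Serialise a /parse/url request body, omitting unset options."""
    optional = {
        "start_page": start_page,
        "end_page": end_page,
        "include_images": include_images,
        "transcribe_images": transcribe_images,
    }
    body = {"url": url}
    # Only send what the caller set, so server-side defaults stay in effect.
    body.update({key: value for key, value in optional.items() if value is not None})
    return json.dumps(body)
```

The resulting string can be sent with `requests.post(f"{API_URL}/parse/url", headers={**headers, "Content-Type": "application/json"}, data=body)`, reusing `API_URL` and `headers` from the Python example above.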

Response Format

{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cpu",
  "vlm_model": "Docling + Gemini",
  "gemini_page_count": 3,
  "gemini_pages": [2, 7, 12],
  "images_detected": 14,
  "images_considered": 6,
  "images_transcribed": 5
}

Field	Type	Description
`success`	boolean	Whether parsing succeeded
`markdown`	string	Extracted markdown
`json_content`	object	Reserved field, currently `null`
`images_zip`	string	Base64 ZIP of extracted images (when `include_images=true`)
`image_count`	int	Number of images in the ZIP
`error`	string	Error message when parsing fails
`pages_processed`	int	Number of pages processed
`device_used`	string	Device label returned by the service
`vlm_model`	string	Active parser label (`Docling + Gemini` or `openpyxl`)
`gemini_page_count`	int	Number of pages reparsed by Gemini in Pass 2
`gemini_pages`	int[]	Absolute page numbers reparsed by Gemini
`images_detected`	int	Total PictureItems Docling emitted
`images_considered`	int	PictureItems that passed the local size/area filter
`images_transcribed`	int	PictureItems that returned a non-SKIP Gemini description
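When `include_images=true`, the `images_zip` field has to be base64-decoded before it can be opened as a ZIP archive. A minimal sketch follows. The helper name is hypothetical, and the member names inside the archive are whatever the service emits.

```python
import base64
import io
import zipfile


def extract_images(result: dict) -> dict:
    """Return {member_name: image_bytes} from a parse response.

    Returns an empty dict when images_zip is null or missing.
    """
    payload = result.get("images_zip")
    if not payload:
        return {}
    with zipfile.ZipFile(io.BytesIO(base64.b64decode(payload))) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```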

Figure Transcription

When transcribe_images=true (or TRANSCRIBE_IMAGES=true on the server) and a GEMINI_API_KEY is configured, qualifying figures are sent to Gemini for a concise visual summary. Each summary is inserted into the markdown as a blockquote:

> **Figure (page 7):** Bar chart of quarterly revenue from 2020–2024 showing an upward trend; peak around Q4 2024.
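Downstream code may want to pull these summaries back out of the markdown. The sketch below assumes every summary is a single line in exactly the format of the example above. Multi-line summaries or other variants would need a looser pattern.

```python
import re

# Matches lines shaped like: > **Figure (page 7):** summary text
FIGURE_LINE = re.compile(r"^> \*\*Figure \(page (\d+)\):\*\* (.+)$", re.MULTILINE)


def find_figure_summaries(markdown: str):
    """Return a list of (page_number, summary) tuples in document order."""
    return [(int(page), text.strip()) for page, text in FIGURE_LINE.findall(markdown)]
```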

Decorative figures (logos, dividers, small icons) are filtered out locally by size and bbox-area thresholds; any remaining low-value images are skipped when Gemini returns a [SKIP] escape token.
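The local filter can be pictured as two threshold tests using the defaults from the Configuration table (`IMAGE_TRANSCRIPTION_MIN_PX=150`, `IMAGE_TRANSCRIPTION_MIN_AREA_RATIO=0.02`). How the service combines them is not documented here. The sketch assumes both the width and the height must reach the pixel minimum and that the two tests are ANDed together.

```python
from dataclasses import dataclass


@dataclass
class Figure:
    """Hypothetical container for a figure's size and its page size."""
    width_px: float
    height_px: float
    page_width_px: float
    page_height_px: float


def qualifies(fig: Figure, min_px: float = 150, min_area_ratio: float = 0.02) -> bool:
    """Return True if the figure is large enough to be worth transcribing."""
    # Assumption: the shorter side must reach the pixel minimum.
    if min(fig.width_px, fig.height_px) < min_px:
        return False
    page_area = fig.page_width_px * fig.page_height_px
    if page_area <= 0:
        return False
    # Assumption: the figure must also cover enough of the page.
    return (fig.width_px * fig.height_px) / page_area >= min_area_ratio
```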

A per-request cap (MAX_IMAGE_TRANSCRIPTIONS, default 50) protects against documents with hundreds of charts. When the cap is exceeded, the largest figures by bbox area are kept and the rest are dropped, with a warning in the server logs.
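The cap behaviour described above amounts to a top-N selection by area. In this sketch figures are hypothetical `(figure_id, bbox_area)` tuples, and keeping the survivors in document order is an assumption.

```python
def apply_cap(figures, cap=50):
    """Keep at most `cap` figures, preferring the largest bbox areas.

    figures: list of (figure_id, bbox_area) tuples in document order.
    Returns (kept_figures, dropped_count).
    """
    if len(figures) <= cap:
        return list(figures), 0
    # Rank positions by area, largest first, and keep the top `cap`.
    ranked = sorted(range(len(figures)), key=lambda i: figures[i][1], reverse=True)
    keep = set(ranked[:cap])
    kept = [fig for i, fig in enumerate(figures) if i in keep]
    return kept, len(figures) - len(kept)
```

A caller would log a warning whenever `dropped_count` is non-zero, mirroring the server-log warning mentioned above.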

Configuration

Environment Variable	Description	Default
`API_TOKEN`	Required API authentication token	-
`MAX_FILE_SIZE_MB`	Maximum upload size in MB	`1024`
`IMAGES_SCALE`	Image scale for extracted pictures	`2.0`
`RENDER_DPI`	DPI for PDF→PNG rendering (Gemini page input)	`200`
`GEMINI_API_KEY`	Gemini API key	-
`GEMINI_MODEL`	Gemini model name	`gemini-3-flash-preview`
`GEMINI_TIMEOUT`	Gemini request timeout in seconds	`120`
`GEMINI_CONCURRENCY`	Max concurrent page-level Gemini requests	`8`
`TRANSCRIBE_IMAGES`	Default for figure transcription (overridable per request)	`false`
`IMAGE_TRANSCRIPTION_MIN_PX`	Min pixel dimension to qualify a figure	`150`
`IMAGE_TRANSCRIPTION_MIN_AREA_RATIO`	Min bbox-to-page area ratio to qualify a figure	`0.02`
`MAX_IMAGE_TRANSCRIPTIONS`	Hard per-request cap on figure Gemini calls	`50`
`GEMINI_IMAGE_CONCURRENCY`	Max concurrent figure-level Gemini calls	`8`
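As an illustration of how a subset of these variables and defaults could be read at startup, the sketch below loads them into a frozen dataclass. The `Settings` class, the `load_settings` function, and the accepted boolean spellings are assumptions, not the service's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}  # accepted spellings are an assumption


def _as_bool(raw: Optional[str], default: bool) -> bool:
    return default if raw is None else raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class Settings:
    api_token: Optional[str]
    gemini_api_key: Optional[str]
    gemini_model: str
    gemini_timeout: float
    gemini_concurrency: int
    transcribe_images: bool
    image_min_px: int
    image_min_area_ratio: float
    max_image_transcriptions: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (os.environ by default), using the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        api_token=env.get("API_TOKEN"),
        gemini_api_key=env.get("GEMINI_API_KEY"),
        gemini_model=env.get("GEMINI_MODEL", "gemini-3-flash-preview"),
        gemini_timeout=float(env.get("GEMINI_TIMEOUT", "120")),
        gemini_concurrency=int(env.get("GEMINI_CONCURRENCY", "8")),
        transcribe_images=_as_bool(env.get("TRANSCRIBE_IMAGES"), False),
        image_min_px=int(env.get("IMAGE_TRANSCRIPTION_MIN_PX", "150")),
        image_min_area_ratio=float(env.get("IMAGE_TRANSCRIPTION_MIN_AREA_RATIO", "0.02")),
        max_image_transcriptions=int(env.get("MAX_IMAGE_TRANSCRIPTIONS", "50")),
    )
```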

Logging

The service logs:

Unique 8-character request IDs on every line
File size, type, and page range
Timings for Pass 1 (Docling), Pass 2 (Gemini pages), Pass 2.5 (figures), and post-processing
Figure counts (detected / considered / transcribed) and cap-truncation warnings
Final pages/sec and total processing time
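A per-request ID on every log line can be achieved with `logging.LoggerAdapter`. The sketch below is one way to do it. Deriving the 8-character ID from a UUID, the logger name, and the log format are all assumptions.

```python
import logging
import uuid

# Example format: every line is prefixed with the request ID.
LOG_FORMAT = "[%(request_id)s] %(message)s"


def make_request_logger(base=None):
    """Return (request_id, logger) where the logger tags each line with the ID.

    The handler's formatter must reference %(request_id)s, as LOG_FORMAT does.
    """
    if base is None:
        base = logging.getLogger("docling-parser")  # hypothetical logger name
    request_id = uuid.uuid4().hex[:8]  # 8 hex characters
    return request_id, logging.LoggerAdapter(base, {"request_id": request_id})
```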

Credits

Built with Docling, Gemini, FastAPI, and supporting Python tooling for document parsing.