LayoutLMv3-Japanese-preview

LayoutLMv3-Japanese-preview is a multimodal pre-trained model for Japanese Document AI, built on the LayoutLMv3 architecture. The text tokenizer is replaced with the Japanese tokenizer from nlp-waseda/roberta-base-japanese, and the visual tokenizer adopts BEiT v2 VQ-KD. The model is pre-trained on ~20M Japanese web pages from NDL WARP, plus the PubLayNet and DocLayNet document layout analysis datasets.

Training Data

Pre-training

Dataset	Scale
NDL WARP	20M+ pages
PubLayNet	360K+ images
DocLayNet	80K+ images

Evaluation

Dataset	Task	Language
FUNSD	Form Understanding	English
XFUND-JA	Form Understanding	Japanese
JDocQA	Document Visual Question Answering	Japanese

Tokenizer and Visual Tokenizer

Component	Source	License
Text Tokenizer	nlp-waseda/roberta-base-japanese	Apache 2.0
Visual Tokenizer (VQ-KD)	BEiT v2	MIT
PDF Text/BBox Extractor (pre-training)	PyMuPDF (`fitz`)	AGPL v3 / commercial

Evaluation Results

FUNSD (English Form Understanding)

Model	F1 (test)
LayoutLMv3	0.9059
LayoutLMv2	0.8276
LayoutLMv3-Japanese-preview (Ours)	0.8284

XFUND-JA (Japanese Form Understanding)

Model	F1 (test)
LayoutXLM	0.7921
LayoutLMv3-Japanese-preview (Ours)	0.7436

JDocQA (Japanese Document Visual Question Answering)

Model	Task	F1 (test)	Accuracy (test)
LayoutLMv3-Japanese-preview (Ours)	Yes/No	0.5750	0.7639
LayoutLMv3-Japanese-preview (Ours)	Multiple Choice (4 options)	0.8351	0.8353
LayoutXLM	Yes/No	0.5403	0.7847
LayoutXLM	Multiple Choice (4 options)	0.8544	0.8543

Example Usage

The tokenizer is an AlbertTokenizer (SentencePiece) inherited from nlp-waseda/roberta-base-japanese, so the standard LayoutLMv3Processor cannot be used as-is. Instead, combine AutoTokenizer with LayoutLMv3ImageProcessor, tokenize each word separately, and assign the word's bounding box to every subword it produces.

Requirements: transformers>=4.44, torch, Pillow. PyMuPDF (imported as fitz) is optional and only needed if you want to render a PDF on the fly; running the provided sample.png does not require it. Words and bounding boxes can be produced by any OCR engine that supports Japanese (e.g. PaddleOCR, Tesseract with jpn traineddata, or manga-ocr). Supply boxes as pixel coordinates (x0, y0, x1, y1), then normalize them to LayoutLMv3's 0–1000 range before passing them to the model.
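
As one way to obtain words and pixel boxes, the sketch below converts a Tesseract-style result dictionary into two parallel lists of words and (x0, y0, x1, y1) boxes. It assumes the dictionary layout returned by pytesseract's image_to_data with output_type=Output.DICT (keys text, conf, left, top, width, height). The helper name ocr_dict_to_words_boxes, the min_conf threshold, and the sample values are illustrative and not part of this repository.

```python
def ocr_dict_to_words_boxes(data, min_conf=0.0):
    """Convert a Tesseract-style result dict into (words, boxes).

    `data` is assumed to hold parallel lists under the keys
    "text", "conf", "left", "top", "width", "height".
    Boxes are returned as (x0, y0, x1, y1) in pixel coordinates.
    """
    words, boxes = [], []
    for text, conf, left, top, w, h in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        text = text.strip()
        # Tesseract reports conf = -1 for non-word rows (blocks, lines).
        if not text or float(conf) < min_conf:
            continue
        words.append(text)
        boxes.append((left, top, left + w, top + h))
    return words, boxes


# Made-up OCR output for two words. With pytesseract installed, the dict
# would come from something like:
#   pytesseract.image_to_data(image, lang="jpn",
#                             output_type=pytesseract.Output.DICT)
sample = {
    "text":   ["", "災害公営住宅", "完成資料"],
    "conf":   [-1, 93.5, 88.0],
    "left":   [0, 430, 625],
    "top":    [0, 28, 28],
    "width":  [800, 180, 145],
    "height": [600, 42, 42],
}
words, boxes = ocr_dict_to_words_boxes(sample)
print(words)
print(boxes)
```

Other OCR engines return different structures, so only the final shape matters: one string and one pixel box per word.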

A sample Japanese document image is provided as sample.png in this repository for quick experimentation. The snippet below supports two input paths: load sample.png directly, or render the first page of any PDF via fitz (PyMuPDF) — uncomment the branch you want.

import io

import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Model

REPO = "llm-jp/layoutlmv3-japanese-preview"

tokenizer = AutoTokenizer.from_pretrained(REPO)
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
model = LayoutLMv3Model.from_pretrained(REPO).eval()

# --- Option A: use the bundled sample.png (no PyMuPDF required) ---
image = Image.open("sample.png").convert("RGB")

# --- Option B: render the first page of any PDF via fitz (PyMuPDF) ---
# import fitz
# doc = fitz.open("your_document.pdf")
# pix = doc[0].get_pixmap(dpi=200)
# image = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")

width, height = image.size

# Replace with your OCR output. Each box is (x0, y0, x1, y1) in pixel coords.
words = ["石巻市駅前北通り", "災害公営住宅", "完成資料"]
boxes = [
    (160,  28, 420, 70),
    (430,  28, 610, 70),
    (625,  28, 770, 70),
]

def normalize(box, w, h):
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / w), int(1000 * y0 / h),
        int(1000 * x1 / w), int(1000 * y1 / h),
    ]

# Tokenize each word and assign the word's bbox to all of its subwords.
input_ids = [tokenizer.cls_token_id]
bbox      = [[0, 0, 0, 0]]
for word, box in zip(words, [normalize(b, width, height) for b in boxes]):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids += ids
    bbox      += [box] * len(ids)
input_ids.append(tokenizer.sep_token_id)
bbox.append([1000, 1000, 1000, 1000])

input_ids      = torch.tensor([input_ids])
bbox           = torch.tensor([bbox])
attention_mask = torch.ones_like(input_ids)
pixel_values   = image_processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        bbox=bbox,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
    )

# (batch, num_text_tokens + num_image_patches, hidden_size) — e.g. (1, 213, 768)
print(outputs.last_hidden_state.shape)
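
Real pages often yield more subword tokens than the text encoder accepts, and OCR boxes can fall slightly outside the page. The sketch below shows one hedged way to handle both before building tensors: it clamps normalized coordinates to the 0–1000 range and truncates the token sequence so that, with the two special tokens added, it stays within a limit. The 512-token default is an assumption based on the original LayoutLMv3 base configuration, so check this model's config before relying on it. The helper names are illustrative.

```python
def clamp_box(box, lo=0, hi=1000):
    """Clamp a normalized [x0, y0, x1, y1] box into the range [lo, hi]."""
    return [min(max(int(v), lo), hi) for v in box]


def truncate_tokens(token_ids, token_boxes, max_len=512, num_special=2):
    """Keep at most max_len - num_special tokens (room for CLS and SEP).

    max_len=512 is an assumed limit, not read from the model config.
    """
    budget = max_len - num_special
    return token_ids[:budget], token_boxes[:budget]


# Toy example: 600 fake token ids, all sharing one out-of-range box.
ids = list(range(600))
boxes = [clamp_box([-5, 20, 1012, 40])] * 600
ids, boxes = truncate_tokens(ids, boxes)
print(len(ids), boxes[0])
```

Truncation simply drops the tail of the page; splitting long pages into overlapping windows is an alternative when the dropped text matters.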

For downstream tasks (token classification, QA, etc.), swap LayoutLMv3Model for the corresponding task head class such as LayoutLMv3ForTokenClassification or LayoutLMv3ForQuestionAnswering and fine-tune on your labeled data.
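
For token classification, word-level labels have to be expanded to the subword sequence. A common convention, used here as an assumption rather than something this model card prescribes, is to label only the first subword of each word and mark the remaining positions with -100, which PyTorch's cross-entropy loss ignores by default. The function align_labels and the toy tokenizer are hypothetical; in practice tokenize would wrap the repository tokenizer as shown in the docstring.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(words, word_labels, tokenize):
    """Expand word-level labels to subword level.

    `tokenize(word)` must return the list of subword ids for one word,
    e.g. lambda w: tokenizer(w, add_special_tokens=False)["input_ids"].
    The result includes IGNORE_INDEX for the CLS and SEP positions.
    """
    labels = [IGNORE_INDEX]  # CLS
    for word, label in zip(words, word_labels):
        n = len(tokenize(word))
        if n == 0:
            continue  # word produced no tokens; nothing to label
        labels += [label] + [IGNORE_INDEX] * (n - 1)
    labels.append(IGNORE_INDEX)  # SEP
    return labels


# Toy stand-in tokenizer: one "subword" per character.
toy_tokenize = lambda w: [ord(c) for c in w]
aligned = align_labels(["完成", "資料"], [1, 2], toy_tokenize)
print(aligned)
```

The resulting list has the same length as the input_ids sequence built with CLS and SEP, so it can be passed as labels to a token classification head.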

License

This model is licensed under Apache 2.0.

Pre-training Pipeline (Text & Layout Extraction)

For the NDL WARP PDF corpus, word-level text and bounding boxes were extracted with PyMuPDF (fitz) — not a learned OCR model — to build (word, bbox) pairs used as the 1D text + 2D layout inputs for pre-training. PyMuPDF was used strictly as an internal data-processing tool: the library was not modified, was not redistributed, and is neither embedded in nor linked to the published model weights. Under AGPL v3, the copyleft obligations (§5 "Conveying Modified Source Versions" and §13 "Remote Network Interaction") attach only when the covered software itself is conveyed or served to users over a network. Running PyMuPDF locally to produce derived data (text strings and coordinate tuples), and then training a separate model on that data, is a permitted use and does not cause AGPL terms to propagate to the resulting model weights. We therefore consider this usage to be license-compliant. The weights distributed here remain Apache 2.0.
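
To illustrate what such (word, bbox) pairs look like, here is a minimal sketch. It is not the authors' actual pipeline. It assumes the tuple layout returned by PyMuPDF's page.get_text("words"), namely (x0, y0, x1, y1, word, block_no, line_no, word_no) in page coordinates, and scales each box to the 0–1000 range. The function name, the page size, and the sample tuples are hypothetical.

```python
def words_to_pairs(word_tuples, page_width, page_height):
    """Turn PyMuPDF-style word tuples into (word, [x0, y0, x1, y1]) pairs.

    Coordinates are scaled to integers in 0..1000 relative to the page size.
    """
    pairs = []
    for x0, y0, x1, y1, word, *_ in word_tuples:
        if not word.strip():
            continue  # skip whitespace-only entries
        box = [
            int(1000 * x0 / page_width), int(1000 * y0 / page_height),
            int(1000 * x1 / page_width), int(1000 * y1 / page_height),
        ]
        pairs.append((word, box))
    return pairs


# With PyMuPDF installed, the inputs would come from something like:
#   page = fitz.open("your_document.pdf")[0]
#   tuples = page.get_text("words")
#   w, h = page.rect.width, page.rect.height
# Here two made-up tuples stand in for a hypothetical 600 x 800 pt page.
tuples = [
    (60.0, 80.0, 300.0, 100.0, "災害公営住宅", 0, 0, 0),
    (300.0, 80.0, 480.0, 100.0, "完成資料", 0, 0, 1),
]
pairs = words_to_pairs(tuples, 600.0, 800.0)
for word, box in pairs:
    print(word, box)
```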

Training Data and Legal Notice

Pre-training data: NDL WARP (Japanese web archive), PubLayNet (CDLA-Permissive-1.0), DocLayNet (CDLA-Permissive-1.0)

The document images in PubLayNet originate from the PMC Open Access Commercial Use Collection, which includes articles under CC0, CC BY, CC BY-SA, and CC BY-ND licenses. We used these documents for model training in reliance on Article 30-4 of the Japanese Copyright Law (2026).

Users outside Japan should assess the applicability of their local copyright exceptions when using this model.

Model size: 0.1B params (Safetensors, tensor type F32)

Base model: nlp-waseda/roberta-base-japanese