LayoutLMv3-Japanese-preview
LayoutLMv3-Japanese-preview is a multimodal pre-trained model for Japanese Document AI, built on the LayoutLMv3 architecture. The text tokenizer is replaced with the Japanese tokenizer from nlp-waseda/roberta-base-japanese, and the visual tokenizer adopts BEiT v2 VQ-KD. The model is pre-trained on ~20M Japanese web pages from NDL WARP, together with the PubLayNet and DocLayNet document layout analysis datasets.
Training Data
Pre-training
Evaluation
| Dataset | Task | Language |
|---|---|---|
| FUNSD | Form Understanding | English |
| XFUND-JA | Form Understanding | Japanese |
| JDocQA | Document Visual Question Answering | Japanese |
Tokenizer and Visual Tokenizer
| Component | Source | License |
|---|---|---|
| Text Tokenizer | nlp-waseda/roberta-base-japanese | Apache 2.0 |
| Visual Tokenizer (VQ-KD) | BEiT v2 | MIT |
| PDF Text/BBox Extractor (pre-training) | PyMuPDF (fitz) | AGPL v3 / commercial |
Evaluation Results
FUNSD (English Form Understanding)
| Model | F1 (test) |
|---|---|
| LayoutLMv3 | 0.9059 |
| LayoutLMv2 | 0.8276 |
| LayoutLMv3-Japanese-preview (Ours) | 0.8284 |
XFUND-JA (Japanese Form Understanding)
| Model | F1 (test) |
|---|---|
| LayoutXLM | 0.7921 |
| LayoutLMv3-Japanese-preview (Ours) | 0.7436 |
JDocQA (Japanese Document Visual Question Answering)
| Model | Task | F1 (test) | Accuracy (test) |
|---|---|---|---|
| LayoutLMv3-Japanese-preview (Ours) | Yes/No | 0.5750 | 0.7639 |
| LayoutLMv3-Japanese-preview (Ours) | Multiple Choice (4 options) | 0.8351 | 0.8353 |
| LayoutXLM | Yes/No | 0.5403 | 0.7847 |
| LayoutXLM | Multiple Choice (4 options) | 0.8544 | 0.8543 |
Example Usage
The tokenizer is an AlbertTokenizer (SentencePiece) inherited from
nlp-waseda/roberta-base-japanese,
so the standard LayoutLMv3Processor cannot be used as-is. Instead,
combine AutoTokenizer with LayoutLMv3ImageProcessor and align each
subword's bbox with the source word.
Requirements: transformers>=4.44, torch, Pillow. PyMuPDF
(imported as fitz) is optional and only needed if you want to
render a PDF on the fly; running the provided sample.png does not
require it. Words and bounding boxes can be produced by
any OCR engine that supports Japanese (e.g. PaddleOCR, Tesseract with
jpn traineddata, or manga-ocr). Boxes must be supplied as pixel
coordinates (x0, y0, x1, y1) and then normalized to LayoutLMv3's
0–1000 range before they are passed to the model.
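As one way to obtain the words and pixel-coordinate boxes, the sketch below converts the dictionary returned by pytesseract's `image_to_data` into the two lists. It assumes Tesseract with the jpn traineddata and the `pytesseract` package are installed; the helper names `ocr_dict_to_words_boxes` and `run_tesseract` are illustrative and not part of this repository.

```python
def ocr_dict_to_words_boxes(data, min_conf=0.0):
    """Convert a pytesseract image_to_data dict into (words, pixel boxes)."""
    words, boxes = [], []
    for text, left, top, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        text = text.strip()
        # Tesseract reports empty text and conf == -1 for non-word rows.
        if not text or float(conf) < min_conf:
            continue
        words.append(text)
        # Convert (left, top, width, height) to (x0, y0, x1, y1).
        boxes.append((left, top, left + w, top + h))
    return words, boxes


def run_tesseract(image):
    """Run Tesseract on a PIL image (needs pytesseract and jpn traineddata)."""
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, lang="jpn", output_type=Output.DICT)
    return ocr_dict_to_words_boxes(data)
```

The returned boxes are still in pixel coordinates, so they need the 0–1000 normalization before being passed to the model.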
A sample Japanese document image is provided as sample.png in this
repository for quick experimentation. The snippet below supports two
input paths: load sample.png directly, or render the first page of
any PDF via fitz (PyMuPDF) — uncomment the branch you want.
```python
import io  # only needed for Option B

import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Model

REPO = "llm-jp/layoutlmv3-japanese-preview"
tokenizer = AutoTokenizer.from_pretrained(REPO)
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
model = LayoutLMv3Model.from_pretrained(REPO).eval()

# --- Option A: use the bundled sample.png (no PyMuPDF required) ---
image = Image.open("sample.png").convert("RGB")

# --- Option B: render the first page of any PDF via fitz (PyMuPDF) ---
# import fitz
# doc = fitz.open("your_document.pdf")
# pix = doc[0].get_pixmap(dpi=200)
# image = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")

width, height = image.size

# Replace with your OCR output. Each box is (x0, y0, x1, y1) in pixel coords.
words = ["石巻市駅前北通り", "災害公営住宅", "完成資料"]
boxes = [
    (160, 28, 420, 70),
    (430, 28, 610, 70),
    (625, 28, 770, 70),
]


def normalize(box, w, h):
    """Scale a pixel-coordinate box to LayoutLMv3's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / w), int(1000 * y0 / h),
        int(1000 * x1 / w), int(1000 * y1 / h),
    ]


# Tokenize each word and assign the word's bbox to all of its subwords.
input_ids = [tokenizer.cls_token_id]
bbox = [[0, 0, 0, 0]]
for word, box in zip(words, [normalize(b, width, height) for b in boxes]):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids += ids
    bbox += [box] * len(ids)
input_ids.append(tokenizer.sep_token_id)
bbox.append([1000, 1000, 1000, 1000])

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([bbox])
attention_mask = torch.ones_like(input_ids)
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        bbox=bbox,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
    )

# (batch, num_text_tokens + num_image_patches, hidden_size), e.g. (1, 213, 768)
print(outputs.last_hidden_state.shape)
```
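The snippet above passes every subword to the model without a length limit. For longer pages, a sketch like the following could truncate and pad the token and box lists before they are converted to tensors. The limit of 512 text tokens and the helper name `truncate_and_pad` are assumptions, not values confirmed by this model card; check the checkpoint's configuration and `tokenizer.pad_token_id` for the actual values.

```python
def truncate_and_pad(input_ids, bbox, sep_id, pad_id, max_len=512):
    """Clip CLS ... SEP lists to max_len, then right-pad them to max_len.

    input_ids and bbox are plain Python lists (before torch.tensor).
    max_len=512 is an assumed limit for illustration only.
    """
    sep_box = bbox[-1]  # box that was assigned to the trailing SEP token
    if len(input_ids) > max_len:
        # Keep the first max_len - 1 tokens and re-append the separator.
        input_ids = input_ids[: max_len - 1] + [sep_id]
        bbox = bbox[: max_len - 1] + [sep_box]
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    # Padding tokens get an all-zero box and are masked out.
    input_ids = input_ids + [pad_id] * n_pad
    bbox = bbox + [[0, 0, 0, 0]] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return input_ids, bbox, attention_mask
```

The three returned lists can then be wrapped with `torch.tensor([...])` in the same way as in the snippet above, with the returned mask replacing `torch.ones_like(input_ids)`.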
For downstream tasks (token classification, QA, etc.), swap LayoutLMv3Model
for the corresponding task head class such as
LayoutLMv3ForTokenClassification or LayoutLMv3ForQuestionAnswering
and fine-tune on your labeled data.
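For token classification, word-level labels have to be expanded to the subword sequence, since each word may split into several tokens. The following is a minimal sketch, assuming the common convention of labeling only the first subword of each word and masking the remaining subwords and the special tokens with -100, which PyTorch's cross-entropy loss ignores by default. The helper `align_labels` is illustrative and not part of this repository.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(subword_counts, word_labels):
    """Expand word-level label ids to a CLS + subwords + SEP sequence.

    subword_counts[i] is the number of token ids the tokenizer produced
    for words[i] (with add_special_tokens=False).
    """
    labels = [IGNORE_INDEX]  # CLS position
    for n_subwords, label in zip(subword_counts, word_labels):
        if n_subwords == 0:
            continue  # the word produced no tokens, so it has no label slot
        labels.append(label)  # first subword carries the word's label
        labels.extend([IGNORE_INDEX] * (n_subwords - 1))
    labels.append(IGNORE_INDEX)  # SEP position
    return labels


# Three words split into 3, 1 and 2 subwords, with label ids 1, 0 and 2.
print(align_labels([3, 1, 2], [1, 0, 2]))
```

The resulting list has the same length as the text part of `input_ids`, so it can be passed as `labels` after conversion to a tensor.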
License
This model is licensed under Apache 2.0.
Pre-training Pipeline (Text & Layout Extraction)
For the NDL WARP PDF corpus, word-level text and bounding boxes were
extracted with PyMuPDF (fitz) —
not a learned OCR model — to build (word, bbox) pairs used as the 1D
text + 2D layout inputs for pre-training. PyMuPDF was used strictly as
an internal data-processing tool: the library was not modified,
was not redistributed, and is neither embedded in nor linked to the
published model weights. Under AGPL v3, the copyleft obligations
(§5 "Conveying Modified Source Versions" and §13 "Remote Network
Interaction") attach only when the covered software itself is
conveyed or served to users over a network. Running PyMuPDF
locally to produce derived data (text strings and coordinate tuples),
and then training a separate model on that data, is a permitted use
and does not cause AGPL terms to propagate to the resulting model
weights. We therefore consider this usage to be license-compliant.
The weights distributed here remain Apache 2.0.
Training Data and Legal Notice
- Pre-training data: NDL WARP (Japanese web archive), PubLayNet (CDLA-Permissive-1.0), DocLayNet (CDLA-Permissive-1.0)
The document images in PubLayNet originate from the PMC Open Access Commercial Use Collection, which includes articles under CC0, CC BY, CC BY-SA, and CC BY-ND licenses. We used these documents for model training in reliance on Article 30-4 of the Japanese Copyright Law (2026).
Users outside Japan should assess the applicability of their local copyright exceptions when using this model.