Demo Space: morocco-bo-region-classification-demo

LayoutLMv3 — Bulletin Officiel Segment Classifier

Fine-tuned microsoft/layoutlmv3-base for segment-level layout classification on French Bulletin Officiel (BO) documents.

Each annotated region (bounding box + OCR words inside) is encoded as a single sequence; the model predicts one of 14 canonical layout classes for the whole region (LayoutLMv3ForSequenceClassification).

Private model — trained on proprietary/internal BO annotations. See Training data below.

Model description

Property	Value
Architecture	`LayoutLMv3ForSequenceClassification`
Base model	`microsoft/layoutlmv3-base`
Task	Segment-level sequence classification
Classes	14 canonical BO layout roles
Input	Page image + word tokens + word bounding boxes (normalized 0–1000)
Max sequence length	512
Best checkpoint selection	`f1_macro` on validation set
Training run	`run_segments_20260619_015948` (early stopping, 19/60 epochs)
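
The Input row above expects word boxes on a 0–1000 grid. A minimal sketch of that scaling, assuming the boxes start out in page pixel (or PDF point) coordinates. The helper name `normalize_box` and the example page size are illustrative and not taken from the training pipeline.

```python
def normalize_box(box, page_width, page_height):
    """Scale a page-space [x0, y0, x1, y1] box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so boxes that slightly overshoot the page stay in range.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical word box on a 595 x 842 page.
print(normalize_box([119, 84, 476, 168], 595, 842))  # [200, 99, 800, 199]
```

Clamping is a defensive choice here: coordinates outside 0–1000 would fall outside the range the model's position embeddings expect.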

Layout classes (14)

ARTICLE, PREAMBLE, TITLE, FOOTER, TABLE, FIGURE, ANNEXE_TITLE, ANNEXE_TEXT, SOMMAIRE, CHAPTER_CONTENT, ANNEXE_LEVEL, CHAPTER_TITLE, FORM, SECTION

Training data

Fine-tuned on a French Bulletin Officiel document layout dataset built in the layoutlmv3-bo pipeline.

Source documents (document-level split, no page leakage)

Document	Annotated pages	Split
`BO_7492_fr.pdf`	90	train
`BO_7506_Fr.pdf`	100	train
`BO_7496_Fr.pdf`	13	train
`BO_7514_Fr.pdf`	4	train
`BO_7510_Fr.pdf`	32	val
`BO_7500_Fr.pdf`	26	test
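
A document-level split means the split is decided by the source PDF, so pages of one document can never land in two splits. The sketch below encodes the table above and totals the pages per split. The `"document#page"` identifier format used by `split_of` is a hypothetical convention for illustration only.

```python
from collections import Counter

# Page counts and split assignment, copied from the table above.
DOCUMENTS = {
    "BO_7492_fr.pdf": (90, "train"),
    "BO_7506_Fr.pdf": (100, "train"),
    "BO_7496_Fr.pdf": (13, "train"),
    "BO_7514_Fr.pdf": (4, "train"),
    "BO_7510_Fr.pdf": (32, "val"),
    "BO_7500_Fr.pdf": (26, "test"),
}


def split_of(page_id):
    """Look up the split from the document name, never from the page itself."""
    document = page_id.split("#")[0]
    return DOCUMENTS[document][1]


pages_per_split = Counter()
for document, (n_pages, split) in DOCUMENTS.items():
    pages_per_split[split] += n_pages

print(dict(pages_per_split))  # {'train': 207, 'val': 32, 'test': 26}
```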

Segment statistics

Split	Segments
Train	783 (+ augmentation)
Val	221
Test	147

265 annotated pages across 6 PDFs
1,172 annotated layout regions (bounding boxes)
126,579 OCR words (PyMuPDF text extraction, with EasyOCR as a fallback on sparse pages)
Words inside each GT region form one training example (≥2 words per segment)
Augmentation: rare-class oversampling (×3), bbox jitter (±2 px)
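
To make the "words inside each GT region" rule concrete, here is a minimal sketch of how one training example might be assembled. The card does not say how "inside" is decided, so the centre-point test below is an assumption, as are the helper names and the toy word boxes.

```python
def center_inside(word_box, region_box):
    """True if the centre of word_box lies within region_box (assumed rule)."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    return (region_box[0] <= cx <= region_box[2]
            and region_box[1] <= cy <= region_box[3])


def build_segment(region, page_words, min_words=2):
    """Collect the words of one region; return None below min_words."""
    inside = [w for w in page_words if center_inside(w["box"], region["box"])]
    if len(inside) < min_words:
        return None
    return {
        "label": region["label"],
        "words": [w["text"] for w in inside],
        "boxes": [w["box"] for w in inside],
    }


# Toy page: two words in an ARTICLE region, one lone word in a FOOTER region.
page_words = [
    {"text": "Article", "box": [100, 100, 160, 115]},
    {"text": "premier", "box": [165, 100, 230, 115]},
    {"text": "12", "box": [500, 900, 515, 912]},
]
article = build_segment({"label": "ARTICLE", "box": [90, 90, 400, 130]}, page_words)
footer = build_segment({"label": "FOOTER", "box": [480, 890, 540, 920]}, page_words)
print(article["words"], footer)  # ['Article', 'premier'] None
```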

Full dataset documentation: metadata/DATASET.md
Split manifest: metadata/split_manifest.json
Label schema: metadata/segment_label_schema.json
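
The augmentation listed under Segment statistics (rare-class oversampling ×3, bbox jitter ±2 px) could be sketched as follows. Several details are assumptions: that ×3 means three copies in total, that the jitter is applied to the extra copies only, that it shifts each coordinate independently, and which labels count as rare.

```python
import random


def jitter_box(box, rng, max_shift=2):
    """Shift each coordinate independently by an integer in [-max_shift, max_shift]."""
    return [v + rng.randint(-max_shift, max_shift) for v in box]


def augment(segments, rare_labels, factor=3, seed=0):
    """Keep every segment once; add jittered copies of rare-class segments."""
    rng = random.Random(seed)
    out = []
    for seg in segments:
        out.append(seg)
        if seg["label"] in rare_labels:
            for _ in range(factor - 1):
                out.append({"label": seg["label"],
                            "box": jitter_box(seg["box"], rng)})
    return out


# Toy input: one common-class and one rare-class segment.
segments = [
    {"label": "ARTICLE", "box": [100, 100, 400, 300]},
    {"label": "FORM", "box": [50, 500, 550, 800]},
]
augmented = augment(segments, rare_labels={"FORM"})
print(len(augmented))  # 4: ARTICLE once, FORM three times
```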

Training hyperparameters

Parameter	Value
Epochs requested	60
Epochs completed	19 (early stopping, patience 8)
Batch size (per device)	4
Gradient accumulation	2
Learning rate	2e-5
Weight decay	0.01
Warmup ratio	0.1
Best-checkpoint metric	`f1_macro`
FP16	true
Class weights	inverse-frequency (segment_class_weights.json)
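
Inverse-frequency class weights give rare classes a larger share of the loss. The exact formula stored in segment_class_weights.json is not shown here, so the sketch below uses one common normalization, `total / (n_classes * count)`, as an assumption. The label counts are toy values.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Toy label distribution, not the real training counts.
toy_labels = ["ARTICLE"] * 6 + ["TITLE"] * 3 + ["FORM"] * 1
weights = inverse_frequency_weights(toy_labels)
for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label}: {weight:.3f}")
```

With this normalization a perfectly balanced dataset yields a weight of 1.0 for every class.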

Training config: metadata/logs/run_config_segments_overnight.json
Per-epoch val metrics: metadata/logs/segments_overnight_20260619_015948_epochs.json
Step-level log: metadata/logs/training_log_segments.json
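
Two of the hyperparameters above interact with the run length. A per-device batch of 4 with gradient accumulation 2 gives 8 sequences per optimizer step on a single device, and patience 8 stops training after eight evaluations without improvement. The sketch below illustrates the patience rule, assuming one evaluation per epoch. The score curve is made up to reproduce a stop at epoch 19 and is not the real validation log.

```python
def run_with_early_stopping(scores, patience):
    """Return (last_epoch_run, best_epoch) for a sequence of per-epoch scores."""
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch


effective_batch = 4 * 2  # per-device batch size x gradient accumulation steps

# Synthetic f1_macro curve: improves for 11 epochs, then plateaus lower.
synthetic_scores = [0.50 + 0.03 * i for i in range(11)] + [0.70] * 49
print(run_with_early_stopping(synthetic_scores, patience=8))  # (19, 11)
```

Under this rule a stop at epoch 19 with patience 8 would put the best checkpoint at epoch 11. The per-epoch metrics file is the authoritative source for the actual best epoch.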

Evaluation results

Held-out test set (`BO_7500_Fr.pdf`, 147 segments)

Metric	Value
Accuracy	0.9728
F1 macro	0.8500
F1 micro	0.9728
Precision macro	0.8492
Recall macro	0.8535
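
Accuracy and F1 micro match in the table because, in single-label classification, every error counts once as a false positive and once as a false negative. The sketch below shows this on toy labels. It also checks that 0.9728 is what 143 correct out of 147 rounds to. That count is inferred from the rounded figure and is not stated in the metrics files quoted here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all classes for single-label predictions."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - tp
    # Each error is one false positive (predicted class)
    # and one false negative (true class).
    fp = fn = errors
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels, not the real test predictions.
y_true = ["ARTICLE", "ARTICLE", "TITLE", "TABLE", "FOOTER"]
y_pred = ["ARTICLE", "TITLE", "TITLE", "TABLE", "FOOTER"]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))  # 0.8 0.8
print(round(143 / 147, 4))  # 0.9728
```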

Per-class F1 (test)

Class	F1
ARTICLE	1.000
FOOTER	1.000
SOMMAIRE	1.000
TABLE	1.000
TITLE	0.950
PREAMBLE	0.941
ANNEXE_TITLE	0.909
CHAPTER_TITLE	0.000 (0 support in test)
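
A class with no test support and no predictions has an undefined F1, which is conventionally reported as 0. Whether such a class is included in the macro average changes the result noticeably. The sketch below shows both variants on toy data. Which variant produced the F1 macro reported above is not stated in this card, so treat the comparison as illustrative.

```python
def per_class_f1(y_true, y_pred, labels):
    """F1 per label; 0.0 when a label has no support and no predictions."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: CHAPTER_TITLE never occurs and is never predicted.
labels = ["ARTICLE", "TITLE", "CHAPTER_TITLE"]
y_true = ["ARTICLE", "ARTICLE", "TITLE", "TITLE"]
y_pred = ["ARTICLE", "ARTICLE", "TITLE", "TITLE"]

scores = per_class_f1(y_true, y_pred, labels)
macro_all = sum(scores.values()) / len(labels)
present = [label for label in labels if label in y_true]
macro_present = sum(scores[label] for label in present) / len(present)
print(scores["CHAPTER_TITLE"], round(macro_all, 3), macro_present)  # 0.0 0.667 1.0
```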

Full metrics: metadata/evaluation/metrics.json
Classification report: metadata/evaluation/classification_report.txt
Aligned predictions: metadata/evaluation/aligned_predictions.json

Evaluation plots

All plots live under evaluation/plots/ (22 files: 7 summary charts + 15 test-set page overlays on BO_7500_Fr.pdf).

Summary charts

Plot	File
Color legend (14 BO classes)	`00_color_legend.png`
Segments per class (dataset)	`01_segments_per_class.png`
Training / validation curves	`01_training_curves.png`
Confusion matrix (raw counts)	`03_confusion_raw.png`
Confusion matrix (normalized)	`03_confusion_norm.png`
Per-class F1 (test)	`04_per_class_f1.png`
Train segment distribution	`05_train_segment_distribution.png`

Test-set page overlays (`BO_7500_Fr.pdf`)

Predicted segment labels overlaid on held-out test pages (one color per BO class; see legend above).

Page	File
page 002	`overlay_BO_7500_Fr__doc7__page_002.png`
page 006	`overlay_BO_7500_Fr__doc7__page_006.png`
page 007	`overlay_BO_7500_Fr__doc7__page_007.png`
page 016	`overlay_BO_7500_Fr__doc7__page_016.png`
page 017	`overlay_BO_7500_Fr__doc7__page_017.png`
page 018	`overlay_BO_7500_Fr__doc7__page_018.png`
page 019	`overlay_BO_7500_Fr__doc7__page_019.png`
page 021	`overlay_BO_7500_Fr__doc7__page_021.png`
page 022	`overlay_BO_7500_Fr__doc7__page_022.png`
page 023	`overlay_BO_7500_Fr__doc7__page_023.png`
page 025	`overlay_BO_7500_Fr__doc7__page_025.png`
page 026	`overlay_BO_7500_Fr__doc7__page_026.png`
page 029	`overlay_BO_7500_Fr__doc7__page_029.png`
page 030	`overlay_BO_7500_Fr__doc7__page_030.png`
page 032	`overlay_BO_7500_Fr__doc7__page_032.png`

Usage

import torch
from PIL import Image
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

model_id = "AvoCahDoe/layoutlmv3-bo-segments"
# The processor comes from the base checkpoint; apply_ocr=False because words and boxes are supplied.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForSequenceClassification.from_pretrained(model_id)
model.eval()

# words: list[str] for one region; boxes: list[[x0, y0, x1, y1]] normalized to 0-1000
image = Image.open("page.png").convert("RGB")
encoding = processor(
    image, words, boxes=boxes,
    return_tensors="pt", truncation=True, padding="max_length", max_length=512,
)
with torch.no_grad():
    outputs = model(**encoding)
pred_id = outputs.logits.argmax(-1).item()
# id2label keys are ints once the config is loaded, so index with the int directly.
label = model.config.id2label[pred_id]

Pipeline integration

Used downstream in the RegionProposal pipeline: PP-DocLayout proposes region boxes, then this model classifies each region into the BO taxonomy.

python scripts/run_layoutlmv3_on_proposals.py --run-dir runs/BO_7458_fr
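
The internals of that script are not reproduced in this card. As a rough sketch of the propose-then-classify flow, the loop below takes regions that already carry their words and passes each one to a classifier callable. The function names, the region dictionary layout, and the choice to skip regions with fewer than two words (mirroring the training rule) are all assumptions. `dummy_classify` is a stand-in for the real model call.

```python
def classify_regions(regions, classify, min_words=2):
    """Label each proposed region; regions with too few words get None."""
    results = []
    for region in regions:
        if len(region["words"]) < min_words:
            label = None
        else:
            label = classify(region["words"], region["word_boxes"])
        results.append({"box": region["box"], "label": label})
    return results


def dummy_classify(words, word_boxes):
    """Stand-in for the model call; labels by a trivial keyword rule."""
    return "ARTICLE" if words[0].lower().startswith("article") else "PREAMBLE"


# Toy proposals with normalized word boxes.
regions = [
    {"box": [90, 90, 400, 130], "words": ["Article", "premier"],
     "word_boxes": [[100, 100, 160, 115], [165, 100, 230, 115]]},
    {"box": [480, 890, 540, 920], "words": ["12"],
     "word_boxes": [[500, 900, 515, 912]]},
]
for result in classify_regions(regions, dummy_classify):
    print(result)
```

Passing the classifier in as a callable keeps the loop testable without loading the model.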

Repository layout

config.json                 # Model config + id2label / label2id
model.safetensors           # Fine-tuned weights (~481 MB)
training_args.bin           # HuggingFace TrainingArguments snapshot
metadata/
  DATASET.md                # Full dataset documentation
  segment_label_schema.json
  split_manifest.json
  evaluation/               # Test metrics, reports, predictions
  logs/                     # Training config and curves
evaluation/plots/           # Confusion matrices, training curves, overlays
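
When reading config.json directly rather than through Transformers, note that JSON object keys are always strings, so the `id2label` keys arrive as `"0"`, `"1"`, and so on. A minimal sketch of loading the mapping with integer keys follows. The two-entry config text is an illustrative stand-in, and the real id order should be taken from the file itself.

```python
import json


def load_id2label(config_text):
    """Parse a config.json string and return id2label with integer keys."""
    config = json.loads(config_text)
    return {int(key): value for key, value in config["id2label"].items()}


# Illustrative excerpt only; the actual ids may be ordered differently.
example_config = '{"id2label": {"0": "ARTICLE", "1": "PREAMBLE"}}'
id2label = load_id2label(example_config)
print(id2label[0])  # ARTICLE
```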

Limitations

Requires pre-segmented regions (GT boxes or proposal boxes from PP-DocLayout); not an end-to-end detector.
Trained on 6 French government PDFs — may not generalize to other layouts or languages.
Rare classes (CHAPTER_TITLE, FORM, SECTION) have limited test support; CHAPTER_TITLE has none in the test set, so its reported F1 of 0.000 is not informative.
OCR quality affects performance (PyMuPDF primary, EasyOCR fallback).

Citation

@misc{layoutlmv3-bo-segments-2026,
  title={LayoutLMv3 Fine-tuned for French Bulletin Officiel Segment Classification},
  author={AvoCahDoe},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/AvoCahDoe/layoutlmv3-bo-segments}}
}

Acknowledgments

Base model: microsoft/layoutlmv3-base (LayoutLMv3, Microsoft)
Training framework: Hugging Face Transformers
