Demo Space: morocco-bo-region-classification-demo

LayoutLMv3 — Bulletin Officiel Segment Classifier

Fine-tuned microsoft/layoutlmv3-base for segment-level layout classification on French Bulletin Officiel (BO) documents.

Each annotated region (bounding box + OCR words inside) is encoded as a single sequence; the model predicts one of 14 canonical layout classes for the whole region (LayoutLMv3ForSequenceClassification).

Private model — trained on proprietary/internal BO annotations. See Training data below.

Model description

Property                   Value
Architecture               LayoutLMv3ForSequenceClassification
Base model                 microsoft/layoutlmv3-base
Task                       Segment-level sequence classification
Classes                    14 canonical BO layout roles
Input                      Page image + word tokens + word bounding boxes (normalized 0–1000)
Max sequence length        512
Best checkpoint selection  f1_macro on validation set
Training run               run_segments_20260619_015948 (early stopping, 19/60 epochs)
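
The 0–1000 box normalization listed under Input can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name normalize_box and the clamping step are assumptions, and the page size is whatever the rendered page image reports.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so slightly out-of-page OCR boxes stay inside the valid range
    # (assumption: the original pipeline may handle this differently).
    return [min(1000, max(0, v)) for v in scaled]


# A 1000 x 500 px page: x is already on the grid, y is doubled.
print(normalize_box([100, 50, 300, 150], page_width=1000, page_height=500))
# -> [100, 100, 300, 300]
```

Each word box passed to the processor would go through the same scaling, using the width and height of the page image it came from.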

Layout classes (14)

ARTICLE, PREAMBLE, TITLE, FOOTER, TABLE, FIGURE, ANNEXE_TITLE, ANNEXE_TEXT, SOMMAIRE, CHAPTER_CONTENT, ANNEXE_LEVEL, CHAPTER_TITLE, FORM, SECTION

Training data

Fine-tuned on a French Bulletin Officiel document layout dataset built in the layoutlmv3-bo pipeline.

Source documents (document-level split, no page leakage)

Document         Pages  Split
BO_7492_fr.pdf   90     train
BO_7506_Fr.pdf   100    train
BO_7496_Fr.pdf   13     train
BO_7514_Fr.pdf   4      train
BO_7510_Fr.pdf   32     val
BO_7500_Fr.pdf   26     test
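
A document-level split means every page inherits the split of its parent PDF, so no document can contribute pages to two splits. The sketch below reproduces that idea with the page counts from the table above; the dictionary layout and the function name split_pages are illustrative assumptions and do not reflect the actual format of metadata/split_manifest.json.

```python
# Split and page count per document, copied from the table above.
DOC_SPLIT = {
    "BO_7492_fr.pdf": "train",
    "BO_7506_Fr.pdf": "train",
    "BO_7496_Fr.pdf": "train",
    "BO_7514_Fr.pdf": "train",
    "BO_7510_Fr.pdf": "val",
    "BO_7500_Fr.pdf": "test",
}
DOC_PAGES = {
    "BO_7492_fr.pdf": 90,
    "BO_7506_Fr.pdf": 100,
    "BO_7496_Fr.pdf": 13,
    "BO_7514_Fr.pdf": 4,
    "BO_7510_Fr.pdf": 32,
    "BO_7500_Fr.pdf": 26,
}


def split_pages(doc_pages, doc_split):
    """Assign every page to the split of its parent document."""
    splits = {"train": [], "val": [], "test": []}
    for doc, n_pages in doc_pages.items():
        for page_idx in range(n_pages):
            splits[doc_split[doc]].append((doc, page_idx))
    return splits


splits = split_pages(DOC_PAGES, DOC_SPLIT)
docs_in = {name: {doc for doc, _ in pages} for name, pages in splits.items()}

# Leakage check: no document appears in more than one split.
assert docs_in["train"].isdisjoint(docs_in["val"])
assert docs_in["train"].isdisjoint(docs_in["test"])
assert docs_in["val"].isdisjoint(docs_in["test"])

print({name: len(pages) for name, pages in splits.items()})
# -> {'train': 207, 'val': 32, 'test': 26}
```

The three page totals sum to 265, matching the annotated page count reported in the segment statistics.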

Segment statistics

Split  Segments
Train  783 (+ augmentation)
Val    221
Test   147

  • 265 annotated pages across 6 PDFs
  • 1,172 annotated layout regions (bounding boxes)
  • 126,579 OCR words (PyMuPDF + EasyOCR hybrid on sparse pages)
  • Words inside each GT region form one training example (≥2 words per segment)
  • Augmentation: rare-class oversampling (×3), bbox jitter (±2 px)
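
The filtering and augmentation steps listed above can be sketched roughly as follows. Several details are assumptions made for illustration: the segment dictionary layout, the reading of "×3" as three copies in total, jitter applied to the region box in pixel space, and the rare_labels argument (the card does not list which classes were oversampled).

```python
import random


def jitter_box(box, max_shift, rng):
    """Shift each coordinate by a random integer in [-max_shift, +max_shift]."""
    return [coord + rng.randint(-max_shift, max_shift) for coord in box]


def augment(segments, rare_labels, factor=3, max_shift=2, seed=0):
    """Drop segments with fewer than 2 words, then oversample rare classes.

    Each rare-class segment ends up with `factor` copies in total:
    the original plus (factor - 1) copies with a jittered box.
    """
    rng = random.Random(seed)
    out = []
    for seg in segments:
        if len(seg["words"]) < 2:
            continue  # too few words to form a training example
        out.append(seg)
        if seg["label"] in rare_labels:
            for _ in range(factor - 1):
                out.append({**seg, "box": jitter_box(seg["box"], max_shift, rng)})
    return out


# Hypothetical segments, for illustration only.
segments = [
    {"label": "ARTICLE", "box": [10, 10, 50, 50], "words": ["Article", "premier", "."]},
    {"label": "FORM", "box": [10, 60, 50, 90], "words": ["Nom", ":"]},
    {"label": "TITLE", "box": [0, 0, 5, 5], "words": ["Arrêté"]},  # dropped: 1 word
]
augmented = augment(segments, rare_labels={"FORM"})
print(len(augmented))  # -> 4
```

Augmentation of this kind would apply to the training split only, which is consistent with the "(+ augmentation)" note on the Train row.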

Full dataset documentation: metadata/DATASET.md
Split manifest: metadata/split_manifest.json
Label schema: metadata/segment_label_schema.json

Training hyperparameters

Parameter                Value
Epochs requested         60
Epochs completed         19 (early stopping, patience 8)
Batch size (per device)  4
Gradient accumulation    2
Learning rate            2e-5
Weight decay             0.01
Warmup ratio             0.1
Model selection metric   f1_macro
FP16                     true
Class weights            inverse-frequency (segment_class_weights.json)
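
With a per-device batch size of 4 and gradient accumulation of 2, each optimizer step sees 4 × 2 = 8 examples per device. Inverse-frequency class weights can be sketched as below; the normalization used here (total / (number of classes × class count)) is a common convention and an assumption, since the exact formula behind segment_class_weights.json is not given in this card.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_total / (n_classes * class_count).

    Frequent classes get weights below 1, rare classes above 1.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {label: n_total / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution, for illustration only.
labels = ["ARTICLE"] * 6 + ["TABLE"] * 3 + ["FORM"] * 1
weights = inverse_frequency_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Weights like these are typically passed to the cross-entropy loss so that errors on rare classes cost more than errors on frequent ones.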

Training config: metadata/logs/run_config_segments_overnight.json
Per-epoch val metrics: metadata/logs/segments_overnight_20260619_015948_epochs.json
Step-level log: metadata/logs/training_log_segments.json
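
The early-stopping behavior (patience 8 on validation f1_macro) can be illustrated with a simplified counter. This is a sketch of the general rule, not the trainer's implementation, and the scores below are made up; the per-epoch log listed above is the authority on which epoch was actually best.

```python
def early_stop_epoch(scores, patience):
    """Return (last_epoch_run, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best score seen so far (higher is better).
    """
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch


# Hypothetical curve: improves for 11 epochs, then plateaus lower.
scores = [0.50 + 0.03 * i for i in range(11)] + [0.70] * 49
print(early_stop_epoch(scores, patience=8))  # -> (19, 11)
```

Under this counting rule, stopping after epoch 19 of 60 with patience 8 would correspond to a best validation score at epoch 11.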

Evaluation results

Held-out test set (BO_7500_Fr.pdf, 147 segments)

Metric           Value
Accuracy         0.9728
F1 macro         0.8500
F1 micro         0.9728
Precision macro  0.8492
Recall macro     0.8535

Per-class F1 (test)

Class          F1
ARTICLE        1.000
FOOTER         1.000
SOMMAIRE       1.000
TABLE          1.000
TITLE          0.950
PREAMBLE       0.941
ANNEXE_TITLE   0.909
CHAPTER_TITLE  0.000 (0 support in test)
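
The gap between F1 micro (0.9728) and F1 macro (0.8500) follows from how the two averages are defined. For single-label classification, micro F1 equals accuracy, while macro F1 is the unweighted mean over classes, so a single class scoring 0.000 pulls it down sharply. An accuracy of 0.9728 on 147 segments is consistent with 143 correct predictions. The sketch below uses toy labels and plain Python; it is not the script that produced metadata/evaluation/metrics.json.

```python
def per_class_f1(y_true, y_pred):
    """F1 per label, over every label seen in the truth or the predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def micro_f1(y_true, y_pred):
    """F1 from pooled counts; equals accuracy in the single-label case."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    wrong = len(y_true) - correct  # each error is one FP and one FN
    return 2 * correct / (2 * correct + 2 * wrong)


# Toy example: class "C" has no true instances but is predicted once.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "A", "B", "C"]
print(per_class_f1(y_true, y_pred))  # C scores 0.0 despite zero support
print(round(macro_f1(y_true, y_pred), 4), micro_f1(y_true, y_pred))
```

In the toy case, one wrong prediction into an unsupported class drops macro F1 to about 0.56 while micro F1 stays at 0.75, the same pattern as the CHAPTER_TITLE row above.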

Full metrics: metadata/evaluation/metrics.json
Classification report: metadata/evaluation/classification_report.txt
Aligned predictions: metadata/evaluation/aligned_predictions.json

Evaluation plots

All plots live under evaluation/plots/ (22 files: 7 summary charts + 15 test-set page overlays on BO_7500_Fr.pdf).

Summary charts

Plot                            File
Color legend (14 BO classes)    00_color_legend.png
Segments per class (dataset)    01_segments_per_class.png
Training / validation curves    01_training_curves.png
Confusion matrix (raw counts)   03_confusion_raw.png
Confusion matrix (normalized)   03_confusion_norm.png
Per-class F1 (test)             04_per_class_f1.png
Train segment distribution      05_train_segment_distribution.png

Test-set page overlays (BO_7500_Fr.pdf)

Predicted segment labels overlaid on held-out test pages (one color per BO class; see legend above).

Overlays are provided for BO_7500_Fr pages 002, 006, 007, 016, 017, 018, 019, 021, 022, 023, 025, 026, 029, 030 and 032.

Usage

import torch
from PIL import Image
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

model_id = "AvoCahDoe/layoutlmv3-bo-segments"
# The processor comes from the base checkpoint; apply_ocr=False because words and boxes are supplied.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForSequenceClassification.from_pretrained(model_id)
model.eval()

# words: list[str], boxes: list[[x0, y0, x1, y1]] normalized 0-1000 (one box per word)
image = Image.open("page.png").convert("RGB")
encoding = processor(
    image, words, boxes=boxes,
    return_tensors="pt", truncation=True, padding="max_length", max_length=512,
)
with torch.no_grad():
    outputs = model(**encoding)
pred_id = outputs.logits.argmax(-1).item()
# id2label keys are integers once the config is loaded
label = model.config.id2label[pred_id]
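
The argmax above returns only the top class. When a confidence score or the runner-up classes are useful, the logits can be turned into probabilities with a softmax. The helper below is a minimal sketch in plain Python; the names softmax and top_k and the three-entry id2label mapping are illustrative assumptions, not part of this repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical mapping and logits, for illustration only; the real label
# order lives in config.json.
id2label = {0: "ARTICLE", 1: "TITLE", 2: "TABLE"}
for label, prob in top_k([2.0, 0.5, -1.0], id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the loaded model, the equivalent call would be top_k(outputs.logits[0].tolist(), model.config.id2label).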

Pipeline integration

Used downstream in the RegionProposal pipeline: PP-DocLayout proposes region boxes → this model classifies each region into BO taxonomy.

python scripts/run_layoutlmv3_on_proposals.py --run-dir runs/BO_7458_fr
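
The proposal-then-classify flow can be sketched as follows. This is not the content of scripts/run_layoutlmv3_on_proposals.py, which is not shown here; the center-containment rule, the function names, and the classify_fn callback (which would wrap the processor and model calls from Usage) are all assumptions made for illustration.

```python
def words_in_box(words, boxes, region):
    """Select the words whose box center lies inside `region` ([x0, y0, x1, y1])."""
    rx0, ry0, rx1, ry1 = region
    picked_words, picked_boxes = [], []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            picked_words.append(word)
            picked_boxes.append([x0, y0, x1, y1])
    return picked_words, picked_boxes


def classify_proposals(proposals, words, boxes, classify_fn, min_words=2):
    """Classify each proposed region; regions with too few words get None."""
    results = []
    for region in proposals:
        seg_words, seg_boxes = words_in_box(words, boxes, region)
        if len(seg_words) < min_words:
            results.append((region, None))
            continue
        results.append((region, classify_fn(seg_words, seg_boxes)))
    return results


# Stub classifier standing in for the LayoutLMv3 call, for illustration only.
def fake_classify(seg_words, seg_boxes):
    return "TITLE" if seg_words[0].isupper() else "ARTICLE"


words = ["DAHIR", "n°", "Article", "premier", "seul"]
boxes = [[10, 10, 60, 30], [65, 10, 80, 30], [10, 100, 70, 120],
         [75, 100, 140, 120], [500, 500, 540, 520]]
proposals = [[0, 0, 200, 50], [0, 90, 200, 130], [480, 480, 600, 540]]
for region, label in classify_proposals(proposals, words, boxes, fake_classify):
    print(region, label)
```

The min_words default mirrors the ≥2 words per segment rule used when building the training examples.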

Repository layout

config.json                 # Model config + id2label / label2id
model.safetensors           # Fine-tuned weights (~481 MB)
training_args.bin           # HuggingFace TrainingArguments snapshot
metadata/
  DATASET.md                # Full dataset documentation
  segment_label_schema.json
  split_manifest.json
  evaluation/               # Test metrics, reports, predictions
  logs/                     # Training config and curves
evaluation/plots/           # Confusion matrices, training curves, overlays

Limitations

  • Requires pre-segmented regions (GT boxes or proposal boxes from PP-DocLayout); not an end-to-end detector.
  • Trained on 6 French government PDFs — may not generalize to other layouts or languages.
  • Rare classes (CHAPTER_TITLE, FORM, SECTION) have limited test support.
  • OCR quality affects performance (PyMuPDF primary, EasyOCR fallback).

Citation

@misc{layoutlmv3-bo-segments-2026,
  title={LayoutLMv3 Fine-tuned for French Bulletin Officiel Segment Classification},
  author={AvoCahDoe},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/AvoCahDoe/layoutlmv3-bo-segments}}
}
