DoctoModernBERT-fr-base

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

DoctoModernBERT-fr-base is a French medical encoder for biomedical and clinical NLP. It uses the ModernBERT architecture (149M parameters, up to 8192-token context) and is pretrained from scratch on FineMed-fr and FineMed-rephrased-fr. DoctoModernBERT performs best on a real-world proprietary clinical NER task and ranks among the top encoders on the academic DrBenchmark.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoModernBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.

For a classic RoBERTa encoder, see its sibling DoctoBERT-fr-base.

🚀 How to Use

Requires transformers 4.48.0 or later, the first release with ModernBERT support.

Fill-mask

Using AutoModelForMaskedLM:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctomodernbert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a sentence with exactly one mask token and run it through the model.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token at that position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))
```

Using a pipeline:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctomodernbert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))
```

For long inputs, load with attn_implementation="flash_attention_2" for faster, more memory-efficient attention.
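A minimal sketch of choosing the attention backend before loading, assuming that a CUDA GPU plus the optional `flash-attn` package are what make `flash_attention_2` usable, and falling back to PyTorch's `sdpa` otherwise. The helper names `pick_attn_implementation` and `load_encoder` are illustrative and not part of the model's API.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" only when a GPU and the flash-attn package are present."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"  # PyTorch scaled-dot-product attention, usable on CPU and GPU


def load_encoder(model_id: str = "doctolib-lab/doctomodernbert-fr-base"):
    """Load the encoder with the fastest attention backend available (hypothetical helper)."""
    # Heavy imports stay local so the selection helper above has no dependencies.
    import torch
    from transformers import AutoModel

    cuda = torch.cuda.is_available()
    attn = pick_attn_implementation(cuda)
    # FlashAttention 2 only supports half precision, so use bfloat16 on that path.
    dtype = torch.bfloat16 if attn == "flash_attention_2" else torch.float32
    model = AutoModel.from_pretrained(
        model_id, attn_implementation=attn, torch_dtype=dtype
    )
    return model.to("cuda" if cuda else "cpu")
```

Calling `load_encoder()` downloads the checkpoint. On a machine without a GPU or without `flash-attn` installed, it loads the same weights with `sdpa` attention instead.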

Fine-tuning

DoctoModernBERT fine-tunes like any BERT/ModernBERT encoder, with the appropriate task head or framework.
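As one illustration, a token-classification (NER) setup could look like the sketch below. The label names are hypothetical placeholders loosely based on the entity types mentioned in the evaluation section, `align_labels` and `load_ner_model` are illustrative helpers, and labeling only the first subword of each word is a common convention rather than the authors' documented recipe.

```python
from typing import List, Optional

MODEL_ID = "doctolib-lab/doctomodernbert-fr-base"
# Hypothetical BIO label set, for illustration only.
LABELS = ["O", "B-PATHOLOGY", "I-PATHOLOGY", "B-DRUG", "I-DRUG"]
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level labels to subword tokens: label the first subword, ignore the rest."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:          # special tokens carry no word index
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:    # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                        # continuation subword of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


def load_ner_model():
    """Load the tokenizer and the encoder with a freshly initialized NER head."""
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(LABELS),
        id2label=dict(enumerate(LABELS)),
        label2id={label: i for i, label in enumerate(LABELS)},
    )
    return tokenizer, model
```

In practice, `word_ids` would come from `encoding.word_ids()` on the output of a fast tokenizer called with `is_split_into_words=True`, and the aligned labels would be passed to the model as `labels`.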

📐 Model Overview

| Property | Value |
|---|---|
| Architecture | ModernBERT |
| Parameters | 149M total (110M backbone, 39M embeddings) |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP | GeGLU |
| Intermediate size | 1152 |
| Context window | 8192 tokens |
| Vocabulary size | 50,368 |
| Language | French |
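The parameter split above can be sanity-checked with a few lines of arithmetic. This sketch assumes the standard ModernBERT layout (bias-free linear layers, a fused QKV projection, and a GeGLU input projection of width 2 × intermediate size) and ignores the small normalization weights, so it is an approximation rather than an exact count.

```python
VOCAB_SIZE = 50_368
HIDDEN = 768
INTERMEDIATE = 1_152
LAYERS = 22

# Token embedding matrix: one HIDDEN-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN

# Per-layer weights under the assumptions stated above (no bias terms).
attn_qkv = HIDDEN * 3 * HIDDEN        # fused query/key/value projection
attn_out = HIDDEN * HIDDEN            # attention output projection
mlp_in = HIDDEN * 2 * INTERMEDIATE    # GeGLU: value and gate halves
mlp_out = INTERMEDIATE * HIDDEN       # MLP output projection
per_layer = attn_qkv + attn_out + mlp_in + mlp_out

backbone_params = LAYERS * per_layer
total_params = embedding_params + backbone_params

print(f"embeddings: {embedding_params / 1e6:.1f}M")  # embeddings: 38.7M
print(f"backbone:   {backbone_params / 1e6:.1f}M")   # backbone:   110.3M
print(f"total:      {total_params / 1e6:.1f}M")      # total:      149.0M
```

Under these assumptions the estimate lands on the 39M / 110M / 149M figures reported in the table.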

🔧 Training

The tokenizer is a SentencePiece BPE model of 50,368 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.
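The two tokenizer behaviors mentioned above can be illustrated with a toy sketch. This is plain Python mimicking the ideas, not the actual tokenizer: `split_digits` and `byte_fallback` are illustrative helpers, the real model would further split the non-digit pieces with its BPE merges, and byte fallback only applies to characters that are genuinely missing from the vocabulary.

```python
import re
from typing import List


def split_digits(text: str) -> List[str]:
    """Pre-tokenization sketch: each digit is its own piece, other runs stay together."""
    return re.findall(r"\d|\D+", text)


def byte_fallback(char: str) -> List[str]:
    """Represent an out-of-vocabulary character as SentencePiece-style UTF-8 byte tokens."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


print(split_digits("dose 125 mg"))  # ['dose ', '1', '2', '5', ' mg']
print(byte_fallback("\u20ac"))      # ['<0xE2>', '<0x82>', '<0xAC>']
```

Splitting digits keeps numeric values such as doses and lab results from being fragmented inconsistently, and byte fallback means no input character ever maps to an unknown token.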

DoctoModernBERT is pretrained from scratch on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr, over three phases totaling 240B tokens:

  1. Pretraining (200B tokens). Masked-language-modeling at 1024-token context on the full mix (long documents are hierarchically chunked to fit the context window rather than truncated), building broad medical-language representations.
  2. Context extension (20B tokens). Extends the context window from 1024 to 8192 tokens, training on a subset upsampled toward long documents.
  3. Annealing (20B tokens). Continued training on the biomedical & clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.
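The three phases above can be summarized as a small schedule, which also makes the token budget easy to check. This is a descriptive sketch only: the `Phase` class and its field names are hypothetical, and the annealing context length is left unset because it is not stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    tokens_b: int                   # token budget, in billions
    context_length: Optional[int]   # None when the value is not stated
    data: str


SCHEDULE = [
    Phase("pretraining", 200, 1024, "full mix"),
    Phase("context extension", 20, 8192, "subset upsampled toward long documents"),
    # The annealing context length is not given in the description above.
    Phase("annealing", 20, None, "biomedical & clinical subdomains"),
]

total = sum(phase.tokens_b for phase in SCHEDULE)
for phase in SCHEDULE:
    share = phase.tokens_b / total
    print(f"{phase.name:<18} {phase.tokens_b:>4}B tokens ({share:.1%} of budget)")
print(f"total: {total}B tokens")
```

The budgets sum to the stated 240B tokens, with roughly five sixths spent in the first phase.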

📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoModernBERT-fr-base achieves the best F1 on the real-world clinical NER task and ranks third on both aggregate DrBenchmark scores (Min-Max and WP), behind DoctoBERT-fr and TransBERT-bio-fr.

DrBenchmark

We evaluate on an adapted version of DrBenchmark, filtered to 7 high-quality tasks from 5 datasets: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts, scored as two tasks), E3C (clinical NER and temporal NER, also two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and we report the mean F1 over 5 seeds on the test split. We also report two per-model scores aggregated across all tasks. Min-Max rescales each task's scores to a 0–100 range before averaging, which captures the magnitude of the gaps (a model that lags on a few tasks is penalized heavily). WP (Win Probability) is the percentage of tasks on which a model outscores another model, averaged over all other models; it captures rank consistency and is robust to outliers but blind to effect size.

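| Family | Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MORFITT | DEFT2021 | DIAMED | Min-Max | WP |
|---|---|---|---|---|---|---|---|---|---|---|
| English medical | BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| English medical | BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| English medical | ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| French generalist | CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| French generalist | ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| French medical | DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| French medical | CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| French medical | TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| French medical | ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| Ours | DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| Ours | DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |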
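The two aggregate scores can be recomputed from the per-task F1 values in the table. The sketch below is one reading of the definitions given above, with ties counted as losses (an assumption); because it starts from the rounded table values, results can differ from the reported ones in the last digit.

```python
# Per-task F1 in column order: EMEA, MEDLINE, E3C-Clin, E3C-Temp, MORFITT, DEFT2021, DIAMED.
SCORES = {
    "BioBERT": [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio": [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT": [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT": [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT": [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio": [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr": [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio": [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr": [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr": [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}


def min_max(scores):
    """Rescale each task to 0-100 across models, then average per model."""
    columns = list(zip(*scores.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return {
        model: sum(
            100 * (value - low) / (high - low)
            for value, low, high in zip(values, lows, highs)
        ) / len(values)
        for model, values in scores.items()
    }


def win_probability(scores):
    """Percentage of (rival, task) comparisons that a model wins with a strictly higher score."""
    result = {}
    for model, values in scores.items():
        rivals = [other for name, other in scores.items() if name != model]
        wins = sum(v > r for other in rivals for v, r in zip(values, other))
        result[model] = 100 * wins / (len(rivals) * len(values))
    return result


mm, wp = min_max(SCORES), win_probability(SCORES)
for model in SCORES:
    print(f"{model:<24} Min-Max={mm[model]:6.2f}  WP={wp[model]:6.2f}")
```

With 11 models and 7 tasks, each model takes part in 70 pairwise comparisons, so WP moves in steps of about 1.43 points.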

Real-world Clinical NER

This benchmark is a proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 seeds on the test split.

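| Family | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| English medical | BioBERT | 77.54 | 78.42 | 77.97 |
| English medical | BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| English medical | ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| French generalist | CamemBERT | 77.19 | 79.58 | 78.36 |
| French generalist | ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| French medical | DrBERT | 76.77 | 77.81 | 77.28 |
| French medical | CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| French medical | TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| French medical | ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| Ours | DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| Ours | DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |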

⚠️ Intended Use & Limitations

DoctoModernBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval, long-document tasks), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

⚖️ License

Released under Apache-2.0. DoctoModernBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
