DoctoModernBERT-fr-base

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

DoctoModernBERT-fr-base is a French medical encoder for biomedical and clinical NLP. It uses the ModernBERT architecture (149M parameters, up to 8192-token context) and is pretrained from scratch on FineMed-fr and FineMed-rephrased-fr. DoctoModernBERT performs best on a real-world proprietary clinical NER task and ranks among the top encoders on the academic DrBenchmark.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoModernBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.
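
As a rough illustration of the annotate-then-filter step, the sketch below keeps documents that pass the two thresholds given in the Training section (educational quality >= 4 and medical-term density >= 0.1). The `Doc` class, its field names, and the example records are assumptions made for illustration, not the actual FineMed schema.

```python
from dataclasses import dataclass

MIN_EDU_QUALITY = 4       # threshold quoted in the Training section
MIN_TERM_DENSITY = 0.1    # threshold quoted in the Training section

@dataclass
class Doc:
    text: str
    subdomain: str                # hypothetical field name and values
    educational_quality: int      # hypothetical field name
    medical_term_density: float   # hypothetical field name

def keep(doc: Doc) -> bool:
    """Return True when a document passes both quality thresholds."""
    return (doc.educational_quality >= MIN_EDU_QUALITY
            and doc.medical_term_density >= MIN_TERM_DENSITY)

# Made-up records, only to exercise the predicate.
docs = [
    Doc("Compte rendu de consultation ...", "clinical", 5, 0.22),
    Doc("Article de presse généraliste ...", "other", 3, 0.04),
    Doc("Notice de médicament ...", "biomedical", 4, 0.10),
]
filtered = [d for d in docs if keep(d)]
print([d.subdomain for d in filtered])
```

The subdomain annotation is carried along but not used by this predicate; in the card it only comes into play for the annealing phase.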

For a classic RoBERTa encoder, see its sibling DoctoBERT-fr-base.

🚀 How to Use

Requires transformers v4.48.0 or later, the first release with ModernBERT support.

Fill-mask

Using AutoModelForMaskedLM:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctomodernbert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build the prompt with the tokenizer's own mask token.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))

Using a pipeline:

from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctomodernbert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))

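For long inputs, load the model with attn_implementation="flash_attention_2" for faster, more memory-efficient attention. This requires the flash-attn package, a compatible GPU, and half-precision (fp16 or bf16) weights.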
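
A hedged sketch of that loading step: the helper below (`pick_attn_implementation`, a hypothetical name that is not part of transformers) selects `"flash_attention_2"` only when a CUDA device is reported and the `flash_attn` package can be found, and otherwise falls back to `"sdpa"`. The model download is left in comments so the snippet runs offline.

```python
import importlib.util

def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" when it looks usable, else "sdpa"."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation(cuda_available=False))

# Usage (downloads the checkpoint, so shown as comments):
# import torch
# from transformers import AutoModel
# model = AutoModel.from_pretrained(
#     "doctolib-lab/doctomodernbert-fr-base",
#     attn_implementation=pick_attn_implementation(torch.cuda.is_available()),
#     torch_dtype=torch.bfloat16,  # FlashAttention-2 needs fp16 or bf16
# )
```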

Fine-tuning

DoctoModernBERT fine-tunes like any BERT/ModernBERT encoder, with the appropriate task head or framework:

For sequence classification, load it with AutoModelForSequenceClassification (see the text-classification guide).
For token classification (NER), use AutoModelForTokenClassification (see the token-classification guide).
For embeddings / retrieval, use Sentence Transformers or PyLate.
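
For token classification, word-level NER labels have to be spread over subword tokens before training. The sketch below shows one common convention: label the first subword of each word and mask the remaining subwords and the special tokens with -100, the default `ignore_index` of PyTorch's cross-entropy loss. `align_labels` and the example values are hypothetical; `word_ids` is assumed to come from a fast tokenizer's `word_ids()` method.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens.

    word_ids[i] is the index of the word that token i belongs to, or None
    for special tokens. Only the first subword of each word keeps its label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned

# Hypothetical example: three words, the second split into two subwords,
# framed by two special tokens. Label ids are made up.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 0]
print(align_labels(word_ids, word_labels))  # [-100, 0, 3, -100, 0, -100]
```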

📐 Model Overview

Property	Value
Architecture	ModernBERT
Parameters	149M total (110M backbone, 39M embeddings)
Layers	22
Hidden size	768
Attention heads	12
MLP	GeGLU
Intermediate size	1152
Context window	8192 tokens
Vocabulary size	50,368
Language	French

🔧 Training

The tokenizer is a SentencePiece BPE model of 50,368 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.
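
To make the last two tokenizer properties concrete, here is a toy illustration in plain Python. It is not the model's tokenizer: `split_digits` and `byte_fallback` are hypothetical helpers that only mimic the described behavior, with byte pieces written in the `<0xNN>` form that SentencePiece uses.

```python
import re
from typing import List

def split_digits(text: str) -> List[str]:
    """Split text so that every digit becomes its own piece."""
    return re.findall(r"\d|\D+", text)

def byte_fallback(char: str) -> List[str]:
    """Represent a character as one piece per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]

print(split_digits("120 mg"))  # ['1', '2', '0', ' mg']
print(byte_fallback("€"))      # ['<0xE2>', '<0x82>', '<0xAC>']
```

With digit splitting, a dose such as 120 is always encoded as the three pieces 1, 2, 0, so numbers never depend on which multi-digit strings happened to enter the vocabulary.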

DoctoModernBERT is pretrained from scratch on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr, over three phases totaling 240B tokens:

Pretraining (200B tokens). Masked language modeling at 1024-token context on the full mix, building broad medical-language representations. Long documents are hierarchically chunked to fit the context window rather than truncated.
Context extension (20B tokens). Extends the context window from 1024 to 8192 tokens, training on a subset upsampled toward long documents.
Annealing (20B tokens). Continued training on the biomedical and clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.
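
The card does not spell out how hierarchical chunking is done. One plausible reading, sketched below, is to split at the coarsest boundary first (paragraphs), fall back to finer ones (sentences, then words) only for pieces that are still too long, and greedily pack neighbouring pieces up to the budget. Whitespace word counts stand in for tokenizer lengths, and all function names are hypothetical.

```python
import re
from typing import Callable, List

# Splitters from coarsest to finest: paragraphs, sentences, words.
SPLITTERS: List[Callable[[str], List[str]]] = [
    lambda t: t.split("\n\n"),
    lambda t: re.split(r"(?<=[.!?])\s+", t),
    lambda t: t.split(),
]

def n_tokens(text: str) -> int:
    """Stand-in length function; a real pipeline would use the tokenizer."""
    return len(text.split())

def split_to_fit(text: str, max_tokens: int, level: int = 0) -> List[str]:
    """Recursively split text until every piece fits in max_tokens."""
    if n_tokens(text) <= max_tokens or level >= len(SPLITTERS):
        return [text]
    pieces = []
    for part in SPLITTERS[level](text):
        if part.strip():
            pieces.extend(split_to_fit(part, max_tokens, level + 1))
    return pieces

def chunk(text: str, max_tokens: int) -> List[str]:
    """Greedily pack consecutive pieces into chunks of at most max_tokens."""
    chunks, current, size = [], [], 0
    for piece in split_to_fit(text, max_tokens):
        length = n_tokens(piece)
        if current and size + length > max_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(piece)
        size += length
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "A b c. D e f g h.\n\nI j."
chunks = chunk(doc, max_tokens=4)
print(chunks)
```

Unlike truncation, every word of the document ends up in some chunk; the trade-off in this sketch is that packing joins pieces with a single space, so paragraph breaks are not preserved.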

📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoModernBERT-fr-base achieves the best F1 on the real-world clinical NER task and ranks among the top encoders on the academic DrBenchmark.

DrBenchmark

We use an adapted version of DrBenchmark, filtered to 7 high-quality tasks from 5 datasets: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts), E3C (clinical NER and temporal NER), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and we report the mean F1 over 5 seeds on the test split.

We also report two per-model scores aggregated across all tasks:

Min-Max rescales each task's scores to a 0–100 range before averaging across tasks. It captures the magnitude of the gaps, so a model that lags on a few tasks is penalized heavily.
WP (Win Probability) is the average percentage of tasks on which a model outscores each other model. It captures rank consistency: it is robust to outliers but blind to effect size.
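
The two aggregate scores can be read as the short computation below, applied to the per-task means from the DrBenchmark table that follows. This is an interpretation of the definitions above, not the authors' code: ties are counted as non-wins, and the function names are hypothetical. With the rounded table values this reading gives a WP of 97.14 for DoctoBERT-fr and 75.71 for DoctoModernBERT-fr, as in the table; Min-Max values can deviate slightly from the reported column, so the exact procedure behind it may differ in detail.

```python
from statistics import mean

# Mean F1 per task (EMEA, MEDLINE, E3C-Clin, E3C-Temp, MORFITT, DEFT2021,
# DIAMED), copied from the DrBenchmark table in this card.
SCORES = {
    "BioBERT":                [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio":         [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT":              [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT":        [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT":                 [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio":          [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr":       [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio":    [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr":           [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr":     [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}

def min_max(scores):
    """Rescale each task to 0-100 across models, then average per model.

    Assumes every task has at least two distinct scores.
    """
    columns = list(zip(*scores.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        model: mean(100 * (x - lo) / (hi - lo)
                    for x, lo, hi in zip(row, lows, highs))
        for model, row in scores.items()
    }

def win_probability(scores):
    """Percentage of (rival, task) pairs on which a model scores strictly higher."""
    result = {}
    for model, row in scores.items():
        rivals = [r for name, r in scores.items() if name != model]
        wins = sum(x > y for rival in rivals for x, y in zip(row, rival))
        result[model] = 100 * wins / (len(rivals) * len(row))
    return result

mm = min_max(SCORES)
wp = win_probability(SCORES)
for model in SCORES:
    print(f"{model:24s} Min-Max={mm[model]:6.2f}  WP={wp[model]:6.2f}")
```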

Model	EMEA	MEDLINE	E3C-Clin	E3C-Temp	MORFITT	DEFT2021	DIAMED	Min-Max	WP
English medical
BioBERT	58.77	50.29	55.02	78.29	66.99	56.72	59.26	29.97	15.71
BioClinical-ModernBERT	44.74	44.44	49.53	76.11	67.42	53.97	52.07	0.88	1.43
ModernBERT-bio	56.84	46.60	53.76	78.85	68.57	56.43	61.06	29.35	17.14
French generalist
CamemBERT	65.43	56.18	59.82	83.81	71.54	62.40	60.26	69.37	57.14
ModernCamemBERT	61.98	55.46	57.62	83.11	70.01	60.01	53.26	52.69	28.57
French medical
DrBERT	64.37	57.18	58.01	82.44	70.42	61.08	64.87	65.08	44.29
CamemBERT-bio	64.98	59.03	61.40	84.88	71.48	64.73	64.63	80.83	70.00
TransBERT-bio-fr	67.37	59.96	62.36	84.48	74.04	65.48	70.91	93.88	88.57
ModernCamemBERT-bio	65.35	56.81	58.63	83.31	71.21	61.35	67.77	71.37	54.29
Ours
DoctoBERT-fr	68.39	62.54	62.75	84.60	73.36	66.41	72.56	98.17	97.14
DoctoModernBERT-fr	65.71	59.65	59.62	84.06	71.87	63.81	71.60	83.15	75.71

Real-world Clinical NER

This is a proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 seeds on the test split.

Model	Precision	Recall	F1
English medical
BioBERT	77.54	78.42	77.97
BioClinical-ModernBERT	78.79	78.69	78.74
ModernBERT-bio	78.06	79.30	78.67
French generalist
CamemBERT	77.19	79.58	78.36
ModernCamemBERT	78.53	78.71	78.62
French medical
DrBERT	76.77	77.81	77.28
CamemBERT-bio	77.51	78.90	78.19
TransBERT-bio-fr	76.85	78.66	77.74
ModernCamemBERT-bio	78.17	79.76	78.95
Ours
DoctoBERT-fr	77.29	79.68	78.47
DoctoModernBERT-fr	79.12	79.71	79.40

⚠️ Intended Use & Limitations

DoctoModernBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval, long-document tasks), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

⚖️ License

Released under Apache-2.0. DoctoModernBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.

Paper: Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining (2606.22079)