DoctoBERT-fr-base

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

DoctoBERT-fr-base is a French medical encoder for biomedical and clinical NLP. It uses the RoBERTa architecture (111M parameters, 512-token context) and is pretrained from scratch on FineMed-fr and FineMed-rephrased-fr. DoctoBERT leads the academic DrBenchmark, topping both aggregate metrics and five of seven tasks.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.

For long-context use, see its sibling DoctoModernBERT-fr-base.

🚀 How to Use

Fill-mask

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctobert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

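# Build the input with the tokenizer's own mask token rather than hard-coding it.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Locate the masked position and decode the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))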

Using a pipeline:

from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctobert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))

Fine-tuning

DoctoBERT fine-tunes like any BERT/RoBERTa encoder, with the appropriate task head or framework:

For sequence classification, load it with AutoModelForSequenceClassification (see the text-classification guide).
For token classification (NER), use AutoModelForTokenClassification (see the token-classification guide).
For embeddings / retrieval, use Sentence Transformers or PyLate.
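As a starting point for the token-classification route above, here is a minimal sketch of attaching an NER head. The BIO tagging scheme, the helper names `build_bio_label_maps` and `load_for_ner`, and the example entity types are assumptions made for illustration; they are not taken from the authors' fine-tuning code.

```python
def build_bio_label_maps(entity_types):
    """Return (id2label, label2id) for a BIO tagging scheme (an assumed scheme)."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_ner(entity_types, model_id="doctolib-lab/doctobert-fr-base"):
    """Load the encoder with a freshly initialized token-classification head."""
    # Imported lazily so the label helper above runs without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_bio_label_maps(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical entity types, chosen only to show the shape of the mapping.
id2label, label2id = build_bio_label_maps(["pathology", "drug"])
print(id2label)
# {0: 'O', 1: 'B-pathology', 2: 'I-pathology', 3: 'B-drug', 4: 'I-drug'}
```

Calling `load_for_ner(["pathology", "drug"])` downloads the checkpoint and returns a model whose classification head is randomly initialized, so it still needs to be trained on labeled data before use.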

📐 Model Overview

Property	Value
Architecture	RoBERTa
Parameters	111M total (85M backbone, 26M embeddings)
Layers	12
Hidden size	768
Attention heads	12
MLP activation	GELU
Intermediate size	3072
Context window	512 tokens
Vocabulary size	32,768
Language	French
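The parameter split in the table can be roughly cross-checked from the other rows. The sketch below assumes a standard BERT/RoBERTa layer layout with biases, 514 position embeddings (the usual RoBERTa convention for a 512-token window), and a single token-type embedding; none of these details are stated in the table, so treat the result as an estimate.

```python
def roberta_param_counts(vocab=32_768, hidden=768, layers=12,
                         intermediate=3_072, positions=514, token_types=1):
    """Estimate (embedding, backbone) parameter counts for a RoBERTa-style encoder."""
    layer_norm = 2 * hidden  # weight + bias
    # Token, position, and token-type embeddings, plus the embedding LayerNorm.
    embeddings = (vocab + positions + token_types) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the MLP, each with a bias.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden)
    # Each layer has one LayerNorm after attention and one after the MLP.
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings, layers * per_layer


embeddings, backbone = roberta_param_counts()
print(f"{embeddings / 1e6:.1f}M {backbone / 1e6:.1f}M")  # 25.6M 85.1M
```

Under these assumptions the estimate is about 25.6M embedding and 85.1M backbone parameters, roughly 110.6M before any task head, which is consistent with the rounded figures in the table.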

🔧 Training

The tokenizer is a SentencePiece BPE model of 32,768 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.
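The two tokenizer behaviors mentioned above can be illustrated in plain Python. This is only a sketch of the effect, not the tokenizer itself: the real model also applies learned BPE merges, and the helper names `split_digits` and `byte_fallback` are hypothetical. The `<0xNN>` notation follows the way SentencePiece writes byte pieces.

```python
import re


def split_digits(text):
    """Split text so that every digit becomes its own piece."""
    return [piece for piece in re.split(r"(\d)", text) if piece]


def byte_fallback(char):
    """Spell one out-of-vocabulary character as its UTF-8 byte pieces."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


print(split_digits("500mg"))  # ['5', '0', '0', 'mg']
# Assuming, for illustration only, that this symbol is missing from the vocabulary:
print(byte_fallback("≥"))     # ['<0xE2>', '<0x89>', '<0xA5>']
```

Splitting digits keeps dosages and lab values from consuming vocabulary entries for arbitrary numbers, and byte fallback means a rare symbol is encoded as a few byte tokens instead of a single unknown token.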

DoctoBERT is pretrained from scratch over two phases on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr:

Pretraining (500B tokens). Masked-language-modeling on the full mix (long documents are hierarchically chunked to fit the context window rather than truncated), building broad medical-language representations.
Annealing (200B tokens). Continued training on the biomedical & clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.
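The data selection behind the two phases can be sketched as follows. The thresholds come from the filter described above; the `Doc` fields, the function names, and the subdomain label strings are hypothetical stand-ins, since the annotation schema is not given here.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    subdomain: str       # hypothetical label, e.g. "clinical"
    edu_quality: int     # educational-quality score
    term_density: float  # medical-term density


def is_entity_rich(doc, min_quality=4, min_density=0.1):
    """Apply the educational-quality and medical-term-density filter."""
    return doc.edu_quality >= min_quality and doc.term_density >= min_density


def build_phases(annotated, rephrased,
                 annealing_subdomains=("biomedical", "clinical")):
    """Return (pretraining_mix, annealing_subset) as lists of documents."""
    filtered = [doc for doc in annotated if is_entity_rich(doc)]
    # Phase 1 trains on the filtered documents plus the rephrased variants.
    pretraining = filtered + list(rephrased)
    # Phase 2 keeps only the subdomains closest to downstream medical use.
    annealing = [doc for doc in pretraining
                 if doc.subdomain in annealing_subdomains]
    return pretraining, annealing
```

This only shows which documents reach each phase; mixing ratios, chunking, and token budgets are outside the scope of the sketch.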

📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoBERT-fr-base achieves the best DrBenchmark aggregate scores, leading five of seven tasks. On the real-world clinical NER task, scores cluster tightly and its long-context sibling DoctoModernBERT-fr-base achieves the best F1.

DrBenchmark

We use an adapted version of DrBenchmark, restricted to 7 high-quality tasks drawn from five datasets: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts, two tasks), E3C (clinical NER and temporal NER, two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and we report the mean F1 over 5 seeds, measured on the test split. We also report two per-model scores aggregated across all tasks. Min-Max rescales each task's scores onto a 0–100 scale before averaging over tasks; it captures the magnitude of the gaps, so a model that lags on a few tasks is penalized heavily. WP (Win Probability) is the percentage of pairwise comparisons, one per task and per other model, in which a model scores higher; it captures rank consistency, and is robust to outliers but blind to effect size.

Model	EMEA	MEDLINE	E3C-Clin	E3C-Temp	MORFITT	DEFT2021	DIAMED	Min-Max	WP
English medical
BioBERT	58.77	50.29	55.02	78.29	66.99	56.72	59.26	29.97	15.71
BioClinical-ModernBERT	44.74	44.44	49.53	76.11	67.42	53.97	52.07	0.88	1.43
ModernBERT-bio	56.84	46.60	53.76	78.85	68.57	56.43	61.06	29.35	17.14
French generalist
CamemBERT	65.43	56.18	59.82	83.81	71.54	62.40	60.26	69.37	57.14
ModernCamemBERT	61.98	55.46	57.62	83.11	70.01	60.01	53.26	52.69	28.57
French medical
DrBERT	64.37	57.18	58.01	82.44	70.42	61.08	64.87	65.08	44.29
CamemBERT-bio	64.98	59.03	61.40	84.88	71.48	64.73	64.63	80.83	70.00
TransBERT-bio-fr	67.37	59.96	62.36	84.48	74.04	65.48	70.91	93.88	88.57
ModernCamemBERT-bio	65.35	56.81	58.63	83.31	71.21	61.35	67.77	71.37	54.29
Ours
DoctoBERT-fr	68.39	62.54	62.75	84.60	73.36	66.41	72.56	98.17	97.14
DoctoModernBERT-fr	65.71	59.65	59.62	84.06	71.87	63.81	71.60	83.15	75.71
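The two aggregate columns can be approximately recomputed from the per-task scores in the table. The sketch below is one reading of the Min-Max and WP definitions, not the authors' evaluation code. For DoctoBERT-fr it returns 98.17 and 97.14, matching the table; for other rows Min-Max can differ by a few tenths of a point, presumably because the published aggregates were computed from unrounded or per-seed scores.

```python
# F1 scores copied from the DrBenchmark table, in column order:
# EMEA, MEDLINE, E3C-Clin, E3C-Temp, MORFITT, DEFT2021, DIAMED
SCORES = {
    "BioBERT": [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio": [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT": [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT": [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT": [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio": [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr": [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio": [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr": [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr": [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}


def min_max(scores):
    """Rescale each task to 0-100 across models, then average over tasks."""
    n_tasks = len(next(iter(scores.values())))
    totals = dict.fromkeys(scores, 0.0)
    for task in range(n_tasks):
        column = [row[task] for row in scores.values()]
        low, high = min(column), max(column)  # assumes high > low
        for model, row in scores.items():
            totals[model] += 100.0 * (row[task] - low) / (high - low)
    return {model: total / n_tasks for model, total in totals.items()}


def win_probability(scores):
    """Percentage of (task, other model) comparisons that each model wins."""
    n_tasks = len(next(iter(scores.values())))
    n_comparisons = n_tasks * (len(scores) - 1)
    result = {}
    for model, row in scores.items():
        wins = sum(
            row[task] > other_row[task]
            for other, other_row in scores.items() if other != model
            for task in range(n_tasks)
        )
        result[model] = 100.0 * wins / n_comparisons
    return result


mm, wp = min_max(SCORES), win_probability(SCORES)
print(round(mm["DoctoBERT-fr"], 2), round(wp["DoctoBERT-fr"], 2))  # 98.17 97.14
```

The contrast between the two metrics shows up directly: DoctoBERT-fr loses only 2 of its 70 pairwise comparisons, which WP records as 97.14, while Min-Max also reflects how narrow those two losses are.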

Real-world Clinical NER

This is a proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 seeds, measured on the test split.

Model	Precision	Recall	F1
English medical
BioBERT	77.54	78.42	77.97
BioClinical-ModernBERT	78.79	78.69	78.74
ModernBERT-bio	78.06	79.30	78.67
French generalist
CamemBERT	77.19	79.58	78.36
ModernCamemBERT	78.53	78.71	78.62
French medical
DrBERT	76.77	77.81	77.28
CamemBERT-bio	77.51	78.90	78.19
TransBERT-bio-fr	76.85	78.66	77.74
ModernCamemBERT-bio	78.17	79.76	78.95
Ours
DoctoBERT-fr	77.29	79.68	78.47
DoctoModernBERT-fr	79.12	79.71	79.40

⚠️ Intended Use & Limitations

DoctoBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

⚖️ License

Released under Apache-2.0. DoctoBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.

📄 Paper

Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining (2606.22079)