DoctoBERT-fr-base

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

DoctoBERT-fr-base is a French medical encoder for biomedical and clinical NLP. It uses the RoBERTa architecture (111M parameters, 512-token context) and is pretrained from scratch on FineMed-fr and FineMed-rephrased-fr. DoctoBERT leads the academic DrBenchmark, topping both aggregate metrics and five of seven tasks.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.
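The filtering step can be pictured with a small sketch. The thresholds (educational quality >= 4, medical-term density >= 0.1) are the ones stated in the Training section; the `AnnotatedDoc` record, its field names, and the `filter_entity_rich` helper are hypothetical and only illustrate the idea, not the actual FineMed pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AnnotatedDoc:
    """Hypothetical record holding the three annotation axes."""
    text: str
    subdomain: str
    educational_quality: int      # assumed to be an integer score
    medical_term_density: float   # assumed to be a fraction in [0, 1]


def filter_entity_rich(
    docs: Iterable[AnnotatedDoc],
    min_quality: int = 4,
    min_density: float = 0.1,
) -> List[AnnotatedDoc]:
    """Keep documents that pass both thresholds."""
    return [
        d for d in docs
        if d.educational_quality >= min_quality
        and d.medical_term_density >= min_density
    ]


docs = [
    AnnotatedDoc("Compte rendu de cardiologie ...", "clinical", 5, 0.22),
    AnnotatedDoc("Article grand public ...", "biomedical", 3, 0.40),
    AnnotatedDoc("Page de forum ...", "other", 4, 0.03),
]
kept = filter_entity_rich(docs)  # only the first document passes both thresholds
```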

For long-context use, see its sibling DoctoModernBERT-fr-base.

🚀 How to Use

Fill-mask

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctobert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a sentence that contains exactly one mask token.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)

# Locate the mask position and decode the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))

Using a pipeline:

from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctobert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))

Fine-tuning

DoctoBERT fine-tunes like any BERT/RoBERTa encoder: attach the appropriate task head or use your usual fine-tuning framework.
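As one illustration, a token-classification (NER) setup might look like the sketch below. It assumes the checkpoint ships a fast tokenizer (so that `word_ids()` is available on the encoding); `align_labels` and `build_ner_model` are hypothetical helper names, and the training loop itself is left to whichever framework you use.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Map word-level labels onto subword tokens.

    Only the first subword of each word keeps its label; special tokens
    (word id None) and continuation subwords get `ignore_index`, which
    the cross-entropy loss skips.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


def build_ner_model(num_labels: int, model_id: str = "doctolib-lab/doctobert-fr-base"):
    """Load the encoder with a freshly initialised token-classification head."""
    # Imported lazily so align_labels stays usable without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=num_labels)
    return tokenizer, model
```

With a fast tokenizer, `word_ids` would come from `tokenizer(words, is_split_into_words=True).word_ids()`, and the aligned labels go into the `labels` field of each training example.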

📐 Model Overview

| Property | Value |
|---|---|
| Architecture | RoBERTa |
| Parameters | 111M total (85M backbone, 26M embeddings) |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP | GELU |
| Intermediate size | 3072 |
| Context window | 512 tokens |
| Vocabulary size | 32,768 |
| Language | French |
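The parameter split can be sanity-checked from the dimensions above. The sketch below is a back-of-the-envelope count that assumes a standard RoBERTa layout: 514 position embeddings (512 plus the usual offset of 2), a single token-type embedding, biases on all linear layers, and no pooler or separate LM head. Those details are assumptions, not taken from the released config.

```python
VOCAB, HIDDEN, LAYERS, FFN = 32_768, 768, 12, 3_072
MAX_POSITIONS = 514  # assumption: RoBERTa convention (512 + 2)
TYPE_VOCAB = 1       # assumption: single token-type embedding

# Embedding block: token, position and token-type tables plus one LayerNorm.
embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + 2 * HIDDEN

# One transformer layer: Q, K, V and output projections, the two MLP
# projections, and two LayerNorms (weight + bias each).
attention = 4 * (HIDDEN * HIDDEN + HIDDEN)
mlp = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
layer_norms = 2 * 2 * HIDDEN
backbone = LAYERS * (attention + mlp + layer_norms)

total = embeddings + backbone
print(f"embeddings {embeddings / 1e6:.1f}M, backbone {backbone / 1e6:.1f}M, total {total / 1e6:.1f}M")
# embeddings 25.6M, backbone 85.1M, total 110.6M
```

Under these assumptions the counts round to the 26M, 85M, and 111M figures in the table.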

🔧 Training

The tokenizer is a SentencePiece BPE model of 32,768 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.
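To make the last two behaviours concrete, here is a toy, character-level illustration (it does not perform BPE merges and is not the actual tokenizer). In SentencePiece, these behaviours correspond to the `split_digits` and `byte_fallback` trainer options; the `toy_tokenize` function and its tiny vocabulary are hypothetical.

```python
from typing import List, Set


def toy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Character-level stand-in showing digit splitting and byte fallback."""
    tokens: List[str] = []
    for ch in text:
        if ch in "0123456789" or ch in vocab:
            # Each digit is its own token; known characters pass through.
            tokens.append(ch)
        else:
            # Unseen character: fall back to one token per UTF-8 byte,
            # written in SentencePiece's <0xXX> style.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# "\u00b5" is the micro sign, deliberately absent from the toy vocabulary.
print(toy_tokenize("120 \u00b5g", {" ", "g"}))
# ['1', '2', '0', ' ', '<0xC2>', '<0xB5>', 'g']
```

Splitting digits keeps dosages and lab values from being memorised as arbitrary multi-digit tokens, and byte fallback means no input character ever maps to an unknown token.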

DoctoBERT is pretrained from scratch over two phases on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr:

  1. Pretraining (500B tokens). Masked language modeling on the full mix, building broad medical-language representations. Long documents are hierarchically chunked to fit the context window rather than truncated.
  2. Annealing (200B tokens). Continued training on the biomedical & clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.
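The chunking hierarchy used in pretraining is not detailed here, so the following is only a minimal sketch of the general idea: split on the coarsest boundary first, descend to finer boundaries only for pieces that are still too long, then greedily repack neighbours. The boundary order (paragraphs, lines, sentences, words) and the whitespace token counter are assumptions; in practice `count_tokens` would wrap the real tokenizer.

```python
import re
from typing import Callable, List

# Assumed hierarchy, coarsest first: paragraphs, lines, sentences, words.
SEPARATORS = [r"\n\s*\n", r"\n", r"(?<=[.!?])\s+", r"\s+"]


def count_words(text: str) -> int:
    """Stand-in token counter; swap in a tokenizer-based count in practice."""
    return len(text.split())


def chunk(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_words,
    level: int = 0,
) -> List[str]:
    """Split `text` into pieces of at most `max_tokens`, coarsest boundary first."""
    if count_tokens(text) <= max_tokens or level == len(SEPARATORS):
        return [text.strip()]

    # Split at the current level and recurse into parts that are still too long.
    pieces: List[str] = []
    for part in re.split(SEPARATORS[level], text):
        if part.strip():
            pieces.extend(chunk(part, max_tokens, count_tokens, level + 1))

    # Greedily merge adjacent pieces while they still fit in the window.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "A b c. D e f g.\n\nH i j k l m n o p."
chunks = chunk(doc, max_tokens=4)
```

Unlike truncation, every token of the document ends up in some chunk, and short sentences stay intact wherever they fit.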

📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoBERT-fr-base achieves the best DrBenchmark aggregate scores, leading five of seven tasks. On the real-world clinical NER task, scores cluster tightly and its long-context sibling DoctoModernBERT-fr-base achieves the best F1.

DrBenchmark

We use an adapted DrBenchmark, filtered to 7 high-quality tasks drawn from five datasets: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts, two tasks), E3C (clinical NER and temporal NER, two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and we report the mean test-split F1 over 5 seeds.

We also report two per-model scores aggregated across all tasks. Min-Max rescales each task's scores onto a 0–100 scale before averaging; it captures the magnitude of the gaps, so a model that lags on a few tasks is penalized heavily. WP (Win Probability) is the percentage of tasks on which a model outscores another model, averaged over all other models; it captures rank consistency and is robust to outliers but blind to effect size.

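| Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MorFITT | DEFT-2021 | DiaMed | Min-Max | WP |
|---|---|---|---|---|---|---|---|---|---|
| English medical | | | | | | | | | |
| BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| French generalist | | | | | | | | | |
| CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| French medical | | | | | | | | | |
| DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| Ours | | | | | | | | | |
| DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |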
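The two aggregate columns can be recomputed from the per-task F1 scores. The sketch below follows the definitions given above, with a strict "greater than" for a win (tie handling is an assumption, and no ties occur here). Because it starts from scores already rounded to two decimals, results may differ from the reported aggregates in the last digit.

```python
from typing import Dict, List

# Per-task F1, in the order: EMEA, MEDLINE, E3C-Clin, E3C-Temp, MorFITT, DEFT-2021, DiaMed.
SCORES: Dict[str, List[float]] = {
    "BioBERT":                [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio":         [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT":              [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT":        [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT":                 [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio":          [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr":       [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio":    [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr":           [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr":     [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}


def min_max(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Rescale each task to 0-100 across models, then average per model."""
    n_tasks = len(next(iter(scores.values())))
    totals = {model: 0.0 for model in scores}
    for t in range(n_tasks):
        column = [task_scores[t] for task_scores in scores.values()]
        low, high = min(column), max(column)
        for model, task_scores in scores.items():
            totals[model] += 100.0 * (task_scores[t] - low) / (high - low)
    return {model: total / n_tasks for model, total in totals.items()}


def win_probability(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Percentage of (task, opponent) pairs in which a model scores strictly higher."""
    n_tasks = len(next(iter(scores.values())))
    n_pairs = n_tasks * (len(scores) - 1)
    result = {}
    for model, own in scores.items():
        wins = sum(
            own[t] > other[t]
            for name, other in scores.items() if name != model
            for t in range(n_tasks)
        )
        result[model] = 100.0 * wins / n_pairs
    return result


mm, wp = min_max(SCORES), win_probability(SCORES)
print(f"DoctoBERT-fr: Min-Max {mm['DoctoBERT-fr']:.2f}, WP {wp['DoctoBERT-fr']:.2f}")
```

For DoctoBERT-fr, the 97.14 WP corresponds to winning 68 of the 70 task-and-opponent pairs: it is outscored only by CamemBERT-bio on E3C-Temp and by TransBERT-bio-fr on MorFITT.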

Real-world Clinical NER

A proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 test-split seeds.

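| Model | Precision | Recall | F1 |
|---|---|---|---|
| English medical | | | |
| BioBERT | 77.54 | 78.42 | 77.97 |
| BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| French generalist | | | |
| CamemBERT | 77.19 | 79.58 | 78.36 |
| ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| French medical | | | |
| DrBERT | 76.77 | 77.81 | 77.28 |
| CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| Ours | | | |
| DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |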

⚠️ Intended Use & Limitations

DoctoBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

⚖️ License

Released under Apache-2.0. DoctoBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
