DoctoBERT-fr-base
🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT
📚 Introduction
DoctoBERT-fr-base is a French medical encoder for biomedical and clinical NLP. It uses the RoBERTa architecture (111M parameters, 512-token context) and is pretrained from scratch on FineMed-fr and FineMed-rephrased-fr. DoctoBERT leads the academic DrBenchmark, topping both aggregate metrics and five of seven tasks.
Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.
For long-context use, see its sibling DoctoModernBERT-fr-base.
🚀 How to Use
Fill-mask
Using AutoModelForMaskedLM:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctobert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a sentence containing exactly one mask token.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))
```
Using a pipeline:
```python
from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctobert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))
```
Fine-tuning
DoctoBERT fine-tunes like any BERT/RoBERTa encoder, with the appropriate task head or framework:
- For sequence classification, load it with `AutoModelForSequenceClassification` (see the text-classification guide).
- For token classification (NER), use `AutoModelForTokenClassification` (see the token-classification guide).
- For embeddings / retrieval, use Sentence Transformers or PyLate.
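As an illustration of the token-classification route, here is a minimal sketch that attaches a freshly initialised NER head to the encoder. The BIO tagging scheme, the two entity names, and the helper names `bio_label_maps` and `load_for_ner` are assumptions made for this example; they are not part of the model card.

```python
from typing import Dict, List, Tuple

MODEL_ID = "doctolib-lab/doctobert-fr-base"


def bio_label_maps(entity_types: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id2label / label2id mappings for a BIO tagging scheme (assumed here)."""
    labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_for_ner(entity_types: List[str], model_id: str = MODEL_ID):
    """Load the tokenizer and the encoder with an untrained token-classification head."""
    # Imported lazily so the label helper above works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = bio_label_maps(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical entity types, chosen only to show the label layout.
id2label, label2id = bio_label_maps(["pathology", "drug"])
print(id2label)
```

Calling `load_for_ner(["pathology", "drug"])` downloads the checkpoint and returns a model whose classification head still has to be trained on labelled data, for example with the `Trainer` API.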
📐 Model Overview
| Property | Value |
|---|---|
| Architecture | RoBERTa |
| Parameters | 111M total (85M backbone, 26M embeddings) |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP activation | GELU |
| Intermediate size | 3072 |
| Context window | 512 tokens |
| Vocabulary size | 32,768 |
| Language | French |
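The parameter split in the table can be roughly checked from the other rows. The sketch below assumes a standard BERT/RoBERTa layout that the card does not spell out: 514 position slots (RoBERTa's usual 512 plus a padding offset of 2), a single token-type embedding, biases on every linear layer, and no task head counted.

```python
# Values taken from the Model Overview table.
VOCAB, HIDDEN, LAYERS, INTERMEDIATE = 32_768, 768, 12, 3_072
# Assumed RoBERTa defaults, not stated in the card.
MAX_POSITIONS, TYPE_VOCAB = 514, 1


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Scale plus shift vectors."""
    return 2 * size


# Token, position, and token-type embeddings, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)

# One Transformer layer: Q, K, V, and output projections, then the two-layer MLP,
# with a LayerNorm after each sub-block.
per_layer = (
    4 * linear(HIDDEN, HIDDEN)
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)
    + linear(INTERMEDIATE, HIDDEN)
    + layer_norm(HIDDEN)
)
backbone = LAYERS * per_layer
total = embeddings + backbone

print(f"embeddings: {embeddings / 1e6:.1f}M")
print(f"backbone:   {backbone / 1e6:.1f}M")
print(f"total:      {total / 1e6:.1f}M")
```

Under these assumptions the counts round to the 26M, 85M, and 111M figures in the table.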
🔧 Training
The tokenizer is a SentencePiece BPE model of 32,768 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.
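The filter described above amounts to a conjunction of two thresholds. A minimal sketch follows; the record type, the field names `edu_quality` and `med_density`, and the sample documents are hypothetical, since the card does not describe the FineMed schema. Only the two threshold values come from the card.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_EDU_QUALITY = 4    # threshold from the card
MIN_MED_DENSITY = 0.1  # threshold from the card


@dataclass
class AnnotatedDoc:
    text: str
    edu_quality: int    # hypothetical field name for the educational-quality score
    med_density: float  # hypothetical field name for the medical-term density


def is_entity_rich(doc: AnnotatedDoc) -> bool:
    """Keep a document only if it passes both thresholds."""
    return doc.edu_quality >= MIN_EDU_QUALITY and doc.med_density >= MIN_MED_DENSITY


def filter_docs(docs: Iterable[AnnotatedDoc]) -> List[AnnotatedDoc]:
    return [doc for doc in docs if is_entity_rich(doc)]


# Made-up documents and scores, for illustration only.
sample = [
    AnnotatedDoc("Compte rendu d'hospitalisation", edu_quality=5, med_density=0.22),
    AnnotatedDoc("Article de blog généraliste", edu_quality=4, med_density=0.03),
    AnnotatedDoc("Message de forum", edu_quality=2, med_density=0.15),
]
kept = filter_docs(sample)
print([doc.text for doc in kept])
```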
DoctoBERT is pretrained from scratch over two phases on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr:
- Pretraining (500B tokens). Masked-language-modeling on the full mix (long documents are hierarchically chunked to fit the context window rather than truncated), building broad medical-language representations.
- Annealing (200B tokens). Continued training on the biomedical & clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.
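The card does not describe the hierarchical chunking procedure. The sketch below shows one plausible reading, not the authors' implementation: split at the coarsest boundary first (paragraph, then line, sentence, word) and descend a level only for pieces that are still too long. The function names and the whitespace word count used as a stand-in for token count are assumptions.

```python
import re
from typing import Callable, List

# Coarse-to-fine boundaries: paragraph, line, sentence, word. Zero-width
# lookbehinds keep each separator attached to the piece before it, so
# concatenating the chunks reproduces the input exactly.
BOUNDARIES = [r"(?<=\n\n)", r"(?<=\n)", r"(?<=\. )", r"(?<= )"]


def count_words(text: str) -> int:
    """Stand-in length measure; a real pipeline would count tokenizer tokens."""
    return len(text.split())


def chunk(
    text: str,
    max_len: int,
    length_fn: Callable[[str], int] = count_words,
    level: int = 0,
) -> List[str]:
    """Split text into pieces of at most max_len units, preferring coarse boundaries."""
    if length_fn(text) <= max_len or level >= len(BOUNDARIES):
        return [text]  # fits already, or no finer boundary is left to try
    chunks: List[str] = []
    current = ""
    for part in re.split(BOUNDARIES[level], text):
        if length_fn(current + part) <= max_len:
            current += part  # greedily pack neighbouring parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_fn(part) <= max_len:
            current = part
        else:
            # This part alone is too long: retry at the next finer boundary.
            chunks.extend(chunk(part, max_len, length_fn, level + 1))
    if current:
        chunks.append(current)
    return chunks


doc = "A b c. D e f.\n\nG h i j k l m n."
pieces = chunk(doc, max_len=6)
print(pieces)
```

To measure length in tokens instead of words, one could pass something like `length_fn=lambda s: len(tokenizer(s)["input_ids"])` with `max_len=512`, which also counts the special tokens the tokenizer adds.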
📊 Evaluation
Across 11 encoders from English medical, French generalist, and French medical families, DoctoBERT-fr-base achieves the best DrBenchmark aggregate scores, leading five of seven tasks. On the real-world clinical NER task, scores cluster tightly (F1 between 77.28 and 79.40), and its long-context sibling DoctoModernBERT-fr-base achieves the best F1.
DrBenchmark
We adapted DrBenchmark, keeping 7 high-quality tasks drawn from five datasets: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts, two tasks), E3C (clinical NER and temporal NER, two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and we report the mean F1 over 5 seeds on the test split. We also report two per-model scores aggregated across all tasks. Min-Max rescales each task's scores onto a 0–100 scale before averaging, so it captures the magnitude of the gaps (a model that lags on a few tasks is penalized heavily). WP (Win Probability) is the percentage of tasks on which a model outscores another model, averaged over all other models; it captures rank consistency (robust to outliers but blind to effect size).
| Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MORFITT | DEFT2021 | DIAMED | Min-Max | WP |
|---|---|---|---|---|---|---|---|---|---|
| English medical | |||||||||
| BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| French generalist | |||||||||
| CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| French medical | |||||||||
| DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| Ours | |||||||||
| DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |
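The two aggregate columns can be recomputed from the per-task scores. The sketch below follows the definitions given above, using the rounded values from the table; the function names are ours, and because the inputs are rounded, a recomputed value can differ from the reported one in the last digit.

```python
TASKS = ["EMEA", "MEDLINE", "E3C-Clin", "E3C-Temp", "MORFITT", "DEFT2021", "DIAMED"]

# Per-task F1 copied from the table above, in TASKS order.
SCORES = {
    "BioBERT": [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio": [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT": [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT": [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT": [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio": [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr": [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio": [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr": [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr": [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}


def min_max(scores):
    """Rescale each task to 0-100 across models, then average per model."""
    columns = list(zip(*scores.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        model: sum(100 * (s - lo) / (hi - lo) for s, lo, hi in zip(row, lows, highs)) / len(row)
        for model, row in scores.items()
    }


def win_probability(scores):
    """Percentage of (other model, task) comparisons that each model wins."""
    result = {}
    for model, row in scores.items():
        wins = sum(
            s > o
            for other, other_row in scores.items()
            if other != model
            for s, o in zip(row, other_row)
        )
        comparisons = (len(scores) - 1) * len(row)
        result[model] = 100 * wins / comparisons
    return result


mm = min_max(SCORES)
wp = win_probability(SCORES)
for model in SCORES:
    print(f"{model:24s} Min-Max {mm[model]:6.2f}  WP {wp[model]:6.2f}")
```

For DoctoBERT-fr this gives 68 wins out of 70 comparisons. Its two losses are E3C-Temp (to CamemBERT-bio) and MORFITT (to TransBERT-bio-fr), which is also why its Min-Max score falls just short of 100.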
Real-world Clinical NER
This is a proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 test-split seeds.
| Model | Precision | Recall | F1 |
|---|---|---|---|
| English medical | |||
| BioBERT | 77.54 | 78.42 | 77.97 |
| BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| French generalist | |||
| CamemBERT | 77.19 | 79.58 | 78.36 |
| ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| French medical | |||
| DrBERT | 76.77 | 77.81 | 77.28 |
| CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| Ours | |||
| DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |
⚠️ Intended Use & Limitations
DoctoBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.
⚖️ License
Released under Apache-2.0. DoctoBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.
🏛️ Acknowledgments
This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.