---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes
MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.
To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.
## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.
The model:

- Captures clinical terminology and documentation patterns
- Learns cross-lingual representations for medical text
- Performs robustly across diverse healthcare datasets
## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)

- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian
This dataset reflects real-world clinical practice and is the core contribution of this work.
### 2. Biomedical Literature (PubMed)

- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.
### 3. General-Domain Text (Wikipedia)

- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.
### Total

- ~2.7M documents, >1.2B tokens
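As a quick sanity check, the totals are consistent with the per-source figures listed above (all values copied from this card):

```python
# Cross-check the corpus totals from the three sources listed above.
docs = 1.42e6 + 1.25e6 + 5.8e3   # BRIDGE + PubMed + Wikipedia documents
tokens = 995e6 + 194e6 + 43e6    # tokens from the same three sources

print(f"{docs / 1e6:.2f}M documents, {tokens / 1e9:.3f}B tokens")
# → 2.68M documents, 1.232B tokens
```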
## Pretraining Strategy

We adopt a two-stage domain-adaptive pretraining approach:

### Stage 1: Mixed-domain pretraining

- Data: BRIDGE + PubMed + Wikipedia
- Goal: inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation

- Data: BRIDGE only
- Goal: learn fine-grained clinical language patterns
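The staged data selection can be sketched as follows. This is a minimal illustration of the schedule described above; the `build_stage_corpus` helper, the corpus dictionary keys, and the toy documents are assumptions for illustration, not the authors' actual pipeline:

```python
# Hypothetical sketch of the two-stage data schedule; function and key
# names are illustrative, not taken from the authors' code.

def build_stage_corpus(stage: int, corpora: dict) -> list:
    """Return the set of pretraining documents used in a given stage."""
    if stage == 1:
        # Stage 1: mix clinical (BRIDGE), biomedical (PubMed),
        # and general-domain (Wikipedia) text.
        return corpora["bridge"] + corpora["pubmed"] + corpora["wikipedia"]
    if stage == 2:
        # Stage 2: continue pretraining on clinical notes only.
        return corpora["bridge"]
    raise ValueError(f"unknown stage: {stage}")

corpora = {
    "bridge": ["Patient admitted with acute chest pain."],
    "pubmed": ["Aspirin inhibits platelet aggregation."],
    "wikipedia": ["Madrid is the capital of Spain."],
}

stage1 = build_stage_corpus(1, corpora)  # mixed-domain: all three sources
stage2 = build_stage_corpus(2, corpora)  # clinical-only: BRIDGE
```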
### Objective

- Masked Language Modeling (MLM)
- 15% token masking
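The MLM objective can be illustrated with a minimal masking routine. The 80/10/10 replacement split below follows the original BERT recipe and is an assumption here; the card only states that 15% of tokens are masked:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style MLM masking sketch: select ~mask_prob of positions;
    replace each with [MASK] 80% of the time, a random token 10% of the
    time, or leave it unchanged 10% of the time. `labels` keeps the
    original token at selected positions so the model can be trained
    to predict it."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # position contributes to the MLM loss
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(tokens))  # toy stand-in vocabulary
            else:
                inputs.append(tok)
        else:
            labels.append(None)           # ignored by the loss
            inputs.append(tok)
    return inputs, labels

tokens = "the patient was admitted with acute chest pain".split()
inputs, labels = mask_tokens(tokens)
```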
## Evaluation

We evaluate MultiClinicalBERT on 11 clinical NLP tasks across 5 languages:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS
### Key Results

- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch's t-test, p < 0.05)

Example results:

- MedNLI: 83.90% accuracy
- CEMR: 86.38% accuracy
- IFMIR NER: 85.53 F1
- RuMedNLI: 78.31% accuracy
## Key Contributions

- First BERT model pretrained on multilingual real-world clinical notes
- Large-scale clinical corpus (BRIDGE) spanning diverse languages
- Effective two-stage domain-adaptation strategy
- Strong performance across multiple languages and tasks
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems
## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```
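Since the card is tagged `fill-mask`, the model should also work through the Transformers `fill-mask` pipeline; the input sentence below is an illustrative example, not from the evaluation data:

```python
from transformers import pipeline

# Fill-mask inference with the pretrained MLM head.
fill = pipeline("fill-mask", model="YLab-Open/MultiClinicalBERT")

# Each prediction is a dict with the filled token and its score.
for pred in fill("The patient was treated with [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```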