Continued, off-premise pre-training of MedRoBERTa.nl on about 50 GB of open Dutch and translated-English corpora, followed by on-premise pre-training on 5 GB of Electronic Health Records mixed with 2 GB of the public set.
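The resulting model can be used as a standard RoBERTa-style masked language model. The snippet below is a minimal usage sketch, assuming the model identifier of this repository and the standard RoBERTa `<mask>` token; the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load the model as a fill-mask pipeline (RoBERTa-style masked-language-model head).
fill_mask = pipeline("fill-mask", model="UMCU/CardioBERTa.nl_clinical")

# Illustrative Dutch clinical sentence; <mask> is the RoBERTa mask token.
print(fill_mask("De patiënt werd opgenomen met <mask> op de borst."))
```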
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- Dutch: Cardiovascular Electronic Health Records
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All English sources not translated with DeepL were translated with a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200 (see the translation sketch below).
- Number of tokens: 20B
- Number of documents: 32M
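To illustrate the translation step, the sketch below shows how English medical text could be translated to Dutch with NLLB-200, one of the systems listed above, via the transformers library. The checkpoint name and example sentence are assumptions for illustration; this is not the exact pipeline used for the corpus.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint for illustration: the distilled 600M NLLB-200 variant.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "The patient was admitted with acute chest pain."  # illustrative sentence
inputs = tokenizer(text, return_tensors="pt")

# Force Dutch as the target language via its NLLB language-code token.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("nld_Latn"),
    max_new_tokens=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```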
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3 (off-premise), followed by 3 (on-premise)
- Train perplexity: 2.5
- Validation perplexity: 3.4
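For reference, the sketch below shows one way the hyperparameters listed above could be configured for continued masked-language-model pre-training with the transformers Trainer. Paths, the per-device batch size, the accumulation factor and the toy corpus are placeholders; the actual TPU setup, data pipeline and corpus mixing are not reproduced here.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "CLTL/MedRoBERTa.nl"  # continued pre-training starts from this checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Toy corpus standing in for the (much larger) pre-training data.
texts = ["De patiënt heeft hypertensie.", "Het ECG toont atriumfibrilleren."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modelling objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cardioberta_continued",   # placeholder path
    per_device_train_batch_size=32,       # placeholder; combined with accumulation below
    gradient_accumulation_steps=160,      # 32 * 160 = 5120 effective batch size (single device)
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5000,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```

For a masked language model, perplexity is the exponential of the cross-entropy loss, so the reported validation perplexity of 3.4 corresponds to a validation loss of about ln(3.4) ≈ 1.22.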
Acknowledgement
This work was done together with the Amsterdam UMC, in the context of the DataTools4Heart project.
We are grateful for access to the Google TPU Research Cloud, which was used to train the model.