Clinical ModernBERT
Clinical ModernBERT is a state-of-the-art encoder-only transformer tailored to biomedical and clinical text. Building on the innovations introduced by ModernBERT, it extends the context window to 8,192 tokens and incorporates domain-specific vocabulary refinements. It is designed to produce semantically rich representations that capture both the nuanced syntax of biomedical literature and the intricate semantics of clinical narratives.
Usage
Pretrained model weights and tokenizer artifacts are provided to facilitate easy integration with your downstream biomedical NLP tasks:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
```
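For context, here is a minimal sketch of encoding a clinical note into a fixed-size embedding. The mean-pooling step and the example note are illustrative assumptions, not a pooling strategy prescribed by this card:

```python
import torch

# Tokenize a (potentially long) clinical note; the model supports up to 8,192 tokens.
note = "Patient admitted with chest pain. ECG showed ST elevation..."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into a single vector (an illustrative pooling choice).
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embedding.shape)  # (1, hidden_size)
```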
Model Overview
Below is a table summarizing ModernBERT's key architectural components and their benefits:
| Feature | Description | Benefit |
|---|---|---|
| Extended Context Length | Processes sequences up to 8,192 tokens. | Captures long-range dependencies and full document contexts, essential for complex linguistic tasks. |
| GeGLU Activation | Uses the GeGLU activation, a gated variant of GeLU. | Enhances non-linear representation and model stability by allowing controlled information flow. |
| Rotary Positional Embeddings | Implements RoPE to encode relative positional information. | Provides robust handling of positional data, especially beneficial for extended contexts. |
| Flash Attention | Employs Flash Attention to compute self-attention blockwise. | Reduces memory overhead from quadratic to near-linear complexity, enabling efficient processing of long sequences. |
The model leverages a suite of modern architectural advancements, including rotary positional embeddings (RoPE), Flash Attention for near-linear memory usage with extended contexts, and GeGLU activation layers that enhance representational capacity through smooth gating. Initialized from a ModernBERT-base checkpoint and pre-trained on approximately 40 million PubMed abstracts combined with MIMIC-IV clinical notes, Clinical ModernBERT is optimized for tasks such as retrieval-augmented generation, fine-grained text classification, and domain-specific entity extraction.
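To confirm the extended context window on your installed version of transformers, you can inspect the model configuration directly; attribute availability beyond `max_position_embeddings` may vary by version:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('Simonlee711/Clinical_ModernBERT')
print(config.max_position_embeddings)  # expected: 8192, per the table above
print(config)  # full architecture details (activation, attention, RoPE settings, etc.)
```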
Pre-training Optimizations
| Parameter | Value | Description |
|---|---|---|
| Total Tokens | 13,004,002,816 | Total number of tokens in the unified pre-training corpus |
| Pre-training Corpus | PubMed + MIMIC-IV + Medical Codes & Descriptions | Approximately 40M PubMed abstracts, MIMIC-IV clinical notes, and medical code and description pairs (e.g., ICD-9 code 250.00: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled) |
| Training Steps | 150,000 | Total number of masked language modeling (MLM) training steps |
| Batch Size | 128 | Batch size used during training |
The pre-training regimen of Clinical ModernBERT is distinguished by a dynamic, two-phase approach that is both computationally efficient and sensitive to the unique linguistic characteristics of clinical text. In the initial phase, the model trains on sequences limited to 128 tokens using large batches and an elevated learning rate. This phase is optimized with the StableAdamW optimizer, employing a cosine learning rate schedule with a 10% warmup ratio. Mixed-precision techniques are applied to accelerate training while preserving memory efficiency.
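As a rough illustration of this phase-1 setup, the sketch below pairs an AdamW-style optimizer with a cosine schedule, 10% warmup, and bfloat16 mixed precision. It substitutes `torch.optim.AdamW` as a stand-in for StableAdamW, and the learning rate, device handling, `model` (assumed to carry an MLM head), and `dataloader` are placeholders rather than the exact training configuration:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder hyperparameters; the released training configuration may differ.
total_steps = 150_000
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # stand-in for StableAdamW
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup ratio
    num_training_steps=total_steps,
)

model.cuda().train()
for batch in dataloader:  # assumed: MLM-collated batches of 128-token sequences with labels
    optimizer.zero_grad()
    # Mixed precision (bfloat16 autocast) to accelerate training and save memory.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**{k: v.cuda() for k, v in batch.items()}).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```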
Once the model has acquired robust short-span contextual embeddings, the sequence length is extended to 8,192 tokens—a critical modification that enables the model to capture the long-range dependencies present in full-length discharge summaries and radiology reports. In this extended phase, the scaling parameter of the rotary positional embeddings is adjusted from 10,000 to 160,000, ensuring that the increased context does not compromise the relative positioning of tokens. The batch size and learning rate are reduced to maintain stability during this more computationally demanding phase. Additionally, an adaptive sampling strategy is implemented to prioritize well-structured clinical narratives, thereby intensifying the learning signal from the most informative examples.
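A hedged sketch of how such a context extension might be configured through the transformers API is shown below. The exact attribute names (for example, whether the checkpoint exposes a single `rope_theta` or separate global/local values) depend on the ModernBERT implementation in your installed version, so treat them as assumptions to verify:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")

# Phase-2 settings described above (attribute names are assumptions to verify):
config.max_position_embeddings = 8192
if hasattr(config, "global_rope_theta"):
    config.global_rope_theta = 160_000.0   # raised from 10,000 for long contexts
elif hasattr(config, "rope_theta"):
    config.rope_theta = 160_000.0

model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base", config=config)
```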
Masked Language Modeling (MLM) Setup
Clinical ModernBERT is pre-trained using a multi-phase masked language modeling (MLM) strategy. A custom collator dynamically adjusts the masking probability, beginning at 30% and decreasing to 15% over the course of training, to emphasize medically relevant tokens (e.g., drug names, procedural codes); a sketch of such a collator is given after the evaluation table below. For a sequence $\mathbf{x}$ with masked positions $\mathcal{M}$, the MLM objective is defined as

$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in \mathcal{M}} \log p_\theta\left(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\right),$$
where the model predicts masked tokens given their unmasked context. The table below reports masked-token recovery accuracy at several top-k cutoffs:
| Metric | Top-1 Accuracy | Top-5 Accuracy | Top-10 Accuracy | Top-25 Accuracy |
|---|---|---|---|---|
| Value (%) | 63.31 | 79.67 | 83.33 | 88.10 |
These cutoffs capture the granularity at which the model's recovery of masked tokens is evaluated: higher top-k values reflect broader lexical recall, and the model consistently ranks clinically appropriate tokens among its top predictions.
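As noted above, here is a minimal sketch of a dynamic-masking collator, assuming a simple linear decay of the masking probability from 30% to 15% over training. The class name, schedule shape, and batch-level step counting are illustrative assumptions rather than the released implementation (which additionally emphasizes medically relevant tokens):

```python
from transformers import DataCollatorForLanguageModeling

class DecayingMaskCollator(DataCollatorForLanguageModeling):
    """MLM collator whose masking probability decays linearly over training."""

    def __init__(self, tokenizer, total_steps, start_p=0.30, end_p=0.15):
        super().__init__(tokenizer=tokenizer, mlm=True, mlm_probability=start_p)
        self.total_steps = total_steps
        self.start_p, self.end_p = start_p, end_p
        self.step = 0  # counts collated batches; assumes a single-process dataloader

    def __call__(self, examples):
        # Linearly interpolate the masking probability for the current step.
        frac = min(self.step / self.total_steps, 1.0)
        self.mlm_probability = self.start_p + frac * (self.end_p - self.start_p)
        self.step += 1
        return super().__call__(examples)

collator = DecayingMaskCollator(tokenizer, total_steps=150_000)
```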
Intended Use
Clinical ModernBERT is ideally suited for tasks that demand an in-depth understanding of biomedical language. It is particularly valuable for clinical information retrieval, narrative classification, and structured medical coding. Researchers and practitioners may fine-tune this model for specialized downstream applications such as electronic health record analysis, clinical decision support systems, and evidence-based medical literature retrieval.
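For example, a minimal fine-tuning sketch for a downstream clinical text classification task; the label count and task are placeholders, and training itself would use your labeled data (e.g., via the Hugging Face Trainer):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Simonlee711/Clinical_ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "Simonlee711/Clinical_ModernBERT",
    num_labels=3,  # placeholder: e.g., triage categories
)
# The encoder weights come from Clinical ModernBERT; the classification head is
# randomly initialized and must be trained on labeled downstream data.
```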
Citations and Pre-training Source Code
The source code can be found here: Clinical ModernBERT GitHub
Citing Model
@misc{simon_lee_2025,
author = { Simon Lee },
title = { Clinical_ModernBERT (Revision 24e72d6) },
year = 2025,
url = { https://huggingface.co/Simonlee711/Clinical_ModernBERT },
doi = { 10.57967/hf/4999 },
publisher = { Hugging Face }
}
Citing Paper
@misc{lee2025clinicalmodernbertefficientlong,
title={Clinical ModernBERT: An efficient and long context encoder for biomedical text},
author={Simon A. Lee and Anthony Wu and Jeffrey N. Chiang},
year={2025},
eprint={2504.03964},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.03964},
}
Questions
Email: simonlee711@g.ucla.edu
Base model: answerdotai/ModernBERT-base