---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
base_model_relation: finetune
library_name: transformers
pipeline_tag: fill-mask
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
tags:
- modernbert
- distillation
- knowledge-distillation
- model-compression
- fill-mask
---
# modernbert-mini
DistilBERT-style distillation: the balanced, recommended general base.
A **compressed, fine-tunable base encoder** derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
It is **46.7% of the teacher's size** while keeping **92.9% of its GLUE quality**. Use it as a general base and
fine-tune it on your downstream task, exactly as you would ModernBERT-base.
## The family (one exercise)
All three models below were produced in **one ModernBERT compression exercise** that compared different compression methods, using the same teacher ([`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)), the same FineWeb-Edu corpus, and the same GLUE evaluation. **Pick the tier that fits your size/quality budget:**
- [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny) — 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
- [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini) ← **you are here** — 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
- [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite) — 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
## How it was made (general process)
1. **Teacher:** `answerdotai/ModernBERT-base` (149.7M params), the model every tier is derived from.
2. **General-corpus distillation:** the student learns from the teacher on **FineWeb-Edu** (general English web
text) using the `distilbert` recipe. No task-specific or domain-specific data is used, so it stays a general base.
3. **Evaluation:** quality is measured on four **GLUE** tasks (SST-2, MRPC, STS-B, RTE), with each model fine-tuned identically,
and reported purely as **% retained vs the teacher**.
## Scores (% against the ModernBERT-base teacher)
- **Size:** 281.2 MB → **46.7% of baseline** (params 69.4M)
- **GLUE quality retained:** **92.9%**
- **eff_score:** 73.1 out of 100, where `eff_score = 0.5 · GLUE_retention% + 0.5 · size_reduction%` and `size_reduction% = 100 − size vs base` (higher is better)
### Full tier comparison
| model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
|---|---|---|---|---|---|
| `ModernBERT-base` (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
| `modernbert-tiny` | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
| **modernbert-mini** ⭐ | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
| `modernbert-lite` | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
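The `eff_score` column can be recomputed from the two percentage columns. The sketch below assumes that `size_reduction%` means `100 − size vs base`, an interpretation that matches the published values to within rounding; the function and variable names are illustrative and not part of any released code.

```python
# Recompute eff_score from the table's percentage columns.
# Assumption: size_reduction% = 100 - (size vs base %).

def eff_score(glue_retained_pct, size_vs_base_pct):
    """Equal-weight blend of quality retained and size saved (0-100 scale)."""
    size_reduction_pct = 100.0 - size_vs_base_pct
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct

# (size vs base %, GLUE vs base %) copied from the table above.
tiers = {
    "ModernBERT-base": (100.0, 100.0),
    "modernbert-tiny": (15.3, 80.4),
    "modernbert-mini": (46.7, 92.9),
    "modernbert-lite": (50.3, 99.3),
}

scores = {name: eff_score(glue, size) for name, (size, glue) in tiers.items()}

for name, score in scores.items():
    print(f"{name:16s} {score:.2f}")
```

Because the two terms are weighted equally, an uncompressed model with full quality scores 50, and a tier only beats that if it saves more size than it loses in GLUE retention.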
## Methods & architecture (each tier)
Every tier derives from the **same teacher** but uses a different compression method:
### `modernbert-tiny`
*4 transformer layers, hidden size 312, 12 heads (~22M params)*
**TinyBERT-style distillation.** A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower/shallower network recover usable quality.
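As a rough illustration of two of these signals, the dependency-free sketch below computes a hidden-state loss on L2-normalized vectors and an attention-map loss for toy inputs. Several details are assumptions rather than facts from this card: the use of mean squared error for both terms, the linear `projection` that maps the narrower student vector to the teacher's width, and every function name.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def project(matrix, vec):
    """Map a student vector to teacher width; matrix has one row per teacher dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def hidden_loss(teacher_hidden, student_hidden, projection):
    """MSE between L2-normalized teacher and projected student hidden states."""
    projected = project(projection, student_hidden)
    return mse(l2_normalize(teacher_hidden), l2_normalize(projected))

def attention_loss(teacher_attn, student_attn):
    """Mean MSE over attention probability rows (one row per query token)."""
    rows = [mse(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)

# Toy example: teacher width 3, student width 2 (hypothetical values).
projection = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
teacher_hidden = [3.0, 4.0, 0.0]
aligned = hidden_loss(teacher_hidden, [6.0, 8.0], projection)     # same direction
misaligned = hidden_loss(teacher_hidden, [1.0, 2.0], projection)  # different direction

teacher_attn = [[0.7, 0.3], [0.2, 0.8]]
student_attn = [[0.5, 0.5], [0.2, 0.8]]
attn = attention_loss(teacher_attn, student_attn)
```

Normalizing before comparing makes the hidden-state term depend only on direction, so a student vector that is a scaled copy of the teacher's incurs no penalty. Since the student keeps 12 heads, its attention maps can be compared with the teacher's head by head without any projection.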
### `modernbert-mini` ⭐
*6 transformer layers, hidden size 768 (~69M params)*
**DistilBERT-style distillation.** The 6-layer student is initialized from evenly spaced teacher layers, then trained with masked-LM loss + soft-logit KL divergence + last-hidden-state cosine loss. Depth-only reduction (full width kept) retains 92.9% of the teacher's GLUE quality at 46.7% of its size.
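The recipe can be sketched for a single masked position without any framework. This is a minimal illustration, not the training code used for this model: the exact layer mapping, the temperature, and the loss weights (`w_mlm`, `w_kl`, `w_cos`) are assumed values, and all function names are hypothetical.

```python
import math

def evenly_spaced_layers(n_teacher, n_student):
    """Pick n_student teacher layer indices spread from the first to the last layer."""
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(student_logits, target_id):
    """Cross-entropy of the student against the true masked token."""
    return -math.log(softmax(student_logits)[target_id])

def soft_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cosine_loss(teacher_hidden, student_hidden):
    """1 - cosine similarity between last-layer hidden vectors."""
    dot = sum(t * s for t, s in zip(teacher_hidden, student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)

def distil_loss(teacher_logits, student_logits, target_id,
                teacher_hidden, student_hidden,
                w_mlm=1.0, w_kl=1.0, w_cos=1.0):
    """Weighted sum of the three terms; the weights are illustrative."""
    return (w_mlm * mlm_loss(student_logits, target_id)
            + w_kl * soft_kl(teacher_logits, student_logits)
            + w_cos * cosine_loss(teacher_hidden, student_hidden))

# 22 teacher layers and 6 student layers, as stated in this card.
layers = evenly_spaced_layers(22, 6)

# Toy 3-token vocabulary and 2-dim hidden states (made-up numbers).
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, -0.5]
loss = distil_loss(teacher_logits, student_logits, 0, [1.0, 2.0], [2.0, 1.0])
```

With this spacing rule the student would start from teacher layers 0, 4, 8, 13, 17, and 21; the mapping actually used for `modernbert-mini` is not documented here and may differ. Because the student keeps the teacher's hidden size of 768, the cosine term can compare hidden states directly, with no projection layer.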
### `modernbert-lite`
*full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16*
**Half-precision (fp16) quantization.** No retraining: weights are cast to 16-bit floats, roughly halving storage and memory (602.2 MB to 302.9 MB) while retaining 99.3% of the teacher's GLUE quality. Re-load in fp32 (or bf16) to fine-tune.
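The storage saving and the size of the rounding error can both be estimated with the standard library alone. The sketch below uses the parameter count from this card; the example weight values are made up, and the helper names are illustrative.

```python
import struct

def to_fp16_and_back(value):
    """Round-trip a number through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def storage_mb(n_params, bytes_per_param):
    """Raw weight storage in MB (10^6 bytes), ignoring file metadata."""
    return n_params * bytes_per_param / 1e6

n_params = 149.7e6                  # parameter count stated in this card
fp32_mb = storage_mb(n_params, 4)   # 4 bytes per float32 weight
fp16_mb = storage_mb(n_params, 2)   # 2 bytes per float16 weight

# Made-up weight values in a range typical of trained networks.
weights = [0.0123, -0.4567, 1.2345, -0.0009]
max_rel_error = max(abs(to_fp16_and_back(w) - w) / abs(w) for w in weights)

print(f"fp32: {fp32_mb:.1f} MB, fp16: {fp16_mb:.1f} MB")
print(f"largest relative rounding error: {max_rel_error:.2e}")
```

The file sizes reported in this card are a few MB higher than these raw estimates. For values in fp16's normal range, the relative rounding error per weight stays below 2⁻¹¹ (about 0.05%), which is consistent with the small quality loss reported for this tier.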
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("codechrl/modernbert-mini")
model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-mini")
# fine-tune for your task:
# from transformers import AutoModelForSequenceClassification
# clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-mini", num_labels=N)
```
## Intended use & limitations
- **A base to fine-tune**, not a finished classifier.
- Distilled on a **small compute budget** (demo-grade); for production, redistill with more training steps and a larger corpus.
- `tiny` trades the most quality for the smallest size; `mini`/`lite` retain more.
## Citation
Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).