Add arXiv details
README.md
CHANGED
@@ -9,7 +9,7 @@ license_link: >-
 
 Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) model. We prune model embedding size, attention heads, and MLP intermediate dimension, following which, we perform continued training with distillation to arrive at the final models.
 
-Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper]() for more details.
+Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper](https://arxiv.org/abs/2407.14679) for more details.
 
 Minitron models are for research and development only.
 
@@ -60,7 +60,8 @@ If you find our work helpful, please consider citing our paper:
 @article{minitron2024,
 title={Compact Language Models via Pruning and Knowledge Distillation},
 author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
-journal={arXiv preprint arXiv:
-year={2024}
+journal={arXiv preprint arXiv:2407.14679},
+year={2024},
+url={https://arxiv.org/abs/2407.14679},
 }
 ```
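For readers unfamiliar with the "continued training with distillation" the README refers to, the sketch below illustrates what a single logit-distillation step from the larger teacher (e.g. the 15B base) into a pruned student could look like. This is a minimal, assumption-laden illustration, not the Minitron training recipe: the HF-style `.logits` access, the temperature value, and the plain forward-KL loss are placeholders.

```python
# Minimal sketch of one logit-distillation step (illustrative only).
# Assumes `student` and `teacher` are causal LMs whose forward pass returns an
# object with a `.logits` tensor of shape [batch, seq_len, vocab].
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, input_ids, optimizer, temperature=1.0):
    # Teacher is frozen; only the pruned student is updated.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # Forward KL between teacher and student next-token distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this step would run over the continued-training corpus for far fewer tokens than training the student from scratch, which is where the token and compute savings quoted in the README come from.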