srvm committed
Commit 7bc9a51 · 1 Parent(s): 615c28e

Add arXiv details

Files changed (1)
README.md +4 -3
README.md CHANGED
@@ -9,7 +9,7 @@ license_link: >-
 
 Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) model. We prune model embedding size, attention heads, and MLP intermediate dimension, following which, we perform continued training with distillation to arrive at the final models.
 
-Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper]() for more details.
+Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper](https://arxiv.org/abs/2407.14679) for more details.
 
 Minitron models are for research and development only.
 
@@ -60,7 +60,8 @@ If you find our work helpful, please consider citing our paper:
 @article{minitron2024,
   title={Compact Language Models via Pruning and Knowledge Distillation},
   author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
-  journal={arXiv preprint arXiv:XXX},
-  year={2024}
+  journal={arXiv preprint arXiv:2407.14679},
+  year={2024},
+  url={https://arxiv.org/abs/2407.14679},
 }
 ```
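For context on the README text in this diff: the models are described as being obtained by pruning width dimensions (embedding size, attention heads, MLP intermediate dimension) of the 15B base model and then continuing training with distillation against it. Below is a minimal sketch of the distillation half of that recipe, assuming Hugging Face-style causal LMs whose forward pass returns `.logits`; this is not the authors' NeMo code, and the function names and temperature are illustrative only.

```python
# Minimal sketch (not the authors' implementation) of continued training with
# distillation: a frozen teacher (the original 15B model) supervises a pruned student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def distill_step(student, teacher, input_ids, optimizer):
    # Teacher is frozen; only the pruned student receives gradient updates.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```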