Update README.md
Browse files
README.md
CHANGED
@@ -13,7 +13,7 @@ language:
|
|
13 |
|
14 |
Transformer models have become a popular choice for natural language processing (NLP) tasks due to their ability to handle long-range dependencies and their superior performance on various NLP benchmarks. The transformer model architecture was introduced in 2017 by [Vaswani et al](https://arxiv.org/abs/1706.03762). and has since been used in many state-of-the-art models such as BERT and GPT. The decoder-only transformer model is a variant of the transformer model that has is commonly used for generative tasks in NLP. It uses masked self-attention to predict the next token in a sequence and has been shown to be powerful at predicting sequences of text.
|
15 |
|
16 |
-
Distillation \[[Bucila et al., 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf), [Hinton et al., 2015](https://arxiv.org/abs/1503.02531)\] is a technique used in machine learning to compress a large model into a smaller one that can be used on devices with limited computational resources. In this technique, a smaller model is trained to mimic the behavior of a larger model by learning from its predictions. The smaller model is trained on a smaller dataset than the larger model, which makes it faster and more efficient. This technique has been used to compress models like BERT and GPT-2 into smaller models like DistilBERT and DistilGPT-2, respectively. In this project we apply the technique of knowledge distillation to the second smallest [Pythia](https://arxiv.org/pdf/2304.01373.pdf) model on the [Pile](https://arxiv.org/abs/2101.00027) dataset.
|
17 |
|
18 |
### Method
|
19 |
|
|
|
13 |
|
14 |
Transformer models have become a popular choice for natural language processing (NLP) tasks due to their ability to handle long-range dependencies and their superior performance on various NLP benchmarks. The transformer model architecture was introduced in 2017 by [Vaswani et al](https://arxiv.org/abs/1706.03762). and has since been used in many state-of-the-art models such as BERT and GPT. The decoder-only transformer model is a variant of the transformer model that has is commonly used for generative tasks in NLP. It uses masked self-attention to predict the next token in a sequence and has been shown to be powerful at predicting sequences of text.
|
15 |
|
16 |
+
Distillation \[[Bucila et al., 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf), [Hinton et al., 2015](https://arxiv.org/abs/1503.02531)\] is a technique used in machine learning to compress a large model into a smaller one that can be used on devices with limited computational resources. In this technique, a smaller model is trained to mimic the behavior of a larger model by learning from its predictions. The smaller model is trained on a smaller dataset than the larger model, which makes it faster and more efficient. This technique has been used to compress models like BERT and GPT-2 into smaller models like DistilBERT and DistilGPT-2, respectively. In this project we apply the technique of knowledge distillation to the second smallest [Pythia](https://arxiv.org/pdf/2304.01373.pdf) model on the [Pile](https://arxiv.org/abs/2101.00027) dataset.
|
17 |
|
18 |
### Method
|
19 |
|