crumb
/

distilpythia

@@ -13,7 +13,7 @@ language:
 Transformer models have become a popular choice for natural language processing (NLP) tasks due to their ability to handle long-range dependencies and their superior performance on various NLP benchmarks. The transformer model architecture was introduced in 2017 by [Vaswani et al](https://arxiv.org/abs/1706.03762). and has since been used in many state-of-the-art models such as BERT and GPT. The decoder-only transformer model is a variant of the transformer model that has is commonly used for generative tasks in NLP.  It uses masked self-attention to predict the next token in a sequence and has been shown to be powerful at predicting sequences of text.
-Distillation \[[Bucila et al., 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf), [Hinton et al., 2015](https://arxiv.org/abs/1503.02531)\] is a technique used in machine learning to compress a large model into a smaller one that can be used on devices with limited computational resources. In this technique, a smaller model is trained to mimic the behavior of a larger model by learning from its predictions. The smaller model is trained on a smaller dataset than the larger model, which makes it faster and more efficient. This technique has been used to compress models like BERT and GPT-2 into smaller models like DistilBERT and DistilGPT-2, respectively. In this project we apply the technique of knowledge distillation to the second smallest [Pythia](https://arxiv.org/pdf/2304.01373.pdf) model on the [Pile](https://arxiv.org/abs/2101.00027) dataset. The result is a model 43.75% the size of the original which reaches similar performance to the original.
 ### Method

 Transformer models have become a popular choice for natural language processing (NLP) tasks due to their ability to handle long-range dependencies and their superior performance on various NLP benchmarks. The transformer model architecture was introduced in 2017 by [Vaswani et al](https://arxiv.org/abs/1706.03762). and has since been used in many state-of-the-art models such as BERT and GPT. The decoder-only transformer model is a variant of the transformer model that has is commonly used for generative tasks in NLP.  It uses masked self-attention to predict the next token in a sequence and has been shown to be powerful at predicting sequences of text.
+Distillation \[[Bucila et al., 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf), [Hinton et al., 2015](https://arxiv.org/abs/1503.02531)\] is a technique used in machine learning to compress a large model into a smaller one that can be used on devices with limited computational resources. In this technique, a smaller model is trained to mimic the behavior of a larger model by learning from its predictions. The smaller model is trained on a smaller dataset than the larger model, which makes it faster and more efficient. This technique has been used to compress models like BERT and GPT-2 into smaller models like DistilBERT and DistilGPT-2, respectively. In this project we apply the technique of knowledge distillation to the second smallest [Pythia](https://arxiv.org/pdf/2304.01373.pdf) model on the [Pile](https://arxiv.org/abs/2101.00027) dataset.
 ### Method