---
language: en
license: mit
---

# Model description

This repository aims to re-create the GPT-1 architecture using HuggingFace's `transformers`. The original paper of the model can be found [here][gpt1-paper], the accompanying blog post [here][gpt1-blog], and the original code and weights [here][gpt1-code].

As noted in OpenAI's blog post, the original model was trained for 1 month on 8 GPUs (P600s) on the original BookCorpus dataset, which contains around 7,000 books. This model is instead trained on the [BookCorpusOpen][bco-dataset] dataset, which contains around 17,000 books (~6 GB). The tokenized dataset (~9 GB) can be found in `data/` in this repository.

The tokenizer is a BPE tokenizer with 40,000 merges, as in the original paper. It is re-implemented using HuggingFace's `tokenizers` library and trained on the [BookCorpusOpen][bco-dataset] dataset.

[gpt1-paper]: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[gpt1-blog]: https://openai.com/research/language-unsupervised
[gpt1-code]: https://github.com/openai/finetune-transformer-lm/
[bco-dataset]: https://huggingface.co/datasets/lucadiliello/bookcorpusopen

# How to use

- See `preprocessing.py` for how the data was preprocessed and tokenized.
- See `pre_training.py` for how the model was pre-trained.
- See `inference.py` for an inference example.

## Converted model

Inside `gpt1-converted-weights/` is the model converted to safetensors from the original weights, which can be used directly with the code in this repo. The conversion script and the original weights can also be found there.
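As a rough illustration of the kind of usage shown in `inference.py`, below is a minimal sketch of how the converted checkpoint might be loaded and sampled from with `transformers`. The checkpoint path, the use of the `Auto*` classes, and the generation settings are assumptions; `inference.py` in this repository is the authoritative example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to the converted weights; adjust to where the
# checkpoint and tokenizer files actually live (e.g. gpt1-converted-weights/).
checkpoint = "gpt1-converted-weights"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

prompt = "The book begins with"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,  # illustrative settings, not the ones used in inference.py
        do_sample=True,
        top_k=40,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```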
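Similarly, the following is a rough sketch of how a BPE tokenizer with 40,000 merges could be trained on BookCorpusOpen with the `tokenizers` library. The text column name, special tokens, pre-tokenizer, and output file are assumptions; see `preprocessing.py` for the actual setup used for this model.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BookCorpusOpen, one record per book; the "text" column name is an assumption.
dataset = load_dataset("lucadiliello/bookcorpusopen", split="train")

# BPE model; vocab_size stands in for the 40,000 merges of the original paper.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

trainer = trainers.BpeTrainer(vocab_size=40_000, special_tokens=["<unk>"])

def batch_iterator(batch_size=1000):
    # Stream the books in batches instead of materialising one giant list.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tokenizer.save("tokenizer.json")
```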