---
language: en
license: mit
---

# Model description

This repository aims to re-create the GPT-1 architecture using HuggingFace's
`transformers` library.

The original paper of the model can be found [here][gpt1-paper]. The blog post
accompanying the paper is [here][gpt1-blog]. The original code and weights can
be found [here][gpt1-code].

As noted in OpenAI's blog post, the original model was trained for 1 month on 8
P600 GPUs, using the original BookCorpus dataset (around 7,000 books).

This model is instead trained on the [BookCorpusOpen][bco-dataset] dataset,
which contains ~17,000 books (around 6 GB). The tokenized dataset (~9 GB) can
be found in `data/` in this repository. The tokenizer is a BPE tokenizer with
40,000 merges, as in the original paper. It is re-implemented using the
HuggingFace `tokenizers` library and trained on the
[BookCorpusOpen][bco-dataset] dataset.
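
Below is a minimal sketch of how such a tokenizer could be trained with the
`tokenizers` library. It is illustrative only: the vocabulary size, special
tokens, pre-tokenizer, output path, and the `text` column name are assumptions,
and the actual settings live in this repository's scripts.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Load BookCorpusOpen from the Hub; the "text" column name is an assumption.
books = load_dataset("lucadiliello/bookcorpusopen", split="train")

# BPE tokenizer with a ~40,000-entry vocabulary, roughly mirroring GPT-1's
# 40,000 merges.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=40_000, special_tokens=["<unk>", "<pad>"])

def batches(batch_size=1_000):
    # Stream the raw text in batches to keep memory bounded.
    for i in range(0, len(books), batch_size):
        yield books[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batches(), trainer=trainer)
tokenizer.save("tokenizer.json")
```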

[gpt1-paper]: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[gpt1-blog]: https://openai.com/research/language-unsupervised
[gpt1-code]: https://github.com/openai/finetune-transformer-lm/
[bco-dataset]: https://huggingface.co/datasets/lucadiliello/bookcorpusopen

# How to use

See `preprocessing.py` for how the data was preprocessed and tokenized.
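
For illustration, a sketch of this kind of preprocessing is shown below: it
tokenizes the books and packs them into fixed-length blocks. The block size,
file paths, and `text` column name are assumptions; `preprocessing.py` is the
authoritative version.

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

# Illustrative only -- see preprocessing.py for the real pipeline.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
books = load_dataset("lucadiliello/bookcorpusopen", split="train")

block_size = 512  # GPT-1 context length

def tokenize_and_pack(batch):
    # Tokenize a batch of books, concatenate them, and cut into fixed blocks.
    ids = tokenizer(batch["text"])["input_ids"]
    flat = [tok for seq in ids for tok in seq]
    usable = (len(flat) // block_size) * block_size
    return {"input_ids": [flat[i : i + block_size] for i in range(0, usable, block_size)]}

packed = books.map(tokenize_and_pack, batched=True, remove_columns=books.column_names)
packed.save_to_disk("data/")
```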

See `pre_training.py` for how the model was pre-trained.
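
A condensed sketch of a GPT-1-style pre-training loop with `Trainer` is shown
below. The hyperparameters are illustrative (loosely based on the paper's
published values), and the exact configuration used for this model is in
`pre_training.py`.

```python
from datasets import load_from_disk
from transformers import (
    DataCollatorForLanguageModeling,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", pad_token="<pad>")
dataset = load_from_disk("data/")

# GPT-1-sized model: 12 layers, 12 heads, 768 hidden units, 512-token context
# (the transformers defaults for OpenAIGPTConfig).
config = OpenAIGPTConfig(vocab_size=tokenizer.vocab_size, n_positions=512)
model = OpenAIGPTLMHeadModel(config)

# Causal language modelling: labels are the input ids (shifted inside the model).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=2.5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=2_000,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```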

See `inference.py` for an example.
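
A minimal generation example could look like the following; the paths are
placeholders and `inference.py` remains the reference.

```python
from transformers import PreTrainedTokenizerFast, pipeline

# Paths are placeholders for wherever the tokenizer and model were saved.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", pad_token="<pad>")
generator = pipeline("text-generation", model="checkpoints", tokenizer=tokenizer)
print(generator("The meaning of life is", max_new_tokens=40)[0]["generated_text"])
```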

## Converted model

Inside `gpt1-converted-weights/` are the original weights converted to the
safetensors format, which can be used directly with the code in this repo. The
conversion script and the original weights can also be found there.
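
Assuming the folder follows the usual `transformers` layout (a `config.json`
alongside the safetensors weights), it can be loaded like any other checkpoint.
Pairing it with the stock GPT-1 tokenizer from the Hub, as shown here, is an
assumption.

```python
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

# Load the converted GPT-1 weights from this repository's folder.
model = OpenAIGPTLMHeadModel.from_pretrained("gpt1-converted-weights")

# The converted weights use the original GPT-1 vocabulary, so the stock
# tokenizer from the Hub is assumed here.
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-community/openai-gpt")
```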