---
language: en
license: mit
---
# Model description
This repository aims to re-create the GPT-1 architecture using HuggingFace's
`transformers` library.
The original paper of the model can be found [here][gpt1-paper]. The blog post
accompanying this paper is [here][gpt1-blog]. The code and weights can be found
[here][gpt1-code].
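
For reference, the GPT-1 hyperparameters from the paper (12 decoder layers, 12
attention heads, 768-dimensional hidden states, a 512-token context window and a
~40,000-merge BPE vocabulary) can be written out with `transformers`' built-in
`OpenAIGPTConfig`. This is only an illustration of the model's size; it is not
the configuration class used by this repository.

```python
# Illustration only: GPT-1 hyperparameters expressed with transformers'
# built-in OpenAIGPTConfig (this repo re-implements the architecture itself).
from transformers import OpenAIGPTConfig, OpenAIGPTLMHeadModel

config = OpenAIGPTConfig(
    vocab_size=40478,  # ~40,000 BPE merges plus byte/special tokens
    n_positions=512,   # context window
    n_embd=768,        # hidden size
    n_layer=12,        # transformer decoder blocks
    n_head=12,         # attention heads
)
model = OpenAIGPTLMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 117M parameters
```
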
As noted in OpenAI's blog post, the original model was trained for 1 month on 8
GPUs (P600s), on the original BookCorpus dataset (around 7,000 books).
This model is instead trained on the [BookCorpusOpen][bco-dataset] dataset,
which contains ~17,000 books (around 6 GB). The tokenized dataset (~9 GB) can be
found in `data/` in this repository. The tokenizer is a BPE tokenizer with
40,000 vocabulary merges, as in the original paper. It is re-implemented using
the HuggingFace `tokenizers` library and trained on the
[BookCorpusOpen][bco-dataset] dataset.
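
As an illustration (see `preprocessing.py` for the code actually used), a BPE
tokenizer like this can be trained with the `tokenizers` library roughly as
follows; the `text` column name and the `<unk>` special token are assumptions:

```python
# Minimal sketch, not the repository's preprocessing.py: train a BPE tokenizer
# on BookCorpusOpen with HuggingFace `tokenizers`.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

dataset = load_dataset("lucadiliello/bookcorpusopen", split="train")

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# vocab_size roughly corresponds to the ~40,000 merges used in the paper
trainer = trainers.BpeTrainer(vocab_size=40000, special_tokens=["<unk>"])

def batch_iterator(batch_size=1000):
    # assumes the dataset exposes the book contents in a "text" column
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```
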
[gpt1-paper]: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[gpt1-blog]: https://openai.com/research/language-unsupervised
[gpt1-code]: https://github.com/openai/finetune-transformer-lm/
[bco-dataset]: https://huggingface.co/datasets/lucadiliello/bookcorpusopen
# How to use
See `preprocessing.py` for how the data was preprocessed and tokenized.
See `pre_training.py` for how the model was pre-trained.
See `inference.py` for an example of running the model.
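
For a rough idea of what inference looks like, a minimal sketch is shown below.
It assumes the model and tokenizer can be loaded from this repository with the
`Auto*` classes (with `trust_remote_code=True` for the custom code); the repo id
is a placeholder, and `inference.py` remains the authoritative example.

```python
# Minimal sketch of generation; see inference.py for the repository's example.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "path/to/this/repo"  # placeholder; replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
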
## Converted model
Inside `gpt1-converted-weights/` are the original weights converted to
safetensors format, which can be used directly with the code in this repository.
The conversion script and the original weights can also be found there.
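
As a quick sanity check, the converted weights can be inspected with the
`safetensors` library; the exact file name below is an assumption:

```python
# Inspect the converted weights; the file name is an assumption.
from safetensors.torch import load_file

state_dict = load_file("gpt1-converted-weights/model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```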