---
language: en
license: mit
---

# Model description

This repository aims to re-create the GPT-1 architecture using HuggingFace's `transformers`. The original paper of the model can be found [here][gpt1-paper], the accompanying blog post [here][gpt1-blog], and the original code and weights [here][gpt1-code].

As noted in OpenAI's blog post, the original model was trained for 1 month on 8 GPUs (P600s) on the original BookCorpus dataset, which contains around 7,000 books. This model is instead trained on the [BookCorpusOpen][bco-dataset] dataset, which contains around 17,000 books (~6 GB). The tokenized dataset (~9 GB) can be found in `data/` in this repository.

The tokenizer is a BPE tokenizer with 40,000 merges, as in the original paper. It is re-implemented using HuggingFace's `tokenizers` library and trained on the [BookCorpusOpen][bco-dataset] dataset.

[gpt1-paper]: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[gpt1-blog]: https://openai.com/research/language-unsupervised
[gpt1-code]: https://github.com/openai/finetune-transformer-lm/
[bco-dataset]: https://huggingface.co/datasets/lucadiliello/bookcorpusopen

# How to use

- See `preprocessing.py` for how the data was preprocessed and tokenized.
- See `pre_training.py` for how the model was pre-trained.
- See `inference.py` for an inference example.

## Converted model

Inside `gpt1-converted-weights/` is the model converted to safetensors from the original weights, which can be used directly with the code in this repo. The conversion script and the original weights can also be found there.
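As a rough illustration of the kind of usage shown in `inference.py`, below is a minimal sketch of how the converted checkpoint might be loaded and sampled from with `transformers`. The checkpoint path, the use of the `Auto*` classes, and the generation settings are assumptions; `inference.py` in this repository is the authoritative example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to the converted weights; adjust to where the
# checkpoint and tokenizer files actually live (e.g. gpt1-converted-weights/).
checkpoint = "gpt1-converted-weights"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

prompt = "The book begins with"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,  # illustrative settings, not the ones used in inference.py
        do_sample=True,
        top_k=40,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```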
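Similarly, the following is a rough sketch of how a BPE tokenizer with 40,000 merges could be trained on BookCorpusOpen with the `tokenizers` library. The text column name, special tokens, pre-tokenizer, and output file are assumptions; see `preprocessing.py` for the actual setup used for this model.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BookCorpusOpen, one record per book; the "text" column name is an assumption.
dataset = load_dataset("lucadiliello/bookcorpusopen", split="train")

# BPE model; vocab_size stands in for the 40,000 merges of the original paper.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

trainer = trainers.BpeTrainer(vocab_size=40_000, special_tokens=["<unk>"])

def batch_iterator(batch_size=1000):
    # Stream the books in batches instead of materialising one giant list.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tokenizer.save("tokenizer.json")
```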