---
language: en
license: mit
---

# Model description

This repository aims to re-create the GPT-1 architecture using HuggingFace's
`transformers` library.

The original paper of the model can be found [here][gpt1-paper]. The blog post
accompanying the paper is [here][gpt1-blog]. The original code and weights can
be found [here][gpt1-code].

As noted in OpenAI's blog post, the original model was trained for 1 month on 8
P600 GPUs, using the original BookCorpus dataset (around 7,000 books).

This model is instead trained on the [BookCorpusOpen][bco-dataset] dataset,
which contains ~17,000 books (around 6 GB). The tokenized dataset (~9 GB) can
be found in `data/` in this repository. The tokenizer is a BPE tokenizer with
40,000 merges, as in the original paper. It is re-implemented using the
HuggingFace `tokenizers` library and trained on the
[BookCorpusOpen][bco-dataset] dataset.
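
Below is a minimal sketch of how such a tokenizer could be trained with the
`tokenizers` library. It is illustrative only: the vocabulary size, special
tokens, pre-tokenizer, output path, and the `text` column name are assumptions,
and the actual settings live in this repository's scripts.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Load BookCorpusOpen from the Hub; the "text" column name is an assumption.
books = load_dataset("lucadiliello/bookcorpusopen", split="train")

# BPE tokenizer with a ~40,000-entry vocabulary, roughly mirroring GPT-1's
# 40,000 merges.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=40_000, special_tokens=["<unk>", "<pad>"])

def batches(batch_size=1_000):
    # Stream the raw text in batches to keep memory bounded.
    for i in range(0, len(books), batch_size):
        yield books[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batches(), trainer=trainer)
tokenizer.save("tokenizer.json")
```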

[gpt1-paper]: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[gpt1-blog]: https://openai.com/research/language-unsupervised
[gpt1-code]: https://github.com/openai/finetune-transformer-lm/
[bco-dataset]: https://huggingface.co/datasets/lucadiliello/bookcorpusopen

# How to use

See `preprocessing.py` for how the data was preprocessed and tokenized.
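
For illustration, a sketch of this kind of preprocessing is shown below: it
tokenizes the books and packs them into fixed-length blocks. The block size,
file paths, and `text` column name are assumptions; `preprocessing.py` is the
authoritative version.

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

# Illustrative only -- see preprocessing.py for the real pipeline.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
books = load_dataset("lucadiliello/bookcorpusopen", split="train")

block_size = 512  # GPT-1 context length

def tokenize_and_pack(batch):
    # Tokenize a batch of books, concatenate them, and cut into fixed blocks.
    ids = tokenizer(batch["text"])["input_ids"]
    flat = [tok for seq in ids for tok in seq]
    usable = (len(flat) // block_size) * block_size
    return {"input_ids": [flat[i : i + block_size] for i in range(0, usable, block_size)]}

packed = books.map(tokenize_and_pack, batched=True, remove_columns=books.column_names)
packed.save_to_disk("data/")
```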

See `pre_training.py` for how the model was pre-trained.
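
A condensed sketch of a GPT-1-style pre-training loop with `Trainer` is shown
below. The hyperparameters are illustrative (loosely based on the paper's
published values), and the exact configuration used for this model is in
`pre_training.py`.

```python
from datasets import load_from_disk
from transformers import (
    DataCollatorForLanguageModeling,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", pad_token="<pad>")
dataset = load_from_disk("data/")

# GPT-1-sized model: 12 layers, 12 heads, 768 hidden units, 512-token context
# (the transformers defaults for OpenAIGPTConfig).
config = OpenAIGPTConfig(vocab_size=tokenizer.vocab_size, n_positions=512)
model = OpenAIGPTLMHeadModel(config)

# Causal language modelling: labels are the input ids (shifted inside the model).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=2.5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=2_000,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```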

See `inference.py` for an example.
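
A minimal generation example could look like the following; the paths are
placeholders and `inference.py` remains the reference.

```python
from transformers import PreTrainedTokenizerFast, pipeline

# Paths are placeholders for wherever the tokenizer and model were saved.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", pad_token="<pad>")
generator = pipeline("text-generation", model="checkpoints", tokenizer=tokenizer)
print(generator("The meaning of life is", max_new_tokens=40)[0]["generated_text"])
```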

## Converted model

Inside `gpt1-converted-weights/` are the original weights converted to the
safetensors format, which can be used directly with the code in this repo. The
conversion script and the original weights can also be found there.
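
Assuming the folder follows the usual `transformers` layout (a `config.json`
alongside the safetensors weights), it can be loaded like any other checkpoint.
Pairing it with the stock GPT-1 tokenizer from the Hub, as shown here, is an
assumption.

```python
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

# Load the converted GPT-1 weights from this repository's folder.
model = OpenAIGPTLMHeadModel.from_pretrained("gpt1-converted-weights")

# The converted weights use the original GPT-1 vocabulary, so the stock
# tokenizer from the Hub is assumed here.
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-community/openai-gpt")
```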