OLM GPT-2 December 2022

This is a more up-to-date version of the original GPT-2 that also tends to perform better than the original on standard benchmarks. It was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.

This model was created as part of the OLM project, which has the goal of continuously training and releasing models that are up-to-date and comparable in standard language model performance to their static counterparts. This is important because we want our models to know about events like COVID or a presidential election right after they happen.

Intended uses

You can use the raw model for text generation or fine-tune it to a downstream task.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> # It is important to include the bad_words_ids=[[0,2]] if you want this model to stay on topic.
>>> # Otherwise, the model may generate start and end tokens followed by text that is not relevant to
>>> # the previous text.
>>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, but you want to know if I have a language in that language. Is this possible? Please explain"},
 {'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The C++ API is becoming more and more popular for"},
 {'generated_text': "Hello, I'm a language model, I'm not trying to learn or understand a new tool, my job is to be as happy as"},
 {'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just a curious guy.\n"},
 {'generated_text': "Hello, I'm a language model, I'm not doing anything that needs to be done for the current time (or previous)."}]

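Here is how to run this model on a given text in PyTorch. Because it is loaded with AutoModelForCausalLM, the output holds next-token logits (output.logits) rather than pooled features: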

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-dec-2022')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
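
As a rough sketch of what to do with that output: `output.logits` has shape (batch, sequence length, vocab size), and the last position scores every candidate next token. The helper below is hypothetical (it is not part of transformers) and uses plain Python on a made-up row of logits so that it runs without downloading the model. With the real model you would pass `output.logits[0, -1].tolist()` and map the returned ids back to text with `tokenizer.decode`.

```python
import math

def top_k_next_tokens(last_logits, k=3):
    """Return the k most likely (token_id, probability) pairs from one row of logits."""
    # Softmax: subtract the max before exponentiating for numerical stability.
    m = max(last_logits)
    exps = [math.exp(x - m) for x in last_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Toy logits over a 4-token vocabulary (made-up numbers, not model output).
toy_logits = [2.0, 0.5, -1.0, 2.0]
top = top_k_next_tokens(toy_logits, k=2)
for token_id, prob in top:
    print(token_id, round(prob, 3))
```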

Dataset

The model and tokenizer were trained with this December 2022 cleaned Common Crawl dataset plus this December 2022 cleaned Wikipedia dataset.
The tokenized version of these concatenated datasets is here.
The datasets were created with this repo.

Training

The model was trained according to the OLM GPT2 instructions at this repo.

Evaluation results

The model achieves the following results without any fine-tuning (zero-shot):

Task	Metric	Original GPT2	OLM GPT2 Dec 2022 (Ours)	Significance of Difference (two-tailed p-value)
rte	acc	0.5307	0.5199	0.7184
piqa	acc/acc_norm	0.6289/0.6251	0.6692/0.6665	0.0004/0.0003
copa	acc	0.6400	0.6800	0.4070
record	f1/em	0.7094/0.7026	0.6884/0.6818	0.0000/0.0000
boolq	acc	0.4872	0.6021	0.0000
cb	acc/f1	0.4107/0.2619	0.3393/0.1840	0.2816/NA
hellaswag	acc/acc_norm	0.2892/0.3114	0.3079/0.3482	0.0000/0.0000
mrpc	acc/f1	0.5662/0.6911	0.6814/0.8099	0.0000/0.0000
multirc	acc	0.0189	0.0220	0.4755
lambada	ppl/acc	40.0554/0.3256	28.3359/0.3699	0.0000/0.0000
wsc	acc	0.4327	0.3654	0.1680
wic	acc	0.4922	0.5000	0.6924
mnli	acc	0.3372	0.3501	0.0071
qnli	acc	0.5017	0.4946	0.2913
cola	mcc	0.0126	0.0000	0.6880
triviaqa	acc	0.0151	0.0181	0.0088
winogrande	acc	0.5162	0.5051	0.4314
webqs	acc	0.0030	0.0079	0.0000
arc_easy	acc/acc_norm	0.4381/0.3948	0.4693/0.4230	0.0022/0.0049
arc_challenge	acc/acc_norm	0.1903/0.2270	0.2090/0.2398	0.1017/0.2957

To get these results, we used commit f079e322b857714fcef1ada9e78ddc606fe51e84 of the EleutherAI evaluation harness here, which can produce results different from those reported in the GPT2 paper. We added a change here to enable evaluation of the OLM GPT2, which has a very slightly different vocab size. The p-values are computed from the standard errors (stderr) reported by the evaluation harness, under a normal distribution assumption.
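
One common way to turn two scores and their standard errors into a two-tailed p-value under a normal assumption is a z-test on the difference. The sketch below shows that calculation. It assumes the two estimates are independent, and it may not match the exact procedure used for the table above. The function name is hypothetical, and the stderr values in the example are placeholders, not numbers from the harness.

```python
import math

def two_tailed_p_value(score_a, stderr_a, score_b, stderr_b):
    """Two-tailed p-value for the difference of two scores (normal approximation)."""
    # Standard error of the difference, treating the two estimates as independent.
    se_diff = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    if se_diff == 0.0:
        return float("nan")  # no uncertainty estimate, so no test is possible
    z = (score_a - score_b) / se_diff
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Accuracies are taken from the boolq row; both stderr values are
# hypothetical placeholders chosen only to make the example run.
p = two_tailed_p_value(0.4872, 0.0087, 0.6021, 0.0086)
print(f"{p:.4f}")
```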