---
language: en
tags:
- exbert

---


# OLM GPT-2 October 2022

This is a more up-to-date version of the [original GPT-2](https://huggingface.co/gpt2).
In addition to being more up-to-date, it also tends to perform better than the original GPT-2 on standard benchmarks.
It was trained on a cleaned October 2022 snapshot of Common Crawl and Wikipedia.

This model was created as part of the OLM project, which has the goal of continuously training and releasing models that are up-to-date and comparable in standard language model performance to their static counterparts.
This is important because we want our models to know about events like COVID or 
a presidential election right after they happen.

## Intended uses

You can use the raw model for text generation or fine-tune it to a downstream task.
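
As a rough illustration of the fine-tuning path, here is a minimal sketch using the `transformers` `Trainer` for causal language modeling. The dataset (`wikitext`) and the hyperparameters are placeholders for your own downstream data, not part of the OLM setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("olm/olm-gpt2-oct-2022")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained("olm/olm-gpt2-oct-2022")

# Placeholder corpus; swap in your own downstream dataset.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="olm-gpt2-finetuned",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```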

## How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> # It is important to include the bad_words_ids=[[0,2]] argument if you want this model to stay on topic.
>>> # Otherwise, the model may generate start and end tokens followed by text that is not relevant to
>>> # the previous text.
>>> generator = pipeline('text-generation', model='olm/olm-gpt2-oct-2022', bad_words_ids=[[0,2]])
>>> set_seed(42)
>>> # This example also illustrates that sometimes our model generates
>>> # bloggy/spammy/webby things, even though it gets higher evaluation results
>>> # than the original GPT-2 across a variety of benchmarks. See the first output.
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[
{'generated_text': "Hello, I'm a language model, but you can take me if I want.\nReplyDelete\nReplies\nReply\nAnonymous October 17, 2011"},
{'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The release date for the new release of"},
{'generated_text': "Hello, I'm a language model, I'm not a developer or anybody who's working on those. I'm a freelancer... I"},
{'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just curious about the"},
{'generated_text': "Hello, I'm a language model, I'm passionate about languages, but I don't understand how my system works, the interaction"}
]
```
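
If you prefer to call `generate` directly instead of going through the pipeline, the same `bad_words_ids` argument applies there too. This is a small sketch; the sampling settings are illustrative rather than the exact pipeline defaults:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("olm/olm-gpt2-oct-2022")
model = AutoModelForCausalLM.from_pretrained("olm/olm-gpt2-oct-2022")

set_seed(42)
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,                        # sample rather than decode greedily
    max_length=30,
    num_return_sequences=5,
    bad_words_ids=[[0, 2]],                # keep the model from emitting the start/end tokens
    pad_token_id=tokenizer.eos_token_id,   # silence the open-end generation warning
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```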

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-oct-2022')
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-oct-2022')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
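
The `output` above holds next-token logits. If the features you are after are the hidden states, you can request them with the standard `output_hidden_states` flag, roughly as follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("olm/olm-gpt2-oct-2022")
model = AutoModelForCausalLM.from_pretrained("olm/olm-gpt2-oct-2022")

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

# Tuple of (num_layers + 1) tensors; take the final layer as the text's features.
last_hidden_state = output.hidden_states[-1]
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```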

## Dataset

The model and tokenizer were trained with this [October 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295) plus this [October 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221001).\
The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-october-2022-tokenized-1024).\
The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
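
If you want to inspect the data yourself, the datasets can be loaded with the `datasets` library. The sketch below assumes the usual `train` split and uses streaming so nothing is downloaded up front:

```python
from datasets import load_dataset

# Stream the cleaned October 2022 Common Crawl snapshot.
cc = load_dataset(
    "olm/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295",
    split="train",
    streaming=True,
)
print(next(iter(cc)))

# The pre-tokenized, concatenated version used for training.
tokenized = load_dataset("olm/olm-october-2022-tokenized-1024", split="train", streaming=True)
print(next(iter(tokenized)).keys())
```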

## Training

The model was trained according to the OLM GPT2 instructions at this [repo](https://github.com/huggingface/olm-training).

## Evaluation results

The model achieves the following results without any fine-tuning (zero-shot):

| Task        | Metric     | Original GPT2       | OLM GPT2 Oct 2022 (Ours) | Significance of Difference (two-tailed p-value) |
|:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
|rte          |acc         |0.5307               |0.5415                    |0.7188                             |
|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6638**/**0.6670**     |**0.0020**/**0.0002**              |
|copa         |acc         |0.6400               |0.6900                    |0.3000                             |
|record       |f1/em       |**0.7094**/**0.7026**|0.6874/0.6810             |**0.0000**/**0.0000**              |
|boolq        |acc         |0.4872               |**0.5606**                |**0.0000**                         |
|cb           |acc/f1      |0.4101/0.2619        |0.3571/0.1754             |0.4193/NA                          |
|hellaswag    |acc/acc_norm|0.2892/0.3114        |**0.3076**/**0.3491**     |**0.0000**/**0.0000**              |
|mrpc         |acc/f1      |0.5662/0.6911        |**0.6495**/**0.7741**     |**0.0007**/**0.0002**              |
|multirc      |acc         |0.0189               |0.0115                    |0.0959                             |
|lambada      |ppl/acc     |40.0554/0.3256       |**28.6733**/**0.3625**    |**0.0000**/**0.0000**              |
|wsc          |acc         |0.4327               |0.3654                    |0.1679                             |
|wic          |acc         |0.4922               |0.5000                    |0.6924                             |
|mnli         |acc         |0.3372               |**0.3471**                |**0.0384**                         |
|qnli         |acc         |0.5017               |0.4981                    |0.5884                             |
|cola         |mcc         |0.0126               |0.0181                    |0.8614                             |
|triviaqa     |acc         |0.0151               |**0.0182**                |**0.0048**                         |
|winogrande   |acc         |0.5162               |0.5114                    |0.7360                             |
|webqs        |acc         |0.0030               |**0.0108**                |**0.0000**                         |
|arc_easy     |acc/acc_norm|0.4381/0.3948        |**0.4651**/**0.4247**     |**0.0082**/**0.0029**              |
|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.1997/0.2329             |0.4132/0.6256                      |

To get these results, we used commit `4f0410a4be0049729078376ce36a42dc308b6e38` of the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
which can produce results different from those reported in the GPT-2 paper.
We added a change [here](https://github.com/EleutherAI/lm-evaluation-harness/compare/master...mathemakitten:lm-evaluation-harness:master) to enable evaluation of the OLM GPT2, which has a slightly different vocab size.
The p-values come from the standard errors reported by the evaluation harness, combined with a normal distribution assumption.
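
Concretely, each two-tailed p-value can be reproduced from the two metric values and their standard errors with a two-sample z-test under that normal assumption, roughly as sketched below. The accuracies are the `boolq` row from the table; the standard errors are illustrative placeholders rather than the exact values the harness reports:

```python
import math

def two_tailed_p(mean_a, se_a, mean_b, se_b):
    """Two-sample z-test p-value under a normal approximation."""
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # 2 * (1 - Phi(|z|)), expressed via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

# boolq: original GPT-2 vs. OLM GPT2 Oct 2022 (standard errors are placeholders).
print(two_tailed_p(0.4872, 0.0087, 0.5606, 0.0087))
```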