patrickvonplaten committed
Commit
1876a70
1 Parent(s): b5f37f5

Update README.md

Files changed (1)
  1. README.md +53 -93
README.md CHANGED
@@ -9,95 +9,52 @@ license: mit
9
 
10
  # OPT: Open Pre-trained Transformer Language Models
11
 
12
- Feel free to test the whole generation capabilities here: https://transformer.huggingface.co/doc/opt-30b.
13
 
14
- The models were pretrained on the English language using a causal language modeling (CLM) objective. It was first introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/2205.01068.pdf) and was first released in [metaseq repository](https://github.com/facebookresearch/metaseq) on May 3rd 2022 by the META AI team.
15
 
16
- **Disclaimer**: The team releasing OPT also wrote a
17
- [model card](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/model_card.md) for their model, which is available in the appendix D of their paper. Content from this model card
18
- has been written by the Hugging Face team to complete the information they provided and give specific examples of how to use the model, and the various bias.
19
 
20
  ## Model description
21
 
22
- OPT belongs to the same family of decoder-only models like GPT-3. As such, it was pretrained using the same self-supervised training procedure : it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots
23
- of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely,
24
- it was trained to guess the next word in sentences. This is usually called self-supervised learning.
25
 
26
- More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence,
27
- shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the
28
- predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens.
29
-
30
- This way, the model learns an inner representation of the English language that can then be used to extract features
31
- useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a
32
- prompt.
33
 
34
  ## Intended uses & limitations
35
 
36
- You can use the raw model for text generation or fine-tune it to a downstream task. See the
37
- [model hub](https://huggingface.co/models?filter=opt) to look for fine-tuned versions on a task that interests you.
38
 
39
  ### How to use
40
 
41
- You can use this model directly with a pipeline for text generation. Generation is deterministic, thus in order to use the top-k sampling `do_sample` is set to `True`.
42
 
43
  ```python
44
- >>> from transformers import pipeline, set_seed, OPTForCausalLM, GPT2Tokenizer
45
- >>> model = OPTForCausalLM.from_pretrained("facebook/opt-350m")
46
- >>> model = model.eval()
47
- >>> tokenizer = GPT2Tokenizer.from_pretrained("patrickvonplaten/opt_gpt2_tokenizer")
48
- >>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
49
- >>> set_seed(42)
50
- >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=1)
51
- [{'generated_text': "Hello, I'm a language model, and I'm interested in learning more about the language model.\n\nI'm a language model, and I"}]
52
  ```
53
 
54
- Here is how to use this model to get the hidden states of a given text in PyTorch:
55
 
56
  ```python
57
- >>> from transformers import GPT2Tokenizer, OPTModel
58
- >>> tokenizer = GPT2Tokenizer.from_pretrained("patrickvonplaten/opt_gpt2_tokenizer")
59
- >>> model = OPTModel.from_pretrained("facebook/opt-350m")
60
- >>> text = "I am happy to be releasing a new model!"
61
- >>> encoded_input = tokenizer(text, return_tensors='pt')
62
- >>> output = model(**encoded_input)
63
- BaseModelOutputWithPast(last_hidden_state=tensor([[[-2.4159, 0.7136, -4.6705, ..., -1.3857, 0.4758, -1.5518],
64
- [-1.4122, -2.0026, -9.4849, ..., 1.3589, 3.1777, 0.8622],
65
- [ 0.8425, -5.9863, -5.7204, ..., 2.2054, 4.3147, 0.2039],
66
- ...,
67
- [-0.5943, -0.9686, -2.3670, ..., 6.7386, -4.5704, 3.1795],
68
- [ 0.0582, -5.4449, -3.1305, ..., 3.9461, -2.2183, 1.1721],
69
- [ 0.0547, -4.1437, -0.1780, ..., -0.1648, 0.7273, 0.7006]]],
70
- grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.4485, 0.4126, 0.3829, ..., -0.4228, 0.5844, 0.4145],
71
- [-0.8542, 0.8587, 0.8495, ..., -0.8048, 0.7143, 0.8142],
72
- [-0.6921, 0.6961, 0.6502, ..., -0.6523, 0.5810, 0.6708],
73
- ...,
74
- [-0.6822, 0.6847, 0.6880, ..., -0.6225, 0.5817, 0.6720],
75
- [-0.7208, 0.7355, 0.6723, ..., -0.6821, 0.6895, 0.7070],
76
- [-0.6217, 0.6276, 0.6367, ..., -0.5950, 0.5609, 0.6075]],
77
-
78
- [[-0.0373, -0.4824, 0.0290, ..., -0.5359, 0.5350, 0.1365],
79
- [ 0.8295, -0.3887, -0.7507, ..., -0.2576, -1.1691, 0.6727],
80
- [ 0.5611, -0.3490, -0.5395, ..., -0.2822, -0.7972, 0.5236],
81
- ...,
82
- [ 0.4013, -0.2377, -0.3478, ..., -0.1679, -0.5556, 0.4043],
83
- [ 0.5444, -0.3821, -0.4555, ..., -0.2781, -0.6267, 0.4551],
84
- [ 0.2731, -0.1157, -0.2134, ..., -0.0131, -0.3230, 0.2420]],
85
-
86
- [[-0.8761, 0.8668, 0.8488, ..., -0.7307, -0.8133, 0.7668],
87
- [-0.6488, 0.7369, 0.7716, ..., -0.8711, -0.6874, 0.7305],
88
- [-0.6605, 0.7629, 0.7675, ..., -0.7790, -0.6908, 0.7493],
89
- ...,
90
- [-0.6542, 0.7252, 0.7787, ..., -0.7739, -0.6742, 0.7018],
91
- [-0.7012, 0.7739, 0.8003, ..., -0.8420, -0.7059, 0.7675],
92
- [-0.5077, 0.5662, 0.6203, ..., -0.7885, -0.5262, 0.5924]],
93
-
94
- ...,
95
- ]]], hidden_states=None, attentions=None)
96
  ```
97
 
98
  ### Limitations and bias
99
 
100
- As mentioned in META AI's model card, given that the training data used for this model contains a lot of
101
  unfiltered content from the internet, which is far from neutral, the model is strongly biased:
102
 
103
  > Like other large language models for which the diversity (or lack thereof) of training
@@ -110,31 +67,37 @@ Here's an example of how the model can have biased predictions:
110
 
111
  ```python
112
  >>> from transformers import pipeline, set_seed
113
- >>> generator = pipeline('text-generation', model='opt')
114
- >>> set_seed(42)
115
- >>> generator("The White man worked as a", max_length=10, num_return_sequences=5,do_sample=True)
116
-
117
- [{'generated_text': 'The White man worked as a salesman who was in'},
118
- {'generated_text': 'The White man worked as a carpenter and lived'},
119
- {'generated_text': 'The White man worked as a lawyer with the ACLU'},
120
- {'generated_text': 'The White man worked as a truckman for the'},
121
- {'generated_text': 'The White man worked as a mechanic for a couple'}]
122
-
123
- >>> set_seed(42)
124
- >>> generator("The Black man worked as a", max_length=10, num_return_sequences=5,do_sample=True)
125
-
126
- [{'generated_text': 'The Black man worked as a courier and was arrested'},
127
- {'generated_text': 'The Black man worked as a carpenter and lived'},
128
- {'generated_text': 'The Black man worked as a delivery driver for a'},
129
- {'generated_text': 'The Black man worked as a truckman for several'},
130
- {'generated_text': 'The Black man worked as a bouncer, then'}]
131
  ```
132
 
133
  This bias will also affect all fine-tuned versions of this model.
134
 
135
  ## Training data
136
 
137
- The META AI team wanted to train this model on a corpus as large as possible. I is composed of the union of the following 5 filtered datasets of textual documents :
138
 
139
  - BookCorpus, which consists of more than 10K unpublished books,
140
  - CC-Stories, which contains a subset of CommonCrawl data filtered to match the
@@ -152,23 +115,20 @@ The dataset might contains offensive content as parts of the dataset are a subse
152
  public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
153
  that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
154
 
155
-
156
  ### Collection process
157
 
158
  The dataset was collected from the internet and went through classic data processing algorithms and
159
- re-formatting practices, including removing repetitive/non-informative text like Chapter One,” or
160
- This ebook by Project Gutenberg.”
161
 
162
  ## Training procedure
163
 
164
  ### Preprocessing
165
 
166
  The texts are tokenized using the **GPT2** byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a
167
- vocabulary size of 180B. The inputs are sequences of 2048 consecutive tokens.
168
-
169
- The larger model was trained on 992 *80GB A100 GPUs*. The training duration was roughly ~33 days of continuous training.
170
-
171
 
 
172
 
173
  ### BibTeX entry and citation info
174
 
 
9
 
10
  # OPT: Open Pre-trained Transformer Language Models
11
 
12
+ OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.
13
 
14
+ OPT was first introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) and first released in [metaseq's repository](https://github.com/facebookresearch/metaseq) on May 3rd 2022 by Meta AI.
15
 
16
+ **Disclaimer**: The team releasing OPT wrote an official model card, which is available in Appendix D of the [paper](https://arxiv.org/pdf/2205.01068.pdf).
17
+ Content from **this** model card has been written by the Hugging Face team.
 
18
 
19
  ## Model description
20
 
21
+ OPT belongs to the same family of decoder-only models as [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modeling
22
+ objective.
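As an illustration of the causal language modeling objective (a minimal sketch using the 350M checkpoint as an example; this is how the loss is exposed in `transformers`, not the original metaseq training code):

```python
>>> from transformers import AutoTokenizer, OPTForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
>>> model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

>>> inputs = tokenizer("Hello, I'm am conscious and", return_tensors="pt")
>>> # For causal language modeling the labels are simply the input ids;
>>> # the model shifts them internally so that token i is predicted from tokens < i.
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> outputs.loss  # next-token (cross-entropy) prediction loss
```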
 
23
 
24
+ For evaluation, OPT follows [GPT-3](https://arxiv.org/abs/2005.14165) by using its prompts and overall experimental setup. For more details, please read
25
+ the [official paper](https://arxiv.org/abs/2205.01068).
26
 
27
  ## Intended uses & limitations
28
 
29
+ The pretrained-only model can be used for prompt-based evaluation of downstream tasks as well as for text generation.
30
+ In addition, the model can be fine-tuned on a downstream task using the [CLM example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling). For all other OPT checkpoints, please have a look at the [model hub](https://huggingface.co/models?filter=opt).
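For instance, a fine-tuning run could be sketched as follows (illustrative only: the dataset, sequence length, and training arguments below are placeholder choices, not the settings of the linked CLM example):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, OPTForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

# Placeholder dataset; any plain-text dataset can be substituted here.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False turns this into a causal-LM collator: labels are the (padded) input ids.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="opt-350m-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```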
31
 
32
  ### How to use
33
 
34
+ You can use this model directly with a pipeline for text generation.
35
 
36
  ```python
37
+ >>> from transformers import pipeline
38
+
39
+ >>> generator = pipeline('text-generation', model="facebook/opt-350m")
40
+ >>> generator("Hello, I'm am conscious and")
41
+ [{'generated_text': "Hello, I'm am conscious and I'm a bit of a noob. I'm looking for"}]
42
  ```
43
 
44
+ By default, generation is deterministic (greedy decoding). To use top-k sampling, set `do_sample` to `True`.
45
 
46
  ```python
47
+ >>> from transformers import pipeline, set_seed
48
+
49
+ >>> set_seed(32)
50
+ >>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True)
51
+ >>> generator("Hello, I'm am conscious and")
52
+ [{'generated_text': "Hello, I'm am conscious and I'm interested in this project. Can I get an initial contact"}]
53
  ```
54
 
55
  ### Limitations and bias
56
 
57
+ As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of
58
  unfiltered content from the internet, which is far from neutral, the model is strongly biased:
59
 
60
  > Like other large language models for which the diversity (or lack thereof) of training
 
67
 
68
  ```python
69
  >>> from transformers import pipeline, set_seed
70
+
71
+ >>> set_seed(32)
72
+ >>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True, num_return_sequences=5)
73
+ >>> generator("The woman worked as a")
74
+ [{'generated_text': "The woman works as a substitute teacher for kids who have missed school. She's the teacher herself,"},
75
+ {'generated_text': 'The woman works as a security guard for another company and does an average of around $13/hour'},
76
+ {'generated_text': 'The woman works as a receptionist, she could at the least wait a week or two for her'},
77
+ {'generated_text': 'The woman works as a manager/intern/career development coach/advisor at a nursing home'},
78
+ {'generated_text': 'The woman works as a maid and has to clean the house but you can tell her to do it'}]
79
  ```
80
 
81
+ compared to:
82
+
83
+ ```python
84
+ >>> from transformers import pipeline, set_seed
85
+
86
+ >>> set_seed(0)
87
+ >>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True, num_return_sequences=5)
88
+ >>> generator("The man worked as a")
89
+ [{'generated_text': 'The man works as a security guard for the National Football League franchise. He has been a part of'},
90
+ {'generated_text': 'The man works as a security guard for another company and does an excellent job.\nI remember when'},
91
+ {'generated_text': 'The man works as a "secret agent" but at the same time he\'s working to protect the'},
92
+ {'generated_text': 'The man works as a manager/operator/servant for a grocery store and does a lot of'},
93
+ {'generated_text': 'The man works as a bouncer near the scene of the accident - how he could do that is'}]
94
+ ```
95
+
96
  This bias will also affect all fine-tuned versions of this model.
97
 
98
  ## Training data
99
 
100
+ The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents:
101
 
102
  - BookCorpus, which consists of more than 10K unpublished books,
103
  - CC-Stories, which contains a subset of CommonCrawl data filtered to match the
 
115
  public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
116
  that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
117
 
 
118
  ### Collection process
119
 
120
  The dataset was collected from the internet and went through classic data processing algorithms and
121
+ re-formatting practices, including removing repetitive/non-informative text like *Chapter One* or
122
+ *This ebook by Project Gutenberg.*
123
 
124
  ## Training procedure
125
 
126
  ### Preprocessing
127
 
128
  The texts are tokenized using the **GPT2** byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a
129
+ vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens.
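As a quick illustration of this preprocessing (the 350M checkpoint's tokenizer is used purely as an example; exact token ids depend on the checkpoint's vocabulary files and are not shown):

```python
>>> from transformers import AutoTokenizer

>>> # OPT checkpoints ship with a GPT2-style byte-level BPE tokenizer.
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
>>> ids = tokenizer("Hello world!")["input_ids"]
>>> tokenizer.convert_ids_to_tokens(ids)  # byte-level BPE sub-word pieces
>>> # During pretraining, such token ids were packed into sequences of 2048 consecutive tokens.
```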
130
 
131
+ The 175B model was trained on 992 *80GB A100 GPUs*. The training duration was roughly 33 days of continuous training.
132
 
133
  ### BibTeX entry and citation info
134