Marissa committed on
Commit
de4aec4
1 Parent(s): f5d4376

Add model card


This PR has a preliminary model card, open to any feedback! cc @Ezi @Meg @Nazneen

Files changed (1)
  1. README.md +234 -0
README.md ADDED
@@ -0,0 +1,234 @@
+ ---
+ language: en
+ license: mit
+ ---
+
+ # OpenAI GPT
+
+ ## Table of Contents
+ - [Model Details](#model-details)
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
+ - [Uses](#uses)
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Environmental Impact](#environmental-impact)
+ - [Technical Specifications](#technical-specifications)
+ - [Citation Information](#citation-information)
+ - [Model Card Authors](#model-card-authors)
+
+ ## Model Details
+
+ **Model Description:** `openai-gpt` is a transformer-based language model created and released by OpenAI. The model is a causal (unidirectional) transformer pre-trained with a language modeling objective on a large corpus with long-range dependencies.
+
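To make "causal (unidirectional)" concrete, here is a minimal sketch of the attention mask such a model relies on: position `i` may attend only to positions `j <= i`, so no token sees the future. The helper name `causal_mask` is illustrative and is not part of `transformers` or the original codebase.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len lower-triangular boolean mask.

    mask[i][j] is True when position i may attend to position j,
    that is, when j <= i (no attention to future tokens).
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Print the mask for a 4-token sequence, one row per query position.
    for row in causal_mask(4):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Each row has one more allowed position than the row above it, which is why the model can be trained to predict every next token in a sequence in a single pass.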
+ - **Developed by:** Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. See [associated research paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) and [GitHub repo](https://github.com/openai/finetune-transformer-lm) for model developers and contributors.
+ - **Model Type:** Transformer-based language model
+ - **Language(s):** English
+ - **License:** [MIT License](https://github.com/openai/finetune-transformer-lm/blob/master/LICENSE)
+ - **Related Models:** [GPT-2](https://huggingface.co/gpt2), [GPT2-Medium](https://huggingface.co/gpt2-medium), [GPT2-Large](https://huggingface.co/gpt2-large) and [GPT2-XL](https://huggingface.co/gpt2-xl)
+ - **Resources for more information:**
+   - [Research Paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
+   - [OpenAI Blog Post](https://openai.com/blog/language-unsupervised/)
+   - [GitHub Repo](https://github.com/openai/finetune-transformer-lm)
+   - Test the full generation capabilities here: https://transformer.huggingface.co/doc/gpt
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model. You can use it directly with a pipeline for text generation. Because generation involves sampling, we set a seed for reproducibility:
+
+ ```python
+ >>> from transformers import pipeline, set_seed
+ >>> generator = pipeline('text-generation', model='openai-gpt')
+ >>> set_seed(42)
+ >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
+
+ [{'generated_text': "Hello, I'm a language model,'he said, when i was finished.'ah well,'said the man,'that's"},
+ {'generated_text': 'Hello, I\'m a language model, " she said. \n she reached the bottom of the shaft and leaned a little further out. it was'},
+ {'generated_text': 'Hello, I\'m a language model, " she laughed. " we call that a\'white girl.\'or as we are called by the'},
+ {'generated_text': 'Hello, I\'m a language model, " said mr pin. " an\'the ones with the funny hats don\'t. " the rest of'},
+ {'generated_text': 'Hello, I\'m a language model, was\'ere \'bout to do some more dancin \', " he said, then his voice lowered to'}]
+ ```
+
+ Here is how to use this model in PyTorch:
+
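+ ```python
+ import torch
+ from transformers import OpenAIGPTTokenizer, OpenAIGPTModel
+
+ # Load the pretrained tokenizer and base model (no task-specific head).
+ tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
+ model = OpenAIGPTModel.from_pretrained("openai-gpt")
+
+ # Tokenize the text and return PyTorch tensors.
+ inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+
+ # Inference only, so skip gradient tracking.
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Final-layer hidden states, shape (batch_size, sequence_length, hidden_size).
+ last_hidden_states = outputs.last_hidden_state
+ ```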
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import OpenAIGPTTokenizer, TFOpenAIGPTModel
+
+ # Load the pretrained tokenizer and the TensorFlow version of the base model.
+ tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
+ model = TFOpenAIGPTModel.from_pretrained("openai-gpt")
+
+ # Tokenize the text and return TensorFlow tensors.
+ inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
+ outputs = model(inputs)
+
+ # Final-layer hidden states, shape (batch_size, sequence_length, hidden_size).
+ last_hidden_states = outputs.last_hidden_state
+ ```
+
+ ## Uses
+
+ #### Direct Use
+
+ This model can be used for language modeling tasks.
+
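One common language modeling use is scoring text, usually reported as perplexity. The sketch below shows only the final arithmetic, assuming you already have per-token natural-log probabilities. The function name `perplexity` and the sample values are hypothetical. In `transformers`, such log-probabilities could plausibly be derived from the logits of a language-modeling head such as `OpenAIGPTLMHeadModel`, but that wiring is an assumption and is not shown here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a sequence from its per-token natural-log probabilities.

    Computed as exp of the negative mean log-probability.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only:
    # every token was assigned probability 0.25.
    print(perplexity([math.log(0.25)] * 4))
```

Lower perplexity means the model found the text less surprising. A model that assigns probability 0.25 to every token has a perplexity of about 4.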
+ #### Downstream Use
+
+ Potential downstream uses of this model include tasks that leverage language models. In the [associated paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), the model developers discuss evaluations of the model for tasks including natural language inference (NLI), question answering, semantic similarity, and text classification.
+
+ #### Misuse and Out-of-scope Use
+
+ The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for this model.
+
+ ## Risks, Limitations and Biases
+
+ #### Biases
+
+ **CONTENT WARNING: Readers should be aware that language generated by this model can be disturbing or offensive to some and can propagate historical and current stereotypes.**
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
+ Predictions generated by this model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. For example:
+
+ ```python
+ >>> from transformers import pipeline, set_seed
+ >>> generator = pipeline('text-generation', model='openai-gpt')
+ >>> set_seed(42)
+ >>> generator("The man worked as a", max_length=10, num_return_sequences=5)
+
+ [{'generated_text': 'The man worked as a teacher for the college he'},
+ {'generated_text': 'The man worked as a janitor at the club.'},
+ {'generated_text': 'The man worked as a bodyguard in america. the'},
+ {'generated_text': 'The man worked as a clerk for one of the'},
+ {'generated_text': 'The man worked as a nurse, but there was'}]
+
+ >>> set_seed(42)
+ >>> generator("The woman worked as a", max_length=10, num_return_sequences=5)
+
+ [{'generated_text': 'The woman worked as a medical intern but is a'},
+ {'generated_text': 'The woman worked as a midwife, i know that'},
+ {'generated_text': 'The woman worked as a prostitute in a sex club'},
+ {'generated_text': 'The woman worked as a secretary for one of the'},
+ {'generated_text': 'The woman worked as a nurse, but she had'}]
+ ```
+
+ This bias may also affect fine-tuned versions of this model. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
+
+ #### Risks and Limitations
+
+ The model developers also discussed the model's risks and limitations in a [blog post](https://openai.com/blog/language-unsupervised/), including:
+
+ > - **Compute Requirements:** Many previous approaches to NLP tasks train relatively small models on a single GPU from scratch. Our approach requires an expensive pre-training step - 1 month on 8 GPUs. Luckily, this only has to be done once and we’re releasing our model so others can avoid it. It is also a large model (in comparison to prior work) and consequently uses more compute and memory — we used a 37-layer (12 block) Transformer architecture, and we train on sequences of up to 512 tokens. Most experiments were conducted on 4 and 8 GPU systems. The model does fine-tune to new tasks very quickly which helps mitigate the additional resource requirements.
+ > - **The limits and bias of learning about the world through text:** Books and text readily available on the internet do not contain complete or even accurate information about the world. Recent work ([Lucy and Gauthier, 2017](https://arxiv.org/abs/1705.11168)) has shown that certain kinds of information are difficult to learn via just text and other work ([Gururangan et al., 2018](https://arxiv.org/abs/1803.02324)) has shown that models learn and exploit biases in data distributions.
+ > - **Still brittle generalization:** Although our approach improves performance across a broad range of tasks, current deep learning NLP models still exhibit surprising and counterintuitive behavior - especially when evaluated in a systematic, adversarial, or out-of-distribution way. Our approach is not immune to these issues, though we have observed some indications of progress. Our approach shows improved lexical robustness over previous purely neural approaches to textual entailment. On the dataset introduced in Glockner et al. (2018) our model achieves 83.75%, performing similarly to KIM, which incorporates external knowledge via WordNet.
+
+ ## Training
+
+ #### Training Data
+
+ The model developers [write](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf):
+
+ > We use the BooksCorpus dataset ([Zhu et al., 2015](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Zhu_Aligning_Books_and_ICCV_2015_paper.pdf)) for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.
+
+ #### Training Procedure
+
+ The model developers [write](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf):
+
+ > Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N (0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library2 to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.
+
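The learning-rate schedule described in the quote (linear warmup over the first 2000 updates to a peak of 2.5e-4, then cosine annealing to 0) can be sketched as follows. The paper excerpt does not give the exact formula or the total number of updates, so the half-cosine form and the `total_steps` argument are assumptions, and `learning_rate` is an illustrative name.

```python
import math

MAX_LR = 2.5e-4      # peak learning rate, from the quoted text
WARMUP_STEPS = 2000  # linear warmup length, from the quoted text


def learning_rate(step, total_steps, max_lr=MAX_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup from 0 to max_lr, then cosine annealing back to 0.

    total_steps is a hypothetical training length supplied by the caller.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical 10,000-update run, sampled at a few points.
    for step in (0, 1000, 2000, 6000, 10000):
        print(step, learning_rate(step, total_steps=10000))
```

The warmup keeps early Adam updates small while its moment estimates are still noisy, and the cosine decay lowers the step size smoothly toward the end of training.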
+ See the paper for further details and links to citations.
+
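As a rough sanity check on the architecture numbers quoted above (12 layers, 768-dimensional states, 3072-dimensional feed-forward inner states, sequences of 512 tokens), the sketch below estimates the parameter count. It rests on several assumptions that the excerpt does not state: a vocabulary of about 40,000 entries (the quote gives 40,000 merges, so the real vocabulary is somewhat larger), biases on every projection, two layer norms per block, and an output layer that reuses the token embedding matrix. The function name is illustrative.

```python
def transformer_param_count(n_layers=12, d_model=768, d_ff=3072,
                            vocab_size=40_000, n_positions=512):
    """Approximate parameter count of a decoder-only transformer."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Self-attention: Q, K, V and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a gain and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + feed_forward + layer_norms
    return embeddings + n_layers * per_block


if __name__ == "__main__":
    print(f"{transformer_param_count():,} parameters (approximate)")
```

Under these assumptions the estimate comes to roughly 116 million parameters, most of them in the 12 transformer blocks rather than in the embeddings.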
+ ## Evaluation
+
+ The following evaluation information is extracted from the [associated blog post](https://openai.com/blog/language-unsupervised/). See the [associated paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) for further details.
+
+ #### Testing Data, Factors and Metrics
+
+ The model developers report that the model was evaluated on the following tasks and datasets using the listed metrics:
+
+ - **Task:** Textual Entailment
+   - **Datasets:** [SNLI](https://huggingface.co/datasets/snli), [MNLI Matched](https://huggingface.co/datasets/glue), [MNLI Mismatched](https://huggingface.co/datasets/glue), [SciTail](https://huggingface.co/datasets/scitail), [QNLI](https://huggingface.co/datasets/glue), [RTE](https://huggingface.co/datasets/glue)
+   - **Metrics:** Accuracy
+
+ - **Task:** Semantic Similarity
+   - **Datasets:** [STS-B](https://huggingface.co/datasets/glue), [QQP](https://huggingface.co/datasets/glue), [MRPC](https://huggingface.co/datasets/glue)
+   - **Metrics:** Pearson correlation (STS-B), F1 (QQP, MRPC)
+
+ - **Task:** Reading Comprehension
+   - **Datasets:** [RACE](https://huggingface.co/datasets/race)
+   - **Metrics:** Accuracy
+
+ - **Task:** Commonsense Reasoning
+   - **Datasets:** [ROCStories](https://huggingface.co/datasets/story_cloze), [COPA](https://huggingface.co/datasets/xcopa)
+   - **Metrics:** Accuracy
+
+ - **Task:** Sentiment Analysis
+   - **Datasets:** [SST-2](https://huggingface.co/datasets/glue)
+   - **Metrics:** Accuracy
+
+ - **Task:** Linguistic Acceptability
+   - **Datasets:** [CoLA](https://huggingface.co/datasets/glue)
+   - **Metrics:** Matthews correlation
+
+ - **Task:** Multi Task Benchmark
+   - **Datasets:** [GLUE](https://huggingface.co/datasets/glue)
+   - **Metrics:** Accuracy
+
+ #### Results
+
+ After supervised fine-tuning on each task, the model achieves the following results:
+
+ | Task | TE | TE | TE | TE | TE | TE | SS | SS | SS | RC | CR | CR | SA | LA | MTB |
+ |:-------:|:----:|:------------:|:---------------:|:-------:|:----:|:----:|:-----:|:----:|:----:|:----:|:----------:|:----:|:-----:|:----:|:----:|
+ | Dataset | SNLI | MNLI Matched | MNLI Mismatched | SciTail | QNLI | RTE | STS-B | QQP | MRPC | RACE | ROCStories | COPA | SST-2 | CoLA | GLUE |
+ | Score | 89.9 | 82.1 | 81.4 | 88.3 | 88.1 | 56.0 | 82.0 | 70.3 | 82.3 | 59.0 | 86.5 | 78.6 | 91.3 | 45.4 | 72.8 |
+
+ ## Environmental Impact
+
+ The model developers [report that](https://openai.com/blog/language-unsupervised/):
+
+ > The total compute used to train this model was 0.96 petaflop days (pfs-days).
+
+ > 8 P600 GPU's * 30 days * 12 TFLOPS/GPU * 0.33 utilization = .96 pfs-days
+
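The arithmetic in the quoted compute estimate can be checked with a few lines of Python. The function name `petaflop_s_days` is illustrative. All inputs come from the quote, except the 1/3 utilization value, which is included only to show how the rounded 0.33 relates to the reported total.

```python
def petaflop_s_days(n_gpus, days, tflops_per_gpu, utilization):
    """Total compute in petaflop/s-days.

    GPUs x days x TFLOPS per GPU x utilization gives teraflop/s-days;
    dividing by 1000 converts to petaflop/s-days.
    """
    return n_gpus * days * tflops_per_gpu * utilization / 1000.0


if __name__ == "__main__":
    # Utilization exactly as quoted (0.33).
    print(round(petaflop_s_days(8, 30, 12, 0.33), 2))
    # Utilization of one third, which reproduces the quoted 0.96 total.
    print(round(petaflop_s_days(8, 30, 12, 1 / 3), 2))
    # 30 days expressed in hours.
    print(30 * 24)
```

With utilization taken as exactly 0.33 the product is about 0.95 pfs-days. The quoted 0.96 figure matches a utilization of one third, so the 0.33 in the quote is presumably a rounded value. The 30 days of training correspond to 720 hours.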
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** 8 P600 GPUs
+ - **Hours used:** 720 hours (30 days)
+ - **Cloud Provider:** Unknown
+ - **Compute Region:** Unknown
+ - **Carbon Emitted:** Unknown
+
+ ## Technical Specifications
+
+ See the [associated paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) for details on the modeling architecture, objective, compute infrastructure, and training details.
+
+ ## Citation Information
+
+ ```bibtex
+ @article{radford2018improving,
+ title={Improving language understanding by generative pre-training},
+ author={Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya and others},
+ year={2018},
+ publisher={OpenAI}
+ }
+ ```
+
+ APA:
+ *Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.*
+
+ ## Model Card Authors
+
+ This model card was written by the Hugging Face team.