Commit aa24ede ("Update README.md"), committed by J38; 1 parent: a5c5c89

Files changed (1): README.md (+0, -13)

README.md CHANGED

@@ -10,7 +10,6 @@ As an autoregressive language model, PubMed GPT 2.7B is also capable of natural

This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.edu/) and [MosaicML](https://www.mosaicml.com/).

# Table of Contents

- [Model Card for Pubmed GPT 2.7B](#model-card-for--model_id-)

@@ -32,8 +31,6 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed

- [Model Architecture and Objective](#model-architecture-and-objective)
- [Compute Infrastructure](#compute-infrastructure)

# Model Details

## Model Description

@@ -61,21 +58,18 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed

It is possible to use this model to generate text, which is useful for experimentation and for understanding its capabilities. It should not be used directly for production or for work that may directly impact people.
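
For quick experimentation, the model can be loaded through the Hugging Face `transformers` library. The sketch below is a minimal example rather than an officially supported recipe: the checkpoint id `stanford-crfm/pubmedgpt` is an assumption (substitute the id this repository is actually published under), and generations are for exploration only.

```python
# Minimal generation sketch for experimentation only (not for production use).
# NOTE: the checkpoint id below is an assumption; replace it with the actual
# Hugging Face id of this repository if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/pubmedgpt"  # assumed id for PubMed GPT 2.7B
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

prompt = "Photosynthesis is the process by which"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,   # keep generations short for quick inspection
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```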

## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The main way we have used this model is by fine-tuning it for downstream question answering tasks, and we recommend using the model that way.
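
As a rough illustration of that workflow (not the exact recipe used for the reported results), the sketch below fine-tunes the model on a toy prompt/answer dataset with the `transformers` Trainer; the checkpoint id, data, and hyperparameters are placeholder assumptions.

```python
# Toy causal-LM fine-tuning sketch for QA-style data. This is an illustrative
# assumption, not the authors' published fine-tuning setup; the checkpoint id,
# example data, and hyperparameters are placeholders.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "stanford-crfm/pubmedgpt"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Placeholder question/answer pairs rendered as single training strings.
examples = [
    ("What does metformin treat?", "Metformin is used to treat type 2 diabetes."),
    ("What is photosynthesis?", "Photosynthesis converts light energy into chemical energy."),
]

class QADataset(Dataset):
    def __init__(self, pairs):
        texts = [f"Question: {q}\nAnswer: {a}{tokenizer.eos_token}" for q, a in pairs]
        self.encodings = tokenizer(texts, truncation=True, max_length=512)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pubmedgpt-qa-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=QADataset(examples),
    # mlm=False gives standard next-token (causal LM) labels with padding masked out.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```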

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

We do not recommend using this model for natural language generation in a production environment, fine-tuned or otherwise.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

@@ -83,13 +77,11 @@ We do not recommend using this model for natural language generation in a produc

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

While this model is capable of generating natural language text, we have only begun to explore this capability and its limitations. Understanding these limitations is especially important in a domain like medicine. Therefore, **we strongly recommend against using this model in production for natural language generation.**

# Training Details

## Training Data

@@ -98,7 +90,6 @@ While this model is capable of generating natural language text, we have only be

This model was trained on the PubMed Abstracts and Full Text from [The Pile](https://pile.eleuther.ai/).

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

@@ -115,17 +106,14 @@ The model was trained on [MosaicML Cloud](https://www.mosaicml.com/cloud), a pla

The training process was very smooth and did not suffer from any divergences.

As we were preparing the training run, we were unsure of the benefits of training out to 300B tokens for language model perplexity and downstream task performance. While most models of this scale (e.g., GPT-Neo 2.7B) are trained to 300-400B tokens, the datasets those models use are vastly larger than PubMed. For instance, The Pile is 8x the size of its PubMed subcorpora.

Fortunately, we continued to see steady perplexity improvements on the validation and training sets for the entirety of training, and preliminary experiments showed improved downstream task performance as we trained out to the full 300B tokens. Our takeaway was that it was indeed worth training for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models.

### Preprocessing

The model uses a custom tokenizer trained on the PubMed Abstracts. When building domain-specific models, we have found it important to use a tokenizer trained on in-domain text to maximize performance on downstream tasks. A key benefit is that common biomedical terms are represented as entire tokens.

For instance, all of the following terms are tokenized into single tokens by the biomedical tokenizer but into multiple tokens by the standard GPT-2 tokenizer:

@@ -137,7 +125,6 @@ For instance, all of these following terms are tokenized into single tokens by t

| photosynthesis | photos/ynthesis |
| probiotic | prob/iotic |

This allows the model to encode information about these concepts in their own token representations, rather than spreading that information across subword tokens like “oh” that are shared with many other terms.
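
As a quick way to see this difference in practice, the two tokenizers can be compared side by side with `transformers`. This is a hedged sketch: the biomedical tokenizer id `stanford-crfm/pubmedgpt` is an assumption, and the expectation (per the table above) is a single token from the biomedical tokenizer versus a two-piece split from GPT-2.

```python
# Compare how the in-domain tokenizer and the standard GPT-2 tokenizer split
# biomedical terms. The biomedical tokenizer id is an assumption; substitute
# the actual id of this repository if it differs.
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
biomed_tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/pubmedgpt")  # assumed id

for term in ["photosynthesis", "probiotic"]:
    print(f"{term}:")
    print("  GPT-2      ->", gpt2_tokenizer.tokenize(term))
    print("  biomedical ->", biomed_tokenizer.tokenize(term))
```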

# Environmental Impact
 