hails committed
Commit 55ab81b
Parent: 0370d70

Update README.md

Files changed (1)
  1. README.md +14 -39
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
  - pytorch
  - causal-lm
  - code-generation
+ - The Pile
 
 
  license: apache-2.0
@@ -29,55 +30,28 @@ This is a preliminary release of an experimental artifact and should be treated
 
 
  | Hyperparameter | Value |
-
-
  |----------------------|----------------------------------------------------------------------|
-
-
  | \\(n_{parameters}\\) | 1,331,810,304 |
-
-
  | \\(n_{layers}\\) | 24 |
-
-
- | \\(d_{model}\\) | 2,048 |
-
-
- | \\(d_{ff}\\) | 8,192 |
-
-
+ | \\(d_{model}\\) | 2048 |
+ | \\(d_{ff}\\) | 8192 |
  | \\(n_{heads}\\) | 16 |
-
-
  | \\(d_{head}\\) | 128 |
+ | \\(n_{ctx}\\) | 2048 |
+ | \\(n_{vocab}\\) | 50254 |
+ | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
 
 
- | \\(n_{ctx}\\) | 2,048 |
-
 
- | \\(n_{vocab}\\) | 50256 |
+ The model consists of 24 transformer layers with a hidden dimension of 2048, and a feedforward intermediate dimension of 8192. The hidden dimension is split into 16 heads for self-attention, each with a dimension of 128. Rotary Position Embedding (RoPE) is used.
 
 
- | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
-
-
-
-
-
-
-
- The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
-
-
- dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is used.
-
-
- The model is trained with the same tokenizer as GPT-NeoX-20b (link here), for a vocabulary of 50254 tokens.
+ The model is trained with the same tokenizer as [GPT-NeoX-20b](https://arxiv.org/abs/2204.06745), for a vocabulary size of 50254 tokens.
 
 
  ## Training Data
 
- The model was trained on the Pile, an 800Gb dataset composed of varied web corpora. The datasheet and paper for the Pile can be found [here] and [here] respectively
+ The model was trained on the Pile, an 800Gb dataset composed of varied web corpora. The datasheet and paper for the Pile can be found [here](https://arxiv.org/abs/2201.07311) and [here](https://arxiv.org/abs/2101.00027) respectively.
 
 
  ## Training Details
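
For reference, the hyperparameter table in this hunk can be read back from the hosted checkpoint. This is a minimal sketch only; it assumes the `CarperAI/FIM-1.3b` repo id used later in this README and GPT-NeoX-style config attribute names.

```python
# Minimal sketch: read the architecture values back from the hosted config.
# Attribute names assume a GPT-NeoX-style config (an assumption, not from the diff).
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("CarperAI/FIM-1.3b")
tokenizer = AutoTokenizer.from_pretrained("CarperAI/FIM-1.3b")

print(config.num_hidden_layers)        # n_layers, expected 24
print(config.hidden_size)              # d_model, expected 2048
print(config.intermediate_size)        # d_ff, expected 8192
print(config.num_attention_heads)      # n_heads, expected 16
print(config.max_position_embeddings)  # n_ctx, expected 2048
print(len(tokenizer))                  # vocabulary size, expected ~50254
```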
@@ -88,8 +62,7 @@ Following Bavarian et al. 2022, we train the model to additionally perform infil
 
  Middle segments “to infill” were selected uniformly at random from contexts at the character level, and these contexts were then reformatted as
 
-
- <SUF> {last 1/3rd of the context} <PRE> {first 1/3rd of the context} <MID> {middle 1/3rd of the context} <EOD>
+ \<SUF\> {last 1/3rd of the context} \<PRE\> {first 1/3rd of the context} \<MID\> {middle 1/3rd of the context} \<EOD\>
 
 
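
A minimal sketch of the reformatting described in this hunk, assuming the sentinels are the literal `<SUF>`/`<PRE>`/`<MID>`/`<EOD>` strings shown above and that whitespace around them follows the format line; the middle span below is chosen from two uniformly random character positions.

```python
import random

# Sketch of the described transformation: pick a middle span uniformly at
# random at the character level, then emit suffix, prefix, and middle
# delimited by the sentinel strings shown above. Exact whitespace handling
# around the sentinels is an assumption.
def to_fim_example(context: str, rng: random.Random) -> str:
    i, j = sorted(rng.sample(range(len(context) + 1), 2))
    prefix, middle, suffix = context[:i], context[i:j], context[j:]
    return f"<SUF> {suffix} <PRE> {prefix} <MID> {middle} <EOD>"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```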
 
@@ -118,11 +91,11 @@ model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-1.3b")
 
  Suppose we have some text that we would like to perform infilling on at a certain “cursor location”.
 
- This would have the form {some prelude text here} <INFILLING LOCATION> {some text following cursor}.
+ This would have the form {some prelude text here} \<INFILLING LOCATION\> {some text following cursor}.
 
  The way to perform infilling generation would be via placing the input text into this format:
 
- <SUF> {some text following cursor} <PRE> {some prelude text here} <MID> ... language model output is generated after <MID> token!
+ \<SUF\> {some text following cursor} \<PRE\> {some prelude text here} \<MID\> ... language model output is generated after \<MID\> token!
 
 
  ## Intended Uses and Limitations
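
A hedged usage sketch of the inference format in this hunk, using the `CarperAI/FIM-1.3b` checkpoint named in the hunk header; it assumes the sentinels are plain strings the tokenizer can encode, and the whitespace and generation settings are illustrative guesses rather than the card's prescribed values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of infilling at a cursor location using the prompt format above.
# Sentinel handling and sampling settings are assumptions, not from the diff.
tokenizer = AutoTokenizer.from_pretrained("CarperAI/FIM-1.3b")
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-1.3b")

prefix = "def fib(n):\n    if n < 2:\n        return n\n    "
suffix = "\n"

prompt = f"<SUF> {suffix} <PRE> {prefix} <MID>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=True, top_p=0.95)

# The infilled middle is whatever the model emits after the <MID> token.
middle = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print(middle)
```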
@@ -156,3 +129,5 @@ We also perform preliminary investigation on code generation and infilling capab
 
 
 
+
+