shanearora committed
Commit 1ce116f · verified · 1 parent: 1f42066

Update README.md

Files changed (1):
  1. README.md (+32 -32)

README.md CHANGED
@@ -113,7 +113,7 @@ For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
- Core model results for the new and original 7B model are found below.
+ Core model results for OLMo 7B models are found below.
 
 | Task | Llama-7b | Llama2-7b | Falcon-7b | Mpt-7b | OLMo-7B | Llama2-13b | OLMo 7B April 2024 | **OLMo 7B July 2024** |
 |-------------------|----------|-----------|-----------|--------|---------|------------|--------------------|-----------------------|
@@ -131,9 +131,9 @@ Core model results for the new and original 7B model are found below.
 | GSM8k | 10.0 | 12.0 | 4.0 | 4.5 | 8.5 | 25.0 | 29.0 | 35.0 |
 | Full average | 60.3 | 62.1 | 59.2 | 59.3 | 59.8 | 66.2 | 63.8 | 64.2 |
 
- And for the 1B model:
+ And for 1B models:
 
- | task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
+ | task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1.0 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
 | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | ----------------- | --------- | -------------------------------------- | ------- | ------ |
 | arc_challenge | 25 | 43.81 | 33.11 | 34.78 | 34.45 | 36.5 |
 | arc_easy | 25 | 63.68 | 50.18 | 53.16 | 58.07 | 55.3 |
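Scores like arc_challenge and arc_easy in the tables above are zero-shot multiple-choice accuracies, conventionally computed by ranking each answer option by the likelihood the model assigns to it. As a rough illustration of that protocol (not the harness used to produce these numbers), the sketch below scores a single ARC-style question with one of the listed checkpoints; the model choice, the example question, and the helper function are assumptions made for the demonstration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-1B-hf"  # any causal LM from the table could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Illustrative ARC-style item, not taken from the benchmark itself.
question = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
options = [" Oxygen", " Carbon dioxide", " Nitrogen", " Hydrogen"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to the continuation tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    num_continuation_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logps[0, -num_continuation_tokens:].sum().item()

scores = {option: continuation_logprob(question, option) for option in options}
print(max(scores, key=scores.get))  # the option the model finds most likely
```

Averaging this kind of prediction over a benchmark's test set yields accuracies of the sort reported above; substituting another checkpoint only requires changing `model_name`.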
@@ -167,22 +167,22 @@ Both stages contribute equally to the final performance of the OLMo model. After
 
 OLMo 7B architecture with peer models for comparison.
 
- | | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
- |------------------------|-------------------|---------------------|--------------------|--------------------|------------------|
- | d_model | 4096 | 4096 | 4096 | 4544 | 4096 |
- | num heads | 32 | 32 | 32 | 71 | 16 |
- | num layers | 32 | 32 | 32 | 32 | 32 |
- | MLP ratio | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
- | LayerNorm type | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
- | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE |
- | attention variant | full | GQA | full | MQA | MQA |
- | biases | none | none | in LN only | in LN only | none |
- | block type | sequential | sequential | sequential | parallel | parallel |
- | activation | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
- | sequence length | 2048 | 4096 | 2048 | 2048 | 2048 |
- | batch size (instances) | 2160 | 1024 | 2048 | 2304 | 512 |
- | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~1M |
- | weight tying | no | no | no | no | yes |
+ | | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
+ |------------------------|-----------------------|-------------------|---------------------|--------------------|--------------------|------------------|
+ | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
+ | num heads | 32 | 32 | 32 | 32 | 71 | 16 |
+ | num layers | 32 | 32 | 32 | 32 | 32 | 32 |
+ | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
+ | LayerNorm type | non-parametric LN | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
+ | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
+ | attention variant | full | full | GQA | full | MQA | MQA |
+ | biases | none | none | none | in LN only | in LN only | none |
+ | block type | sequential | sequential | sequential | sequential | parallel | parallel |
+ | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
+ | sequence length | 4096 | 2048 | 4096 | 2048 | 2048 | 2048 |
+ | batch size (instances) | 1024 | 2160 | 1024 | 2048 | 2304 | 512 |
+ | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
+ | weight tying | no | no | no | no | no | yes |
 
 
 ### Hyperparameters
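To make the **OLMo 7B July 2024** column of the architecture table above concrete, here is a rough sketch of that shape expressed as a transformers-style configuration. The field names assume the Llama-like `OlmoConfig` in recent transformers releases, and `intermediate_size` is inferred from the ~8/3 MLP ratio rather than read off the table; details with no config field here (non-parametric LayerNorm, bias-free layers, sequential blocks) live in the OLMo modeling code itself.

```python
from transformers import OlmoConfig

# Sketch of the "OLMo 7B July 2024" column; intermediate_size is an assumption
# (~8/3 * 4096, rounded to a hardware-friendly size), not a value from the table.
config = OlmoConfig(
    hidden_size=4096,               # d_model
    num_attention_heads=32,         # full (non-grouped) attention
    num_hidden_layers=32,
    intermediate_size=11008,        # assumed from the ~8/3 MLP ratio
    hidden_act="silu",              # SwiGLU activation
    max_position_embeddings=4096,   # sequence length of the July 2024 model
    tie_word_embeddings=False,      # no weight tying
)
print(config)
```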
@@ -192,23 +192,23 @@ AdamW optimizer parameters are shown below.
 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
 | 1B | 4.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
- | 7B | 3.0E-4 | (0.9, 0.99) | 1.0E-5 | 0.1 |
+ | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
 
 Optimizer settings comparison with peer models.
 
- | | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
+ | | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|---------------------|--------------------|--------------------|
- | warmup steps | 5000 | 2000 | 2000 | 1000 |
- | peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
- | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
- | weight decay | 0.1 | 0.1 | 0.1 | 0.1 |
- | beta1 | 0.9 | 0.9 | 0.9 | 0.99 |
- | beta2 | 0.95 | 0.95 | 0.95 | 0.999 |
- | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
- | LR schedule | linear | cosine | cosine | cosine |
- | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
- | gradient reduce dtype | FP32 | FP32 | FP32 | BF16 |
- | optimizer state dtype | FP32 | most likely FP32 | FP32 | FP32 |
+ | warmup steps | 2500 | 5000 | 2000 | 2000 | 1000 |
+ | peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
+ | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
+ | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
+ | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
+ | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
+ | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
+ | LR schedule | cosine | linear | cosine | cosine | cosine |
+ | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
+ | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
+ | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |
 
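Taken together, the two tables above pin down the 7B optimizer setup for the July 2024 run: AdamW with betas (0.9, 0.95), epsilon 1.0E-5 and weight decay 0.1, a linear warmup of 2500 steps into a peak LR of 3.0E-4, cosine decay to a minimum LR of 3.0E-5, and gradient clipping at a global norm of 1.0. A minimal sketch of those settings in plain PyTorch follows; the total step count and the stand-in `model` are illustrative assumptions rather than values from the tables.

```python
import math
import torch

# Values from the 7B rows above; total_steps and the stand-in `model` are
# illustrative assumptions, not numbers taken from the tables.
peak_lr, min_lr = 3.0e-4, 3.0e-5
warmup_steps, total_steps = 2500, 500_000

model = torch.nn.Linear(4096, 4096)  # placeholder for the actual network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), eps=1.0e-5, weight_decay=0.1
)

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to the minimum LR."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Inside each training step, before calling optimizer.step():
for group in optimizer.param_groups:
    group["lr"] = lr_at(step=1_000)                      # example step index
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global clipping at norm 1.0
```

In a real training step the schedule would be evaluated once per optimizer step, with the clipping applied after the backward pass and before `optimizer.step()`.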
 