shanearora committed
Commit 1ce116f · verified · 1 parent: 1f42066

Update README.md

Files changed (1):
  1. README.md (+32 -32)

README.md CHANGED
@@ -113,7 +113,7 @@ For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
- Core model results for the new and original 7B model are found below.
+ Core model results for OLMo 7B models are found below.
 
 | Task | Llama-7b | Llama2-7b | Falcon-7b | Mpt-7b | OLMo-7B | Llama2-13b | OLMo 7B April 2024 | **OLMo 7B July 2024** |
 |-------------------|----------|-----------|-----------|--------|---------|------------|--------------------|-----------------------|
@@ -131,9 +131,9 @@ Core model results for the new and original 7B model are found below.
 | GSM8k | 10.0 | 12.0 | 4.0 | 4.5 | 8.5 | 25.0 | 29.0 | 35.0 |
 | Full average | 60.3 | 62.1 | 59.2 | 59.3 | 59.8 | 66.2 | 63.8 | 64.2 |
 
- And for the 1B model:
+ And for 1B models:
 
- | task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
+ | task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1.0 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
 | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | ----------------- | --------- | -------------------------------------- | ------- | ------ |
 | arc_challenge | 25 | 43.81 | 33.11 | 34.78 | 34.45 | 36.5 |
 | arc_easy | 25 | 63.68 | 50.18 | 53.16 | 58.07 | 55.3 |
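Scores like arc_challenge and arc_easy in the tables above are zero-shot multiple-choice accuracies, conventionally computed by ranking each answer option by the likelihood the model assigns to it. As a rough illustration of that protocol (not the harness used to produce these numbers), the sketch below scores a single ARC-style question with one of the listed checkpoints; the model choice, the example question, and the helper function are assumptions made for the demonstration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-1B-hf"  # any causal LM from the table could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Illustrative ARC-style item, not taken from the benchmark itself.
question = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
options = [" Oxygen", " Carbon dioxide", " Nitrogen", " Hydrogen"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to the continuation tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    num_continuation_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logps[0, -num_continuation_tokens:].sum().item()

scores = {option: continuation_logprob(question, option) for option in options}
print(max(scores, key=scores.get))  # the option the model finds most likely
```

Averaging this kind of prediction over a benchmark's test set yields accuracies of the sort reported above; substituting another checkpoint only requires changing `model_name`.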
@@ -167,22 +167,22 @@ Both stages contribute equally to the final performance of the OLMo model. After
 
 OLMo 7B architecture with peer models for comparison.
 
- | | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
- |------------------------|-------------------|---------------------|--------------------|--------------------|------------------|
- | d_model | 4096 | 4096 | 4096 | 4544 | 4096 |
- | num heads | 32 | 32 | 32 | 71 | 16 |
- | num layers | 32 | 32 | 32 | 32 | 32 |
- | MLP ratio | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
- | LayerNorm type | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
- | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE |
- | attention variant | full | GQA | full | MQA | MQA |
- | biases | none | none | in LN only | in LN only | none |
- | block type | sequential | sequential | sequential | parallel | parallel |
- | activation | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
- | sequence length | 2048 | 4096 | 2048 | 2048 | 2048 |
- | batch size (instances) | 2160 | 1024 | 2048 | 2304 | 512 |
- | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~1M |
- | weight tying | no | no | no | no | yes |
+ | | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
+ |------------------------|-----------------------|-------------------|---------------------|--------------------|--------------------|------------------|
+ | d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
+ | num heads | 32 | 32 | 32 | 32 | 71 | 16 |
+ | num layers | 32 | 32 | 32 | 32 | 32 | 32 |
+ | MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
+ | LayerNorm type | non-parametric LN | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
+ | pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
+ | attention variant | full | full | GQA | full | MQA | MQA |
+ | biases | none | none | none | in LN only | in LN only | none |
+ | block type | sequential | sequential | sequential | sequential | parallel | parallel |
+ | activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
+ | sequence length | 4096 | 2048 | 4096 | 2048 | 2048 | 2048 |
+ | batch size (instances) | 1024 | 2160 | 1024 | 2048 | 2304 | 512 |
+ | batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
+ | weight tying | no | no | no | no | no | yes |
 
 
 ### Hyperparameters
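To make the **OLMo 7B July 2024** column of the architecture table above concrete, here is a rough sketch of that shape expressed as a transformers-style configuration. The field names assume the Llama-like `OlmoConfig` in recent transformers releases, and `intermediate_size` is inferred from the ~8/3 MLP ratio rather than read off the table; details with no config field here (non-parametric LayerNorm, bias-free layers, sequential blocks) live in the OLMo modeling code itself.

```python
from transformers import OlmoConfig

# Sketch of the "OLMo 7B July 2024" column; intermediate_size is an assumption
# (~8/3 * 4096, rounded to a hardware-friendly size), not a value from the table.
config = OlmoConfig(
    hidden_size=4096,               # d_model
    num_attention_heads=32,         # full (non-grouped) attention
    num_hidden_layers=32,
    intermediate_size=11008,        # assumed from the ~8/3 MLP ratio
    hidden_act="silu",              # SwiGLU activation
    max_position_embeddings=4096,   # sequence length of the July 2024 model
    tie_word_embeddings=False,      # no weight tying
)
print(config)
```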
@@ -192,23 +192,23 @@ AdamW optimizer parameters are shown below.
 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
 | 1B | 4.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
- | 7B | 3.0E-4 | (0.9, 0.99) | 1.0E-5 | 0.1 |
+ | 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
 
 Optimizer settings comparison with peer models.
 
- | | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
+ | | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|---------------------|--------------------|--------------------|
- | warmup steps | 5000 | 2000 | 2000 | 1000 |
- | peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
- | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
- | weight decay | 0.1 | 0.1 | 0.1 | 0.1 |
- | beta1 | 0.9 | 0.9 | 0.9 | 0.99 |
- | beta2 | 0.95 | 0.95 | 0.95 | 0.999 |
- | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
- | LR schedule | linear | cosine | cosine | cosine |
- | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
- | gradient reduce dtype | FP32 | FP32 | FP32 | BF16 |
- | optimizer state dtype | FP32 | most likely FP32 | FP32 | FP32 |
+ | warmup steps | 2500 | 5000 | 2000 | 2000 | 1000 |
+ | peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
+ | minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
+ | weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
+ | beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
+ | beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
+ | epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
+ | LR schedule | cosine | linear | cosine | cosine | cosine |
+ | gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
+ | gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
+ | optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |
 
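Taken together, the two tables above pin down the 7B optimizer setup for the July 2024 run: AdamW with betas (0.9, 0.95), epsilon 1.0E-5 and weight decay 0.1, a linear warmup of 2500 steps into a peak LR of 3.0E-4, cosine decay to a minimum LR of 3.0E-5, and gradient clipping at a global norm of 1.0. A minimal sketch of those settings in plain PyTorch follows; the total step count and the stand-in `model` are illustrative assumptions rather than values from the tables.

```python
import math
import torch

# Values from the 7B rows above; total_steps and the stand-in `model` are
# illustrative assumptions, not numbers taken from the tables.
peak_lr, min_lr = 3.0e-4, 3.0e-5
warmup_steps, total_steps = 2500, 500_000

model = torch.nn.Linear(4096, 4096)  # placeholder for the actual network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), eps=1.0e-5, weight_decay=0.1
)

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to the minimum LR."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Inside each training step, before calling optimizer.step():
for group in optimizer.param_groups:
    group["lr"] = lr_at(step=1_000)                      # example step index
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global clipping at norm 1.0
```

In a real training step the schedule would be evaluated once per optimizer step, with the clipping applied after the backward pass and before `optimizer.step()`.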
 