avi-skowron committed on
Commit
d37e168
•
1 Parent(s): f20536c

fix batch sizes and add paper

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -7,11 +7,12 @@ tags:
7
  - pythia
8
  license: apache-2.0
9
  datasets:
10
- - the_pile
11
  ---
12
 
13
  The *Pythia Scaling Suite* is a collection of models developed to facilitate
14
- interpretability research. It contains two sets of eight models of sizes
 
15
  70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
16
  models: one trained on the Pile, and one trained on the Pile after the dataset
17
  has been globally deduplicated. All 8 model sizes are trained on the exact
@@ -53,6 +54,8 @@ with exact parameter counts.
53
  - Language: English
54
  - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
55
  for training procedure, config files, and details on how to use.
 
 
56
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
57
  - License: Apache 2.0
58
  - Contact: to ask questions about this model, join the [EleutherAI
@@ -66,10 +69,10 @@ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
66
  | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
67
  | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
68
  | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
69
- | 160M | 85,056,000 | 12 | 768 | 12 | 4M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
70
- | 410M | 302,311,424 | 24 | 1024 | 16 | 4M | 3.0 x 10<sup>-4</sup> | OPT-350M |
71
  | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
72
- | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 4M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
73
  | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
74
  | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
75
  | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
 
7
  - pythia
8
  license: apache-2.0
9
  datasets:
10
+ - EleutherAI/pile
11
  ---
12
 
13
  The *Pythia Scaling Suite* is a collection of models developed to facilitate
14
+ interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
15
+ It contains two sets of eight models of sizes
16
  70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
17
  models: one trained on the Pile, and one trained on the Pile after the dataset
18
  has been globally deduplicated. All 8 model sizes are trained on the exact
 
54
  - Language: English
55
  - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
56
  for training procedure, config files, and details on how to use.
57
+ [See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
58
+ details.
59
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
60
  - License: Apache 2.0
61
  - Contact: to ask questions about this model, join the [EleutherAI
 
69
  | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
70
  | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
71
  | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
72
+ | 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
73
+ | 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
74
  | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
75
+ | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
76
  | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
77
  | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
78
  | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
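
As a quick consistency check on the table above, the sketch below recomputes the Non-Embedding Params column from the Layers and Model Dim columns. The formula is an assumption, not something stated in this README: it supposes a standard GPT-style block with biases (attention QKV and output projections, a 4x-wide MLP, two LayerNorms) plus one final LayerNorm. The names `PYTHIA_TABLE`, `non_embedding_params`, and `head_dim` are hypothetical helpers introduced for illustration.

```python
# (layers, model_dim, heads, reported non-embedding params), copied from the table.
PYTHIA_TABLE = {
    "70M": (6, 512, 8, 18_915_328),
    "160M": (12, 768, 12, 85_056_000),
    "410M": (24, 1024, 16, 302_311_424),
    "1.0B": (16, 2048, 8, 805_736_448),
    "1.4B": (24, 2048, 16, 1_208_602_624),
    "2.8B": (32, 2560, 32, 2_517_652_480),
    "6.9B": (32, 4096, 32, 6_444_163_072),
    "12B": (36, 5120, 40, 11_327_027_200),
}


def non_embedding_params(layers: int, d_model: int) -> int:
    """Parameter count under the ASSUMED layer layout (not taken from the README).

    Per layer:
      attention QKV      3*d^2 + 3*d
      attention output     d^2 +   d
      MLP up (d -> 4d)   4*d^2 + 4*d
      MLP down (4d -> d) 4*d^2 +   d
      two LayerNorms             4*d
    which sums to 12*d^2 + 13*d. One final LayerNorm adds 2*d.
    """
    per_layer = 12 * d_model**2 + 13 * d_model
    final_norm = 2 * d_model
    return layers * per_layer + final_norm


def head_dim(d_model: int, heads: int) -> int:
    """Per-head dimension, i.e. Model Dim divided by Heads."""
    return d_model // heads


if __name__ == "__main__":
    for name, (layers, d_model, heads, reported) in PYTHIA_TABLE.items():
        computed = non_embedding_params(layers, d_model)
        print(name, computed, computed == reported, head_dim(d_model, heads))
```

Under these assumptions the computed counts agree with the reported column for all eight sizes, which suggests the table's Layers and Model Dim values are internally consistent with its parameter counts.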