jfrankle committed on
Commit
0a3263c
1 Parent(s): f438eeb

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -62,7 +62,7 @@ Apache-2.0 (commercial use permitted)
 
  ## Documentation
 
- * [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](www.mosaicml.com/blog/mpt-7b)
+ * [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
  * [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
  * Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)!
 
@@ -140,14 +140,14 @@ The model was trained for 1T tokens (with batch size 1760 and sequence length 20
  | Data Source | Number of Tokens in Source | Proportion | Effective Number of Tokens | Epochs |
  |-------------|----------------------------|------------|----------------------------|--------|
  | mC4 3.1.0 - English | 417.99 B | 0.33 | 330 B | 0.14 |
- | C4 - English - SemDedup 80% | 100.42 B | 0.299 | 299 B | 2.98 |
+ | C4 - English - SemDedup 80% | 100.42 B | 0.294 | 294 B | 2.93 |
  | RedPajama - CommonCrawl | 878.45 B | 0.1 | 100 B | 0.11 |
  | The Stack - Selected Languages | 463.78 B | 0.1 | 100 B | 0.22 |
- | RedPajama - Wikipedia | 24.84 B | 0.04 | 40 B | 1.61 |
+ | RedPajama - Wikipedia - En | 4.87 B | 0.04 | 40 B | 8.21 |
  | The Stack - Markdown | 107.07 B | 0.035 | 35 B | 0.33 |
- | S2ORC | 48.85 B | 0.033 | 33 B | 0.68 |
- | RedPajama - Books | 26.02 B | 0.03 | 30 B | 1.15 |
- | RedPajama - arXiv | 28.10 B | 0.019 | 19 B | 0.04 |
+ | S2ORC | 48.85 B | 0.035 | 35 B | 0.72 |
+ | RedPajama - Books | 26.02 B | 0.033 | 33B | 1.27 |
+ | RedPajama - arXiv | 28.10 B | 0.019 | 19 B | 0.68 |
  | RedPajama - StackExchange | 20.54 B | 0.014 | 14 B |0.68 |
 
  Samples for each batch were selected from one of the datasets with the probability specified above.
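
The updated rows can be sanity-checked with a little arithmetic. The sketch below assumes that "Effective Number of Tokens" is the proportion times the 1T-token budget named in the hunk header, and that "Epochs" is that effective count divided by the source size. Neither formula is stated in the README, though the five rows changed by this commit are consistent with both. The helper names (`effective_tokens_b`, `epochs`, `sample_source`) are hypothetical, and the sampling function only illustrates per-sample weighted selection; it is not the training code's actual data loader.

```python
import random

# Assumption: total training budget of 1T tokens, expressed in billions.
TOTAL_TOKENS_B = 1000.0

# (tokens in source, in billions; sampling proportion) for the rows
# changed by this commit, copied from the "+" lines.
MIX = {
    "C4 - English - SemDedup 80%": (100.42, 0.294),
    "RedPajama - Wikipedia - En": (4.87, 0.04),
    "S2ORC": (48.85, 0.035),
    "RedPajama - Books": (26.02, 0.033),
    "RedPajama - arXiv": (28.10, 0.019),
}


def effective_tokens_b(proportion, total_b=TOTAL_TOKENS_B):
    """Tokens (in billions) drawn from a source over the whole run."""
    return proportion * total_b


def epochs(source_b, proportion, total_b=TOTAL_TOKENS_B):
    """Number of passes over a source implied by its sampling proportion."""
    return effective_tokens_b(proportion, total_b) / source_b


def sample_source(mix, rng):
    """Pick one data source with probability proportional to its weight.

    Hypothetical illustration of 'selected ... with the probability
    specified above'; weights need not sum to 1 for random.choices.
    """
    names = list(mix)
    weights = [mix[name][1] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    for name, (source_b, proportion) in MIX.items():
        print(f"{name}: {epochs(source_b, proportion):.2f} epochs")
    print(sample_source(MIX, random.Random(0)))
```

Under these assumptions, every row touched by the commit reproduces its new Epochs value to two decimals, which suggests the edit brought those rows into internal agreement.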