rskuzma committed
Commit 7e97fa4
1 parent: 5bca29e

update arXiv link

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
```diff
@@ -12,7 +12,7 @@ pipeline_tag: text-generation
 ---
 
 # Cerebras-GPT 13B
-Check out our [Blog Post](https://www.cerebras.net/cerebras-gpt). Our arXiv paper is coming soon!
+Check out our [Blog Post](https://www.cerebras.net/cerebras-gpt) and [arXiv paper](https://arxiv.org/abs/2304.03208)!
 
 ## Model Description
 
@@ -98,7 +98,7 @@ print(text_output[0])
 
 Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology. The Pile was cleaned using the ftfy library to normalize the text, then filtered using scripts provided by Eleuther.
 
-We tokenized the data using byte-pair encoding using the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
+We tokenized the data using byte-pair encoding using the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix A.1 of our paper.
 
 Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the Pile dataset size. Pythia was trained on both the standard dataset and deduplicated dataset to characterize the impact. Our models are trained on the standard Pile without deduplication, which may present an opportunity for further improvement with the deduplicated data set.
 
```
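
The tokenization paragraph in the second hunk maps directly to a few library calls. Below is a minimal sketch, assuming the Hugging Face `transformers` library and the `cerebras/Cerebras-GPT-13B` model ID this card uses elsewhere; the snippet is illustrative and not part of the commit.

```python
# Minimal sketch of the BPE tokenization step the README describes; this code
# is an illustration, not part of commit 7e97fa4.
from transformers import AutoTokenizer

# Cerebras-GPT reuses the GPT-2 byte-pair-encoding vocabulary.
tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-13B")

text = "Cerebras-GPT is trained on the Pile."
token_ids = tokenizer.encode(text)

print(tokenizer.vocab_size)                        # 50257, the GPT-2 vocabulary
print(token_ids)                                   # BPE token ids for `text`
print(tokenizer.convert_ids_to_tokens(token_ids))  # the byte-level BPE pieces
```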