Loubna ben allal
add files
c9e8e4a
raw
history blame
469 Bytes
[OPT](https://huggingface.co/facebook/opt-30b) was trained on the following 5 filtered datasets of textual documents, one of them includes code, [The Pile](https://arxiv.org/pdf/2101.00027v1.pdf), it used *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews*.
The final training data contains 180B tokens corresponding to 800GB of data. For more details please refer to this [paper](https://arxiv.org/abs/2205.01068)