arxiv:2201.07311

Datasheet for the Pile

Published on Jan 13, 2022

Abstract

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
