allenai/c4
This tokenizer is released under the Apache License 2.0.
This tokenizer was trained on a mixture of publicly available English text datasets:
FineWeb, FineWeb-Edu, and FinePDFs-Edu are released under the Open Data Commons Attribution License (ODC-By) v1.0 and are subject to the Common Crawl Terms of Use.