---
language:
- en
---

V1 of an English/code tokenizer. Byte-level BPE, 64k vocab. Equal mix between:

On the NL side:
- Books
- C4
- v1 of our CC (helen quality classifier)
- enwiki
- Gutenberg
- Reddit

On the code side:
- Jupyter notebooks (0.5 weight, it was small)
- GH issues
- Stackexchange
- The cleaned Python Stack

For a total of 1/3 code data (although there is a lot of English in Stackexchange and GH).
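
As a rough check on the "1/3 code" figure, here is a minimal sketch of the mixture arithmetic. It assumes that "equal mix" means every source gets a sampling weight of 1, with Jupyter notebooks at 0.5 as stated. The dictionary names and the `code_fraction` helper are illustrative only and are not part of the tokenizer training code.

```python
from fractions import Fraction

# Assumed per-source sampling weights: 1 each, except Jupyter notebooks at 0.5.
NL_WEIGHTS = {
    "Books": Fraction(1),
    "C4": Fraction(1),
    "CC v1": Fraction(1),
    "enwiki": Fraction(1),
    "Gutenberg": Fraction(1),
    "Reddit": Fraction(1),
}
CODE_WEIGHTS = {
    "Jupyter notebooks": Fraction(1, 2),
    "GH issues": Fraction(1),
    "Stackexchange": Fraction(1),
    "Python Stack (cleaned)": Fraction(1),
}


def code_fraction(nl_weights, code_weights):
    """Return the share of total sampling weight assigned to code sources."""
    code_total = sum(code_weights.values())
    nl_total = sum(nl_weights.values())
    return code_total / (code_total + nl_total)


share = code_fraction(NL_WEIGHTS, CODE_WEIGHTS)
print(share, round(float(share), 2))
```

Under these assumptions the code share comes out to 7/19, or about 0.37, which is in line with the approximate 1/3 quoted above. The share of actual code tokens is probably lower, since Stackexchange and GH issues contain a lot of English.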