Exact training data used?

#1
by nlpguy - opened

Thanks for this amazing model. Is there an exact breakdown by source of the 1T tokens used for training, or a list of the specific public corpora that were used?

H2O.ai org

Please take a look at the updated section in the technical report: https://arxiv.org/abs/2401.16818

psinger changed discussion status to closed
