Dataset
What dataset was this trained with?
Hi @ehartford ,
The dataset we use will be described in our upcoming technical report.
Stay tuned!
How soon are we getting that?
Can you just reveal the dataset size for now?
How soon are we getting that?
It is estimated to be released to the public next month.
Can you just reveal the dataset size for now?
I'll discuss it with our team and let you know ASAP.
Can you just reveal the dataset size for now?
We used approximately 3T tokens. The exact number and how the dataset was constructed will be described in the technical report.
Any update on the datasets? We're keeping track of LLM openness at https://opening-up-chatgpt.github.io, and Yi 34B Chat is currently in the bottom 5 (out of >30 'open' instruction-tuned models) by degree of openness because so little of the source code, training data, instruction tuning, etc. is shared or documented.