Dataset

#3
by ehartford - opened

What dataset was this trained with?

Hi @ehartford ,

The dataset we use will be described in our upcoming technical report.
Stay tuned!


How soon are we getting that?
Can you just reveal the dataset size for now?

How soon are we getting that?

We estimate it will be released publicly next month.

Can you just reveal the dataset size for now?

I'll discuss it with our team and let you know ASAP.

Can you just reveal the dataset size for now?

We used approximately 3T tokens. The exact count and the dataset's construction will be described in the technical report.


Thanks, sounds interesting. I'm guessing you used generalising data as well.


Can you provide me your email or discord, I want to have a talk with you?


Sure, you can reach us by email at yi@01.ai.


Any update on the datasets? We're keeping track of LLM openness at https://opening-up-chatgpt.github.io, and Yi 34B Chat is currently in the bottom 5 (out of >30 'open' instruction-tuned models) by degree of openness, because so little of the source code, training data, instruction tuning, etc. is shared or documented.


ehartford changed discussion status to closed
