Which dataset was used to train this model?

#1
by cekal - opened

The responses seem to be really great! Could you please provide the dataset + the training code? I'd like to try training togethercomputer/RedPajama-INCITE-Base-7B-v0.1 (more permissive license).

It looks like the data and code for training are provided on the project's GitHub: https://github.com/project-baize/baize-chatbot

The data is 2 months old; it was used for v1 training. v2 is different.

Is this not the data collector for v2?

54K/57K/47K dialogs from Quora, StackOverFlow and MedQuAD questions
The code for collecting self-chat data: v1, v2
The code for training Baize
The code for chat model demo (forked from ChuanhuChatGPT)

They updated the repo about a week ago so you may have missed it.

I think @cekal means the dataset itself, not the collector.
