Open sourcing the instruction fine-tuning dataset and help on tokenizer

by aayush14

Hello,

Firstly, thank you very much for sharing this work. I wanted to ask whether you plan to open-source the dataset used in stage 2, the instruction fine-tuning. I think it would be very helpful for the community.

Also, did you create the tokenizer from scratch or build upon the base tokenizer? What corpus did you use for it, and how did you choose the parameters and which special tokens to add?

Best Regards

Thanks for the kind words!
We have open-sourced the instruction dataset.
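For anyone who wants to use it, here is a minimal sketch of pulling an instruction dataset from the Hugging Face Hub with the `datasets` library. The repo ID below is hypothetical; substitute the actual dataset ID from the release.

```python
# Minimal sketch: load an instruction dataset from the Hugging Face Hub.
# "your-org/instruction-dataset" is a hypothetical placeholder ID;
# replace it with the dataset ID the authors actually published.
from datasets import load_dataset

ds = load_dataset("your-org/instruction-dataset", split="train")
print(ds[0])  # inspect one instruction/response record
```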
As for the tokenizer, we use the original Llama 2 tokenizer without any modification, at least in this version.
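For reference, a minimal sketch of loading the unmodified Llama 2 tokenizer with `transformers`. The checkpoint ID is an assumption (any Llama 2 checkpoint ships the same tokenizer), and access to the meta-llama repos is gated behind Meta's license.

```python
# Minimal sketch: load the stock Llama 2 tokenizer, no modifications.
# The checkpoint ID is an assumption; any Llama 2 checkpoint works,
# but the meta-llama repos require accepting Meta's license first.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Inspect the special tokens the base tokenizer already defines.
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)
print(tokenizer("Hello, world!").input_ids)
```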
Any updates will be announced on my GitHub repo.

Best Regards
