Change dataset for licence reasons

#2
by Blublu3D - opened

@TheBloke Hello, would it be possible to have a GPTQ version of Vigogne 2 7B and 13B with this dataset:
https://huggingface.co/datasets/Kant1/French_Wikipedia_articles
My hardware does not allow me to do the conversion myself 😖

The diverse_french_news dataset is unlicensed, which is problematic.

Thank you! 🤗🥰

I think you're confusing the GPTQ calibration dataset for the model training dataset.

The model wasn't trained on Diverse French News. That's the dataset I picked, pretty much at random (it was the first French dataset I found), for the GPTQ calibration process. No part of that dataset is in the model; it's only used to calibrate the GPTQ quantisation so that the quantised model is of higher quality.
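To illustrate the distinction, here is a minimal sketch in Python. It is not the GPTQ algorithm itself, only a toy stand-in for the general idea: calibration samples are used to decide how the weights get rounded, and they are not stored in the result. All names and numbers here (`quantise`, `calibrate`, the weights, the inputs, the candidate scales) are hypothetical and chosen purely for illustration.

```python
# Toy illustration only: NOT the GPTQ algorithm. It shows that calibration
# data guides the rounding choice but never ends up in the quantised weights.

def quantise(weights, scale, levels=7):
    """Round each weight to the nearest multiple of `scale`, clamped to +/- levels steps."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]

def output_error(weights, q_weights, calibration_inputs):
    """Sum of squared differences between layer outputs before and after quantisation."""
    err = 0.0
    for x in calibration_inputs:
        full = sum(w * xi for w, xi in zip(weights, x))
        quant = sum(q * xi for q, xi in zip(q_weights, x))
        err += (full - quant) ** 2
    return err

def calibrate(weights, calibration_inputs, candidate_scales):
    """Pick the scale whose quantised weights best reproduce the calibration outputs."""
    best = min(
        candidate_scales,
        key=lambda s: output_error(weights, quantise(weights, s), calibration_inputs),
    )
    return best, quantise(weights, best)

# Hypothetical layer weights and calibration samples (assumed values).
weights = [0.42, -1.30, 0.07, 0.88]
calibration_inputs = [[1.0, 0.5, -0.2, 0.3], [0.1, -0.4, 0.9, 0.6]]
candidate_scales = [0.05, 0.1, 0.2, 0.4]

scale, q_weights = calibrate(weights, calibration_inputs, candidate_scales)
```

The output is just a scale and a list of rounded weights. A different calibration set could lead to a different rounding choice, but in neither case are the samples themselves kept in the model.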

As for which dataset(s) were used to train the Vigogne models, I believe you can see them in their GitHub repository, here: https://github.com/bofenghuang/vigogne/tree/main/data

They have just dumped JSON files into their GitHub repository, rather than linking to datasets on Hugging Face, so you'll need to figure out exactly what those datasets are and therefore what licenses they use.

If you would like a Vigogne model trained on a different dataset, please contact the Vigogne team.

Thank you for your response and your great work.
Here is how I understand the different parts of the LLM:

  • The base model is Llama 2. This model is OK for commercial use.
  • It was fine-tuned as bofenghuang/vigogne-2-13b-instruct; bofenghuang uses open-source data and their own data for training. This model is OK for commercial use.
    The licensing problem for commercial use is with diverse_french_news.

Yes, the data is not included in the models, but the quantisation is based on data which is not free, and the legal question is unclear.
To avoid any problems, could you please do some training with French_Wikipedia_articles? 🙂
