GPT-4 Output in Training Data

#10
by alfredplpl - opened

The model’s license is listed as Apache 2.0, but the training data (such as teknium/OpenHermes-2.5) includes outputs from GPT-4. How should this be interpreted, especially in terms of commercial use?

Or is the output from GPT-4 being removed from the training data?

Hi @alfredplpl ,

Most current models are, in a sense, unavoidably trained on GPT-4 outputs, simply because GPT-4-generated text is all over the internet and thus present in recent Common Crawl dumps.

If you really care about that, you can start from the base model, Idefics2-base, which has not been trained on GPT-4 outputs, and redo the fine-tuning on the data you want.
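
As a rough sketch of the data-selection part of that, assuming each training record carries a hypothetical `source` field describing where it came from (this is an assumption for illustration, not necessarily how OpenHermes-2.5 or any other real dataset is laid out), you could drop the samples you don't want before fine-tuning:

```python
from typing import Iterable

# Hypothetical provenance markers to exclude; adjust to your dataset's real metadata.
EXCLUDED_MARKERS = ("gpt-4", "gpt4")


def is_allowed(record: dict, source_key: str = "source") -> bool:
    """Return True if the provenance field mentions none of the excluded markers."""
    source = str(record.get(source_key) or "").lower()
    return not any(marker in source for marker in EXCLUDED_MARKERS)


def filter_records(records: Iterable[dict], source_key: str = "source") -> list[dict]:
    """Keep only the records whose provenance passes is_allowed."""
    return [r for r in records if is_allowed(r, source_key)]


# Toy records, made up for illustration only.
samples = [
    {"text": "a", "source": "human-written"},
    {"text": "b", "source": "GPT-4 synthetic"},
    {"text": "c", "source": None},
]
kept = filter_records(samples)
```

Note that this sketch keeps records with no provenance information at all; if you need a stricter guarantee, you may prefer to drop those as well.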

Thank you for your response. I see, that's an interesting perspective. No problem at all. Thank you also for informing me about how to handle it.

alfredplpl changed discussion status to closed
