@BramVanroy on Hugging Face: "🥳 New license for datasets: Apache 2.0! I have been struggling mentally for…"

BramVanroy

posted an update Apr 19, 2024

🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), and did the same for models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models under Apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adoption of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through the Azure docs, this allows me to change all recent GPT-4-generated datasets to Apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what to do with the older GPT-3.5 datasets. What do you think I should do?

JorgeDeC

Apr 19, 2024

Great, thank you very much!
We were in the process of translating the original ultrachat and ultrafeedback datasets to Dutch ourselves, using models that permit commercial use.

But now we don't have to. Looking forward to using this!

BramVanroy

Apr 19, 2024

Cool! Looking forward to what you'll build with this!

Rijgersberg

Apr 19, 2024

edited Apr 19, 2024

If you decide not to change the license on the GPT-3.5 datasets, you should at least market the hell out of the compliance of your models.

There is a school of thought that basically considers every model from GPT-2 onwards an abomination in the legal sense. They are waiting for the coming (AI Act) Reckoning that will wipe the competitive field clean of the current ruling class of models and players.

I personally don't necessarily subscribe to that, but in that scenario strict compliance is a major competitive advantage. You should exploit it to the fullest.

See https://huggingface.co/blog/Pclanglais/common-corpus and https://twitter.com/Dorialexander for inspiration.

BramVanroy

Apr 19, 2024

What do you mean by compliance in this context? I'm not sure how I can market being non-commercial as a good thing 😅
