Is this REALLY viable for commercial use 🤔?

#13
by dataviral - opened

Hi, based on the description here: https://huggingface.co/tiiuae/falcon-40b-instruct

Falcon-40B-Instruct is a 40B parameters causal decoder-only model built by TII based on Falcon-40B and finetuned on a mixture of Baize. It is made available under the Apache 2.0 license.

The model is licensed under Apache 2.0 while also being trained on data from Baize. Baize is under GPL 3.0, and its GitHub repository (https://github.com/project-baize/baize-chatbot) explicitly states that the data should not be used for commercial applications.

Can someone clarify? Also, what does "mixture of Baize" mean?

Thanks


I'm not a lawyer, but at a glance Baize seems legally vague: they explicitly licensed it under GPL 3.0, which DOES allow commercial use, and yet they claim in the README that commercial use isn't permitted. They need to use the appropriate license for their project and make sure their messaging is consistent. I would imagine that the "LICENSE" text takes precedence over the "README" if this were argued in court, though again, I'm not a lawyer.

So even if Falcon had to be relicensed under GPL 3.0, it would still be viable for commercial use.

I have the same doubt about whether the instruct variant is a viable option for commercial use, given the Baize mixture data. The HF model page for falcon-40B explicitly says "It is made available under a permissive Apache 2.0 license allowing for commercial use, without any royalties or restrictions." but the instruct variant's page only mentions the "Apache 2.0 license". It would be great if someone from the Falcon team could clarify this.
@FalconLLM: Looking forward to your response.

There are several important points here:

  1. The license is clear: if it's licensed as GPL 3, you can use it commercially. The README does not matter, and you are not required to read it either.
  2. You are not using Baize anyway. Even if it were non-commercial and not GPL-3, you would not be the one using it.
    The Falcon team used it for a fine-tune. You can hardly claim, legally, that fine-tuning a model on something makes the model fall under that license.
    That's not the case.
  3. If the model were able to reproduce relevant parts of "Baize", then those generations would fall under GPL-3 as well.

I don't see any problem.

Thanks for weighing in. The ground is still too murky for this to actually be used in real commercial applications by practitioners like myself.
Hoping the @FalconLLM team can clarify.

I have the same worry. I think the problem is not that Baize's implementation is GPL-3 licensed, but that its dataset is ChatGPT-generated. OpenAI's terms of use contain a clear usage restriction: "(iii) use output from the Services to develop models that compete with OpenAI; "
So one should not use output from ChatGPT to develop models that compete with ChatGPT/GPT-4.

That data was only used for the instruct variant. Also, you did not break their terms, and you do not have to accept them, so it would not be your problem. The only risk involved in breaking those terms is that you can lose your account at OpenAI.

Well, and if your product claims to be developed by OpenAI you also have a problem, so you’d need to add a content filter that removes that word from the output.
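For what it's worth, such a content filter could be as simple as the rough sketch below. The names `BLOCKED_TERMS` and `filter_output` and the `[redacted]` placeholder are made up for illustration; nothing here comes from the Falcon or Baize code, and a real product would probably need something more careful than a regex.

```python
import re

# Hypothetical list of terms to scrub from generated text (an assumption,
# not something specified anywhere in the Falcon or Baize repositories).
BLOCKED_TERMS = ["OpenAI"]

# One case-insensitive pattern that matches any blocked term literally.
_PATTERN = re.compile(
    "|".join(re.escape(term) for term in BLOCKED_TERMS),
    re.IGNORECASE,
)


def filter_output(text: str, replacement: str = "[redacted]") -> str:
    """Replace every blocked term in the model output with a placeholder."""
    return _PATTERN.sub(replacement, text)


if __name__ == "__main__":
    print(filter_output("I am a language model developed by OpenAI."))
```

You would run this on every generation before showing it to the user. It only hides the word; it does not change what the model was trained on.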

But seriously: just fine-tune it again.
