Is the Instruct version also commercial?

#18
by jianguozhang001 - opened

Is this instruct version also apache-2.0?

Yes, it says Apache 2.0.

@johnwick123forevr Thanks :). I am curious: does the main improvement come from the base model? The instruction model is fine-tuned on a variety of publicly available conversation datasets. However, many public instruction datasets are non-commercial and are not used here.

Yeah, the main improvement comes from the base model, which was probably trained on better data, I suppose, since the base model performs similarly well.

That makes sense, although I don't know anything about the data.

Yeah, they didn't release the data yet but might later. For now, Mistral AI says that they trained it on public data.

It's possible. Phi 1.5, for example, which was trained on pure GPT-4 data (which is high quality) and has only 1.5 billion parameters, came close to the 7B Llama v2 and beats Llama v1, I believe.

If Mistral was trained on high-quality data, it could also be pretty great?

@johnwick123forevr That's a problem with licensing then, though. Commercial use is explicitly prohibited if you train on OpenAI's models' outputs. So I think @jianguozhang is asking the right question. I asked them on Twitter which datasets they trained on, because that in itself determines whether commercial use is allowed or not. I have seen too many other model builders who simply put the Apache 2.0 label on their model even though they train on synthetic OpenAI data. That is not allowed, and it leads to people using these models for commercial deployment, potentially leading to lawsuits by OpenAI because they are using models that are NOT allowed to be used commercially but which were mislicensed by the model builders. This is a dangerous trend... We always yell at BigCorp for transparency, and rightly so. But then many open-source builders are not transparent themselves either, and I suspect often deliberately so, to ensure that their models are used by many despite being mislicensed.

True. Of course, they might or might not have been trained on OpenAI data. They did not release the data, so we don't know.

Llama 2 as well was most likely pretrained on OpenAI output (it makes references to OpenAI when outputting text) and on other random data that should be under very non-commercial licenses. But if they don't release the data, nothing really happens to them.

I agree. It is likely that the big players are all using data that should not be used in that way (licensed books, GDPR material, OpenAI API, etc.) so it is very frustrating that people who want to be open about what they are doing are restricted in that sense. I just wish everyone, also BigCorp, was more transparent about what they do.

Even OpenAI probably relied heavily on non-commercial books or text. They used a massive part of the web, much of which is probably non-commercial.

There really isn’t an easy way to be fully commercial.
