Alpaca in dataset

#1
by morganpie - opened

Alpaca is in the dataset. Is this model still suitable for commercial use?

OpenAssistant org

Ask a lawyer, but I think yes. OpenAI trained on huge swathes of the internet with no consideration of the license or usage terms of that content, and they use their models commercially. There seems to be no reason models trained on their outputs would be treated any differently.

This comment has been hidden
OpenAssistant org

> OpenAI trained on huge swathes of the internet with no consideration of the license or usage terms of that content, and they use their models commercially. There seems to be no reason models trained on their outputs would be treated any differently.

Sorry, but just because they could be breaking the law doesn't make it legal for anyone else. They can train on all kinds of data because their sources are kept secret, so it's very hard to prove they actually gathered sensitive data, and even harder to prove where they got that data from in the first place. In our case, all of the datasets used and their sources are visible; there is no such loophole.

However, Alpaca may still be OK; we could really use some help from a pro bono lawyer on data usage.

> (c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API, use any automated or programmatic method to extract data or output from the Services, including scraping, web harvesting, or web data extraction; (v) represent that output from the Services was human-generated when it is not or otherwise violate our Usage Policies; (vii) buy, sell, or transfer API keys without our prior consent; or (viii), send us any personal information of children under 13 or the applicable age of digital consent. You will comply with any rate limits and other requirements in our documentation. You may use Services only in geographies currently supported by OpenAI.

Isn't this clearly stated in OpenAI's terms? A straight NO?

> Isn't this clearly stated in OpenAI's terms? A straight NO?

I agree with @AIapprentice: if the Alpaca dataset is indeed included in the training data, then this model should definitely not be used for anything except research purposes.

I don't get why it was included, though, as I thought the idea was not to use any OpenAI model in the construction of Open Assistant?

There shouldn't be any Alpaca data in the dataset, IIRC, @morganpie?

There isn't any Alpaca data in the dataset, @morganpie? Could you please share your evidence?

@SummerSigh It is clearly stated in the model card that Alpaca is included in the dataset.

OpenAssistant org

Although Ollie makes a good point, it seems that, as of now, OpenAI has not enforced its policy.

> Although Ollie makes a good point, it seems that, as of now, OpenAI has not enforced its policy.

I don't think this is a good argument, as it poses a lawsuit risk for any large company that starts using this model. If Alpaca was used to train the model, we should clearly mark it as non-commercial.

OpenAssistant org
This comment has been hidden

@SummerSigh I don't think it matters whether OpenAI enforces their policy or not. Any legit business violating OpenAI's terms would have a really hard time with their lawyers during the auditing process.

Yes, the licensing of Internet content has never been clear enough. But not every company has OpenAI's resources to fight potential lawsuits; most companies would just avoid trouble like this.

So I think a model trained without Alpaca would be a better fit for commercial use.

OpenAssistant org

> Although Ollie makes a good point, it seems that, as of now, OpenAI has not enforced its policy.

> I don't think this is a good argument, as it poses a lawsuit risk for any large company that starts using this model. If Alpaca was used to train the model, we should clearly mark it as non-commercial.

There are also several arguments about whether training a model on licensed data makes that model subject to that license. This mostly centers on the definition of the adapted work to which the license applies, in terms of making that new work non-commercial. Here is what CC BY-NC 4.0 says about that definition:

"Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image."

Now, one could argue that our models were "derived from" this data, although there is no easy way to see how. Since oasst uses multiple sources of data, tracking the weights that are affected by this dataset during backprop would be difficult, although not impossible. The simplest way to settle this argument would be not to include this dataset, which is why I was confused. I'm not a lawyer, so I may be completely mistaken, but in this case it will come down to arguing whether the model falls under Adapted Material.

OpenAssistant org

Are you guys out of your god damn minds?

Are you seriously PUBLICLY debating whether or not we should WILLINGLY BREAK THE LAW because it might not be enforced at some point in the future?

Do you really want to drag OA to the ground by allowing these companies to dig up dirt on it? Do you want people to gossip about how the model only works because of data from OpenAI? Do you want people to never believe that everything we do is truly open source just because "there are benefits to using the data"?

Are you willing to put EVERY SINGLE USER at risk of getting their future models banned because they unknowingly used our models, which were never really open source in the first place, just because "one could argue it's from a derived model"?

Fucking THINK

> @SummerSigh I don't think it matters whether OpenAI enforces their policy or not. Any legit business violating OpenAI's terms would have a really hard time with their lawyers during the auditing process.

> Yes, the licensing of Internet content has never been clear enough. But not every company has OpenAI's resources to fight potential lawsuits; most companies would just avoid trouble like this.

> So I think a model trained without Alpaca would be a better fit for commercial use.

As mentioned, feel free to confirm with a lawyer, but a company using this model would not seem to be violating OpenAI's terms.

First, any company that has not signed up for an OpenAI account has never agreed to the terms in the first place.

Second, even if they have agreed to the terms, the terms say "use output from the Services to develop models". Downloading this model and running inference with it, or finetuning it with non-OpenAI data, does not seem to violate that rule.

This is even after making the extremely generous assumption that the part of the OpenAI ToS in question is legally enforceable. It would seem OpenAI is already aware that it isn't, which is likely why they are not attempting to enforce it against anyone, not even Google when they trained on ChatGPT outputs.

OpenAssistant org

If people are concerned about an extremely tiny part of the total training mass called "Alpaca", then they should maybe also look at all the TBs of material in the Pile, on which Pythia was trained. The same applies to basically all existing LLMs (very likely also OpenAI's models). Falling on our knees in front of OpenAI while happily processing the images and texts of millions of other humans (artists, software developers, etc.) seems unreasonable to me. Someone would first need to establish in court that running a single gradient descent step on presented data actually is a copyright infringement.
And, by the way, what would that mean in general? The internet will soon be flooded with GPT-4 output (more or less edited by humans). Does that mean OpenAI can now claim ownership of the whole internet because they "infected" it with data produced by their service? It would be a pretty bad outcome if all future Word documents produced by Clippy 2.0 were effectively "owned" by Microsoft...

@sedthh I don't understand where your frustration comes from.

First of all, every single person in this thread appreciates Open Assistant's work and the effort from the community. We also want to incorporate the model one way or another.

Open Assistant claims to be open source, and advertises how DIFFERENT it is from OpenAI. Am I correct? If so, why can't we openly discuss a potential issue that may cause problems for users? Every user deserves to know exactly what they are getting, right?

So you believe that if we all keep our mouths shut, we can fly under the radar and any potential problem can be covered up?? lol

@andreaskoepf I have no idea what licenses the TBs of material in the Pile are under. But as @SummerSigh pointed out, the Alpaca dataset is under CC BY-NC 4.0 (allowing only non-commercial use).

I just think it would save a lot of trouble to train the Pythia-based model on datasets excluding Alpaca, so we can be in the same boat as almost all other "open-sourced" models. Not sure if this would hurt performance by a large margin, though.
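For what it's worth, the source-level exclusion this suggests is straightforward to sketch. The `source` field and the dataset labels below are hypothetical, purely to illustrate filtering a mixed instruction-tuning set by provenance before fine-tuning; real training configs label provenance differently:

```python
# Sketch: drop Alpaca-derived examples from a mixed instruction-tuning set
# before fine-tuning. The `source` field and dataset labels are hypothetical.

def exclude_sources(examples, banned=("alpaca",)):
    """Keep only examples whose provenance tag is not in `banned`."""
    return [ex for ex in examples if ex.get("source") not in banned]

mixed = [
    {"text": "...", "source": "oasst1"},
    {"text": "...", "source": "alpaca"},   # ChatGPT-derived, CC BY-NC 4.0
    {"text": "...", "source": "dolly15k"},
]

commercial_safe = exclude_sources(mixed)
print([ex["source"] for ex in commercial_safe])  # ['oasst1', 'dolly15k']
```

The same idea applies at any scale: as long as every example carries a provenance tag, an Alpaca-free training mix is a one-line filter rather than a new data-collection effort.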

OpenAssistant org

> @andreaskoepf I have no idea what licenses the TBs of material in the Pile are under. But as @SummerSigh pointed out, the Alpaca dataset is under CC BY-NC 4.0 (allowing only non-commercial use).

> I just think it would save a lot of trouble to train the Pythia-based model on datasets excluding Alpaca, so we can be in the same boat as almost all other "open-sourced" models. Not sure if this would hurt performance by a large margin, though.

The point that @andreaskoepf is alluding to is that many of the texts in the Pile are also under non-commercial licenses, as well as a myriad of other licenses, each with its own quirks. Putting this in context: almost all LLMs are trained on texts under various licenses, including non-commercial ones. Pythia, which is trained on the Pile, has the same issues we are discussing now. This applies to almost every single LLM in existence.

Confirming that what @AIapprentice is saying is true... Any business of significant size (i.e. with lawyers) is going to say NO until a precedent is clearly set.

My guess is also that if anyone could generate a dataset from any commercial model and easily create their own copy of the service that model provides, then using datasets in this way will likely not fly in court.

My vote would be for taking Alpaca out.

> The point that @andreaskoepf is alluding to is that many of the texts in the Pile are also under non-commercial licenses, as well as a myriad of other licenses, each with its own quirks. Putting this in context: almost all LLMs are trained on texts under various licenses, including non-commercial ones. Pythia, which is trained on the Pile, has the same issues we are discussing now. This applies to almost every single LLM in existence.

I understand this. But there's a slight difference here: Alpaca is ChatGPT output, and we are talking about dealing with a private company that is (sort of) a competitor. It's different from, say, a Wikipedia dataset under a non-commercial license.

If we remove Alpaca, we would be in the same boat as Google and Meta, since they all release "open source" models based on those piles of data. Even OpenAI may not pursue anything, since GPT is also based on that data.

It won't solve the issue completely, but it will make the due-diligence (DD) lawyers less annoying.

OpenAssistant org
edited Apr 17, 2023

> @AIapprentice So you believe that if we all keep our mouths shut, we can fly under the radar and any potential problem can be covered up?? lol

wonderful, thank you for willfully misquoting me as obvious trollbait

now we can all disregard your obviously harmful opinions with ease

@sedthh Sure thing. You're welcome :)

Dude, you are welcome to disregard whatever I said. Feel free to unfollow the thread if you want my opinion to disappear. lol~

OpenAssistant org

"The Pile: An 800GB Dataset of Diverse Text for Language Modeling"
https://arxiv.org/abs/2101.00027

"7.1 Legality of Content

While the machine learning community has begun to discuss the issue of the legality of training models on copyright data, there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law. As a step in that direction, we discuss the reasons we believe that our use of copyright data is in compliance with US copyright law.

Under pre (1984) (and affirmed in subsequent rulings such as aff (2013); Google (2015)), non-commercial, not-for-profit use of copyright media is preemptively fair use. Additionally, our use is transformative, in the sense that the original form of the data is ineffective for our purposes and our form of the data is ineffective for the purposes of the original documents. Although we use the full text of copyright works, this is not necessarily disqualifying when the full work is necessary (ful, 2003). In our case, the long-term dependencies in natural language require that the full text be used in order to produce the best results (Dai et al., 2019; Rae et al., 2019; Henighan et al., 2020; Liu et al., 2018).

Copyright law varies by country, and there may be additional restrictions on some of these works in particular jurisdictions. To enable easier compliance with local laws, the Pile reproduction code is available and can be used to exclude certain components of the Pile which are inappropriate for the user. Unfortunately, we do not have the metadata necessary to determine exactly which texts are copyrighted, and so this can only be undertaken at the component level. Thus, this should be taken to be a heuristic rather than a precise determination."

Also, in case people in this thread are not aware, there is also the original OA Pythia 12B, which was not trained on Alpaca (although it also used an earlier, smaller version of the OA dataset).

Would it be possible to offer another fine-tuned version of the Pythia 12B base model, this time without Alpaca? That way, people who want to use the fine-tuned model commercially and are not allowed (by their legal departments) to use Alpaca can choose to use the Alpaca-free model.

@markusdr and others who are interested:

H2O.ai has now released Pythia and GPT-NeoX models trained only on the Open Assistant dataset, making them fit for commercial use:

Pythia 12B: https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b
GPT-NeoX 20B: https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b

@saattrupdan Thank you! This is awesome, can't wait to test them out.
