Training dataset?

#26
by sweetamino - opened

Is the training dataset used available anywhere? I'd like to use it to fine tune RWKV.

Pygmalion org

Hey there! So, we use a variety of different sources for our dataset, including community contributions. We do plan to release a portion of these community contributions in the future, so do sit tight for updates on that.

Why only a portion???

I never understand why projects like Pygmalion won't open their datasets so that other language models can be trained with the same dataset. It just feels like you're keeping the dataset to yourself so no one else can produce a competing model using a better LLM as a base.

I've been thinking of starting a project to pull together an open Pygmalion style dataset, as right now, no such dataset exists. As I said in a discussion with someone yesterday, right now, you either have a really good LLM, or an LLM that is fine-tuned for role-playing, not both. There are some really great options like FLAN-T5, OPT, BLOOM, GPT-NeoX and LLaMA which are all good potential candidates for training, but with no dataset available to train them, the GPT-J based Pygmalion and Pyg-based mixes are the only options available.

So that leaves me with two questions:

  1. When are you looking at releasing this dataset?
  2. Will it be enough to realistically train any LLM to handle Pygmalion style role-playing?

It's good to hear you intend to release something, but hearing it's only going to be a portion of the community contributions is very disappointing.

During the creation of our dataset, we gave community contributors the option to keep their data private: that is, used only by us and not released to the public, for whatever reason that may be. When we say "a portion", we refer to the section of the data that was marked by contributors as "I consent to this data to be released to the public" - hence, the public portion. My apologies for not clarifying that earlier. As for "when", we sadly can't give a date on that due to complications with trying to redact any personal information which may have slipped into the public portion of the dataset. As for it being enough, we noted that the section that was marked as public had more data than the section that was marked as private. For all intents and purposes, it should be enough to realistically train any LLM to handle Pyg-style roleplaying.

DISCLAIMER: I'm unaffiliated with this project, but wanted to provide my two cents.

@TearGosling , I've been watching this thread for the past week. Thank you for responding; I appreciate the clarification. I do hope the public dataset used to train Pygmalion is released sooner or later (it doesn't have to be now, of course!). Pygmalion development seems relatively slow compared to the pacing of the rest of the AI world. Its data would be a great opportunity to train LoRAs on not only the recent models (Pythia Deduped, LLaMA), but also the newer ones once they're finished (StableLM, RedPajama).

***

Why only a portion??? [...] It just feels like you're keeping the dataset to yourself

@RazielAU , when you contribute chatlogs to the Pygmalion project, you have the option to choose whether or not they are included in the "public dataset" or remain private and only for use with the Pygmalion project, which may be what they were referring to by a portion of the community contributions. See here. I didn't see TearGosling's most recent response while typing this, but they've confirmed this.

You either have a really good LLM, or an LLM that is fine-tuned for role-playing, not both

According to its model card, Pygmalion 6B is a finetuned GPT-J 6B with a high-quality dataset, so the latter. They have a data toolbox, and their logbooks clarify when certain datasets are added to the model (example). Their transparency can be easily disregarded, but it does exist.

Also, theorizing as to why they haven't released it yet (which is confirmed to be because of privacy), similar chat models were released in the past. c1-6B and its successor convo-6B -- also trained on top of GPT-J 6B, with the latter also having its own data toolbox -- predate Pygmalion, but were later taken down due to privacy concerns (source).

With that said, I do wish you the best of luck with your Pygmalion-inspired dataset project. The more open solutions we have, the merrier.

TearGosling, Okay, that makes sense, nothing to do but wait then.

As for which LLMs I'm rooting for, I honestly think you guys should consider a Flan-T5-XXL (11B) fine-tune, it seems really capable and Google released it under a very permissive license (Apache 2.0). More work is needed on the Oobabooga side to support it, but if push comes to shove I'm sure it wouldn't be too hard to add. I messed around with some example code to see what it can do and it seemed really good.

And thanks @Merry for your insights as well. Based on what TearGosling is saying it sounds like they'll release enough so people can fine tune other LLMs, I'm happy with that, it's all I really want. LLMs are advancing at a rapid pace, and it's really the lack of a good dataset that is currently limiting people from experimenting.

Hey @TearGosling ,

As for "when", we sadly can't give a date on that due to complications with trying to redact any personal information which may have slipped into the public portion of the dataset. As for it being enough, we noted that the section that was marked as public had more data than the section that was marked as private.

If it helps, I've noticed something on the website people use to contribute their chatlogs: https://dump.nopanda.io/

The option to contribute to the public dataset is selected by default on my end, and I have to opt-out. This had also led to me uploading a roleplay to the public dataset by mistake. I'm led to believe many people didn't read the data usage agreement, or thought they were better off not touching the options.

I think it'd be a good idea to set the private dataset to be the default setting, with the public dataset being opt-in. With Pygmalion being a pretty popular project, I don't think you'll have any trouble starting a new dataset with this.

Pygmalion org

Hey, main developer chiming in to clarify a few points. The basic rundown is:

We do intend to release all the data where the contributor has given us their consent to do so. I am not going to leak people's private conversations, hence @TearGosling 's comment mentioning that we're only going to be releasing a portion of it.

The reason I haven't rushed to do this, however, is that in the userscript I wrote to actually generate the JSON files that people are contributing, I promised that their personal information was being redacted (username, display name, email and so on). As with all things in the software world though, there were a significant number of edge cases I did not anticipate, and so PII is present in the dataset despite my efforts.

This has understandably resulted in some people being upset, and several have directly reached out to me asking for their contributions to be removed from the public set. And these are just the people that actually took the time to read the data usage agreement, understand its implications, then scan their submissions to realize that this had happened. I am very certain that a significant portion of people fall under the category that @Merry described above, which is:

The option to contribute to the public dataset is selected by default on my end, and I have to opt-out. This had also led to me uploading a roleplay to the public dataset by mistake. I'm led to believe many people didn't read the data usage agreement

That being the case, I think it'd be very irresponsible on my end to just dump the data as-is for the whole internet to scrape and index, because there's no going back from that if I end up harming anyone's privacy. Still, I think a lot of conversational datasets out there are very low quality and releasing our set would be a massive plus to the general community around open chat models, so what I've been trying to do is further redact any PII that I missed before releasing anything. The problem I'm having is that if I'm too strict, there are too many false positives and the data is not usable for training. If I'm too lax, I'm leaking PII anyways.
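To make the strict-versus-lax trade-off above concrete, here is a minimal Python sketch. It is not the project's userscript or its actual redaction rules: the `redact` function, the `EMAIL_RE` pattern, the placeholders and the sample sentence are all hypothetical, and it assumes the contributor's name is known in advance.

```python
import re

# Hypothetical email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text, known_names, strict=False):
    """Replace emails and known names with placeholders.

    Lax mode: names match only as whole words, case-sensitively.
    Strict mode: names match case-insensitively, even inside other words.
    """
    text = EMAIL_RE.sub("<EMAIL>", text)
    if not known_names:
        return text
    # Longest names first so longer names win over their own prefixes.
    alternation = "|".join(
        re.escape(name) for name in sorted(known_names, key=len, reverse=True)
    )
    if strict:
        pattern = re.compile(alternation, re.IGNORECASE)
    else:
        pattern = re.compile(r"\b(?:" + alternation + r")\b")
    return pattern.sub("<USER>", text)


sample = "Will said he will email will@example.com, and WILL agreed willingly."

lax = redact(sample, ["Will"])
strict = redact(sample, ["Will"], strict=True)

print(lax)
# <USER> said he will email <EMAIL>, and WILL agreed willingly.
print(strict)
# <USER> said he <USER> email <EMAIL>, and <USER> agreed <USER>ingly.
```

In this toy case the lax pass leaves "WILL" in place, which would be a leak if it refers to the user, while the strict pass mangles the ordinary words "will" and "willingly", which is the kind of damage that makes text less useful for training.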

I'm working on this but have no ETA because this is not my job. I'm not paid to do this. We all do it for free in our spare time. Hell, multiple companies actually profit off of our free work by charging people to use our models. If my intent was to avoid competition I'd be sticking restrictive licenses on everything, or not even releasing anything to the public at all. So the fact that accusations like

It just feels like you're keeping the dataset to yourself so no one else can produce a competing model

Keep getting thrown against us despite how open we try to be about everything is pretty discouraging.


Also since I'm already here, I might as well respond to a few points:

I think it'd be a good idea to set the private dataset to be the default setting, with the public dataset being opt-in. With Pygmalion being a pretty popular project, I don't think you'll have any trouble starting a new dataset with this.

At this point I'm not adding any new contributions to our training set. CAI has made changes to their front-end that broke my userscript, so anything coming in nowadays is unusable. I also don't think starting from scratch is the right move - I think just being very careful about PII before releasing the existing set is enough. It's unfortunate that some people refused to read the data usage consent agreement (which is literally a couple of sentences) and accidentally contributed to the wrong set, but I don't think that's enough reason to hold back everyone else's contributions.

As for which LLMs I'm rooting for, I honestly think you guys should consider a Flan-T5-XXL (11B) fine-tune, it seems really capable and Google released it under a very permissive license (Apache 2.0).

I've considered it! Unfortunately, it has an absolutely tiny context window which is horrible for long-form chatting. The only Flan with a decent context window is the 20B one, and not only is that model big enough that a lot of people won't be able to use it at all, but it's an encoder-decoder arch which means it doesn't work with Kobold, ooga and a lot of other platforms people use.

As for the "better LLMs" you've mentioned in your first message: OPT has a restrictive license which doesn't fit with the project's ideals of open and freely usable models, BLOOM and Pythia have better equivalents depending on their sizes, and as per the model's license I cannot release a LLaMA fine-tune. Stability's model is atrocious, so I'm currently waiting on RedPajama to see if they can release something competitive. If they can, releasing a RedPajama-based Pyg model will be on my list of TODOs.

Thanks for your response @11b , I didn't realise the context on the normal T5 was so small. There is Long T5 as well, which has a context size up to 16k, but I think that maxes out at 3B in terms of size, and I'm not sure those models have been FLAN'd. That might be a good or bad thing depending on how you look at it, since it means you could fine-tune it specifically for roleplaying -> https://huggingface.co/google/long-t5-tglobal-xl

In terms of LLaMA, of all the language models I've tried, it's definitely the best. The issue I see, though, is that even IF Facebook decides to release it officially, it's going to be released under the same license as OPT. So, are you still going to train it in that case? As for releasing a LLaMA fine-tune, so far it hasn't really been a problem for projects, as people have been releasing their fine-tunes as diffs or LoRAs so they don't include the original weights. If the project ever does get to the point where a LLaMA based version is trained, I would suggest starting from one of the existing fine-tunes. LLaMA on its own is like T5 on its own: it needs a bit more training to really wake it up and start getting amazing results.
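As a rough illustration of the "diff" idea mentioned above, the following Python sketch uses plain lists as stand-ins for real weight tensors. The function names, the parameter name `layer0.weight` and the numbers are all hypothetical, and real releases work on full model checkpoints rather than toy dictionaries.

```python
def make_diff(base, tuned):
    """Per-parameter delta (tuned minus base); this is the file that gets shared."""
    return {
        name: [t - b for b, t in zip(base[name], tuned[name])]
        for name in base
    }


def apply_diff(base, diff):
    """Rebuild the fine-tuned weights from a locally held base plus the delta."""
    return {
        name: [b + d for b, d in zip(base[name], diff[name])]
        for name in base
    }


# Toy weights; the values are chosen to be exact in binary floating point.
base = {"layer0.weight": [0.5, -1.25, 2.0]}
tuned = {"layer0.weight": [0.75, -1.0, 2.0]}

diff = make_diff(base, tuned)
print(diff)                     # {'layer0.weight': [0.25, 0.25, 0.0]}
print(apply_diff(base, diff))   # {'layer0.weight': [0.75, -1.0, 2.0]}
```

Anyone who wants the fine-tuned model still has to obtain the base weights separately and apply the delta themselves. A LoRA serves a similar purpose but stores a small low-rank update rather than a full-size delta.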

@11b I understand you don't want to publicly release the dataset due to privacy concerns, and this isn't your full-time job, but have you considered donating or selling the dataset to a trustworthy organization capable of using it for fine-tuning newer models?

"It just feels like you're keeping the dataset to yourself so no one else can produce a competing model"

Keep getting thrown against us despite how open we try to be about everything is pretty discouraging.

there is a reason this is getting thrown at you, because in effect it's true. No one cares about how open you are with your communication, what we care about is the dataset.

From an outsider's perspective: you're sitting on an oilfield, you aren't capable of making use of it yourself, and you refuse to let anyone else make use of it.

Pygmalion org
•
edited Apr 23, 2023

I understand you don't want to publicly release the dataset due to privacy concerns, and this isn't your full-time job, but have you considered donating or selling the dataset to a trustworthy organization capable of using it for fine-tuning newer models?

Selling? No, I got the data for free so I feel like that'd be an asshole move. Donating? Yes, and I've already done so to 11 different people who have all said they were going to use it to train and release LLaMA/RWKV/whatever else-based models. None of them have actually released anything to date. And yet, false accusations like:

you're sitting on an oilfield [...] and you refuse to let anyone else make use of it.

there is a reason this is getting thrown at you, because in effect it's true.

Keep getting thrown around. Having to defend myself against this over and over is a waste of time so I'll just close this discussion - everything that needs to be said from my side has already been said. Hope you understand where I'm coming from.

11b changed discussion status to closed
