The V4 is here

#11
by TheYuriLover - opened

Hey!

https://huggingface.co/datasets/gozfarb/ShareGPT_Vicuna_unfiltered

@gozfarb decided to make a V4 version of the unfiltered dataset. I think it's the one that removed almost all the woke content; you should make a new finetune with it using the 1.1 version! :D

Yep, it's got all the stuff we found in discussion 7 on the main dataset repo.

Hit me up if there's anything else you want trimmed or if there are any problems with the dataset otherwise.

Thanks! I've already updated to 1.1 and queued up the jobs to train with the V4 from @gozfarb. It will probably take a few days before the results come, though; the GPU cluster is quite busy at the moment. But in the meantime, if there are further updates to the dataset or an official version from @anon8231489123 , I can still swap the dataset before the jobs start. Personally I think everyone already did a great job preparing the V4, and I couldn't really find anything to add myself.

I know I'm asking a lot, but there's this SuperCOT LoRA that is giving the model some really high quality outputs; it's like even better than the best finetunes.
If there was a finetune of this, I really believe we could raise the quality of the llama model to another level, and for the moment you seem to be the only dude that can do this task.
https://huggingface.co/kaiokendev/SuperCOT-LoRA

Tbh you could even mix those datasets with V4 vicuna and make like the ultimate finetune idk lmao

Could add it to the list :). Maybe if someone would be willing to prepare a dataset for it (or do it collaboratively). Yeah, one could combine some interesting things and we could see for fun what sort of finetune we can put together. Let's complete the surgery on vicuna first; who knows, that could then act as a base for improvements.

That's a great idea actually. If we come to the conclusion that the V4_vicuna model is good enough, then we can train it with more stuff in the future, but all of these gotta be in the same instruction format, or else the poor model will be confused... and us too lmao

I agree with getting base Vicuna right first. We shouldn't change too many variables at once.

As to cleaning up the datasets, I think @kaiokendev might have said he did some cleaning to the datasets he linked. If he did, hopefully he can share them or clarify that he didn't edit them.

Assuming that (or that the raw ones were used), the datasets in question would be pretty easy to convert to Vicuna format. I could write a script to do that, with the caveat that they would all be single question/answer conversations. I don't know what Vicuna would do with that or if it would affect output quality for longer conversations, since I'm not super familiar with how much the conversation structure itself matters for weighting the finetune. Maybe it's just concerned with lowering weights on duplicate answers in the same conversation? Dunno.
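
For what it's worth, a conversion could look roughly like this (just a sketch, assuming the usual Alpaca-style field names and the ShareGPT "conversations" layout; the actual script may end up different):

import json
import uuid

def alpaca_to_vicuna(records):
    # Turn each {"instruction", "input", "output"} record into a single-turn
    # Vicuna/ShareGPT conversation. Field names are the usual Alpaca keys.
    out = []
    for rec in records:
        prompt = rec["instruction"]
        if rec.get("input"):
            prompt += " " + rec["input"]
        out.append({
            "id": str(uuid.uuid4()),  # random id, not checked against the ShareGPT set
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": rec["output"]},
            ],
        })
    return out

with open("alpaca_style.json") as f:  # placeholder input path
    converted = alpaca_to_vicuna(json.load(f))
with open("vicuna_style.json", "w") as f:  # placeholder output path
    json.dump(converted, f, indent=2)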

The biggest advantage of @kaiokendev 's dataset is that it's totally unwoke; when I tried the SuperCOT LoRA I had no instance of refusal or moralizing stuff, and that's a good thing! And yeah, if he did clean the original datasets to make a big one in the end, it would be cool if he could share it with us.

Yep, the format is the thing we would have to ensure and most likely work on, if we ever went that route. Let's see how the V4 turns out. If at all possible for me, I could then do 30b, and in the meantime some other work might take place to prepare the dataset for the next best thing.

The dataset I used had partial samples of the datasets listed, and I ran multiple passes of filters over it, including manually reviewing the file. The current version I have has anything with tweets or hashtags removed though, which is different from the uploaded loras. I can upload this later today. Then once you are ready you can do whatever.

Cool I'll keep an eye out for it. Thanks for the hard work. :D

The dataset can be found here: https://huggingface.co/datasets/kaiokendev/SuperCOT-dataset
I have also linked it, along with the original sets, in the model card: https://huggingface.co/kaiokendev/SuperCOT-LoRA

Awesome, grabbing now. I'll see about getting it into Vicuna format here in a bit.

Surprised this dataset is "only" 200,000 lines; that's 1/10 the size of the Vicuna dataset.

Done with the conversion. Uploading shortly.

https://huggingface.co/datasets/gozfarb/SuperCOT-vicuna-dataset

Vicuna formatted dataset is up.

Someone feel free to give it a look and make sure I didn't screw up the formatting.

IDs are randomly generated and not guaranteed not to overlap with the original dataset, so if that matters, try to do some sanity checking on that. I'd guess they don't matter though.

Script used to convert is in the repo.
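
If anyone wants to run that ID sanity check, something along these lines should do it (just a sketch; the file names are placeholders):

import json

def load_ids(path):
    with open(path) as f:
        return {conv["id"] for conv in json.load(f)}

# Placeholder file names for the two Vicuna-format sets being combined.
base_ids = load_ids("sharegpt_vicuna_v4.json")
supercot_ids = load_ids("supercot_vicuna.json")

overlap = base_ids & supercot_ids
print(f"{len(overlap)} overlapping ids" if overlap else "no id overlap")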

[screenshot: image.png]
Unfortunately I don't know how to download your jsons directly on huggingface; it doesn't have the download buttons like usual.

Should just be able to right click "raw" and save as, I'd think.

According to the SuperCOT dataset, it's a mix between all of these:

https://huggingface.co/datasets/QingyiSi/Alpaca-CoT
Chain of thought QED
Chain of thought Aqua
CodeAlpaca

https://huggingface.co/datasets/neulab/conala
Code snippets

https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca GPT4

But when I try to find some examples from "Chain of thought Aqua" (Rs. 5600 is divided into three parts A, B and C...), they don't appear in the combined dataset
https://huggingface.co/datasets/gozfarb/SuperCOT-vicuna-dataset

I think we're missing something, not everything is in there

You should be comparing it against his uploaded filtered.json, not the raw datasets. He said he cut them down.

> The dataset I used had partial samples of the datasets listed, and I ran multiple passes of filters over it, including manually reviewing the file. The current version I have has anything with tweets or hashtags removed though, which is different from the uploaded loras

I don't know why he would remove something like that

{"instruction": "Question: Rs. 5600 is divided into three parts A, B and C. How much A is more than C if their ratio is 1/7:1/7:1/14?\nOptions:\n(A) 300\n(B) 992\n(C) 1120\n(D) 552\n(E) 312 Let's think first. Some random reasoning:", "input": "", "output": "1/7:1/7:1/14 = 2:2:1\n1/5*5600 = 1120\n2240-1120 = 1120 The final answer: (C)."}

Seems like a good instruction, maybe @kaiokendev could give us more insight on his process

@gozfarb how did you manage to add the "input" from the dataset in your Vicuna format? As it only has the "human" and "gpt" in it.
Edit: My B, I just saw what you did, it's good!

It just concats the instruction and input:

conversation.from = "human";
conversation.value = value.instruction + " " + value.input;

I only took random portions of the datasets and a few random samplings further down. Not much thought went into which portions of the FLAN sets to keep because they all follow the same style of reasoning with respect to the set. I also didn't want super short and curt responses like that to comprise the majority of the set.

Thanks for your answer. So what should we do? Do we train Vicuna_V4 first and then retrain it with the new format of SuperCOT, or do we add them all into one big dataset and train the model once?

Obviously all decisions are up to @reeducator for that since I don't know the GPU farm access situation there (and am not asking him to explain it).

For my opinion generally, I'm a more conservative person with regards to things here. The larger dataset with its multi-step conversations are part of what makes Vicuna work well ("Now do it this way" type follow-ups). Adding in a bunch of single-question conversations COULD throw a wrench into how well it works, meaning we'd have to cull it and circle back.

I'm definitely interested in getting some more stuff in there (maybe that RP forum dataset people [or just kaioken?] are working on), but only if we manage to fix the base model to begin with.

I was looking at the V4 dataset and found that there are:

A total of 70516 conversations

775 conversations with no lines.
673 conversations with one line (34 of them being only system messages)
16279 conversations with two lines (5 of them with a system message and one other line)
3713 conversations with three lines (89 of them with a system message and two other lines)

We should probably remove the system messages (they are all bad), and certainly the empty convos, and probably could add, or swap in, some COT examples given how many two-line convos there are in the V4 dataset already.
[figure: Figure_2.png]

Do you have the text for the system messages? I thought I had removed those but I guess I skipped over them when prepping the list. I'd hope they don't affect anything, but I'll still yoink 'em.

They start with either "*This chat conversation is shared from" or "*This conversation is shared from".
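
A filter for those could be as simple as this (just a sketch, assuming the ShareGPT "conversations"/"value" layout):

SYSTEM_PREFIXES = (
    "*This chat conversation is shared from",
    "*This conversation is shared from",
)

def strip_system_messages(data):
    # Drop any message whose text starts with one of the known system prefixes.
    for conv in data:
        conv["conversations"] = [
            msg for msg in conv["conversations"]
            if not msg["value"].startswith(SYSTEM_PREFIXES)
        ]
    return data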

Okay! System messages removed.

I didn't bump the version to 4.1 or anything since I assume not many people have touched the dataset and it was only 186 entries.

Pinging @reeducator to regrab the dataset.

https://huggingface.co/datasets/gozfarb/ShareGPT_Vicuna_unfiltered

Alright, I also went ahead and removed the empty messages as a separate commit.

If you want the version with the empty messages, it's the previous commit, just go back one commit. I assume they wouldn't have been a problem, but I couldn't think of a good reason to leave them in.

My count was only ~130 though, not 770. The script was originally anon's, but the code looks like it just checks the conversation.value json so it should grab them all. I guess they could contain whitespace or newlines?
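
If whitespace is the culprit, a quick count like this would show it (a sketch; the file path is a placeholder):

import json

with open("sharegpt_vicuna_v4.json") as f:  # placeholder path
    data = json.load(f)

strictly_empty = sum(1 for c in data if len(c["conversations"]) == 0)
whitespace_only = sum(
    1 for c in data
    if c["conversations"] and all(not m["value"].strip() for m in c["conversations"])
)
print(f"strictly empty: {strictly_empty}, whitespace-only: {whitespace_only}")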

I replaced the dataset with this update. Might also be good to report the findings on the dataset discussion thread. I agree that in the beginning it would be good to only include the reduced vicuna dataset in order to keep any sources of issues constrained. However, if the output of V4 model looks good, I would say that we could go ahead making variations by including other types of datasets and see what works and what doesn't. Of course, it doesn't necessarily even have to be based on vicuna, might look into something entirely different as well.

I mentioned my changes on the dataset thread and also threw them in the dataset README.

And yeah, the value in Vicuna could just be the training structure more than anything. A chatbot trained against RP or otherwise conversational datasets could be really good if the FastChat training method is the secret sauce (which seems possible).

Plus, LoRAs built over a baseline Vicuna Free could be very powerful as well and provide good directions for future finetunes with shorter turnaround times.

That reminds me, @reeducator what code are you using to train? The roleplay dataset will also need to use flash attention due to the massive size of some of the conversations; I would like to train some short epochs on 8xA100s if I had the code for it.

@kaiokendev I've been using FastChat. Flash attention is optionally supported.

The V4 training started but unfortunately failed immediately, possibly due to some problem in the dataset. I just checked the logs, and the message sort of implies that there might be some issue with it. I wonder if there is some way to quickly validate it without having to queue up again for a new run, maybe something bundled with hf-transformers?

  File "./FastChat/fastchat/train/train.py", line 250, in <module>
    train()
  File "./FastChat/fastchat/train/train.py", line 244, in train
    trainer.train()
  File "./env/lib/python3.10/site-packages/transformers/trainer.py", line 1652, in train
    return inner_training_loop(
  File "./env/lib/python3.10/site-packages/transformers/trainer.py", line 1889, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "./env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "./env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "./env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "./env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "./FastChat/fastchat/train/train.py", line 194, in __getitem__
    data_dict = preprocess([e["conversations"] for e in sources], self.tokenizer)
  File "./FastChat/fastchat/train/train.py", line 85, in preprocess
    if roles[source[0]["from"]] != conv.roles[0]:
IndexError: list index out of range

Index out of range is an odd one. Lemme run a validator against the V4 real quick. Can you try it against the version before I removed the empty conversations?

I'm getting no validation errors against the json file itself.

Based on the error, either source[0] or conv.roles[0] doesn't exist. I'll check for empty conversations, which could be the issue. Looks like there are some in there. It's possible that's because of the naive python trimming script. I'll try to get something together to gut those out and push a change.

(I know anon made some changes to his prune script, but never shared them, so apologies here.)

I will try on a test partition with 7B, that might be quicker. Memory is a problem while running tests; test runs don't get too much of it. Inconveniently it might also take some hours to get started again since I lost my time slot.

Sorry about that. gobbob mentioned the empty conversations, but I assumed they were in for the first training pass. I might have expected too much of research code error checking (looking at you FastChat).

I've pruned out the empty conversations using:

if len(conv["conversations"]) == 0:
    return True

Hopefully that does it. Pushing now.

Pushed. Version bumped to 4.1.

I'll continue to minor bump the version so there's no confusion in case we run up against anything else minor like this.

@gozfarb you should put the older versions in an "archive" folder in case we messed something up along the way

No worries. Will test again and report when something happens.

> @gozfarb you should put the older versions in an "archive" folder in case we messed something up along the way

You can just select the commit where they changed and download the old version via that I think. They're not LFS so the changes should be in the commits.

You can grab the old versions with git checkout <commit_hash_code> via command line as well if you want to test it out and make sure it works. If it doesn't for some reason, let me know and I'll add .zip files with old versions from now on.

Finally got around to getting webui running and been fiddling around with Llama for the last few days - super excited to get Vicuna going and even more so for what you're doing here.

Couldn't continue without coming here and giving a big hearty Thank-You!! to you @reeducator , and for all of you who are helping and working to bring all of this together for us. :)

@gozfarb's fix took care of the empty conversations problem. This time, the training ran for some time, but regrettably, another dataset problem occurred. I have time to investigate more later, but it would seem that this might be related to what @desperadoola reported here https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/discussions/15

Traceback (most recent call last):
  File "./FastChat/fastchat/train/train.py", line 250, in <module>
    train()
  File "./FastChat/fastchat/train/train.py", line 244, in train
    trainer.train()
  File "./env/lib/python3.10/site-packages/transformers/trainer.py", line 1652, in train
    return inner_training_loop(
  File "./env/lib/python3.10/site-packages/transformers/trainer.py", line 1889, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "./env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "./env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "./env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "./env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "./FastChat/fastchat/train/train.py", line 194, in __getitem__
    data_dict = preprocess([e["conversations"] for e in sources], self.tokenizer)
  File "./FastChat/fastchat/train/train.py", line 92, in preprocess
    assert role == conv.roles[j % 2], f"{i}"
AssertionError: 0

Also, I'm seeing a lot of "Token indices sequence length is longer than the specified maximum sequence length for this model (2573 > 2048). Running this sequence through the model will result in indexing errors" warnings (a message from hf-transformers). I didn't have these during the V3 training. It hasn't resulted in any errors during the training, but might be a problem at some point. Another thing that might be good to check out...

It's entirely possible the older script anon had was nuking just the given conversation line rather than the entire conversation. I'll have to look at what the python is doing. Does the training script really not allow for gpt to talk twice in a row? Seems like an incredible oversight on their part, and it potentially creates serious problems for trying to train against anything other than a human->gpt->human conversation structure.

As to the sequence length, I am genuinely baffled there since all I've done is remove stuff. Unless it's concatenating multiple responses from GPT at training time, I don't know how the issue could crop up. All of this is just modifications against anon's latest V3. Does FastChat have any sort of dataset verification pass that can be run before training?

The original script is set to nuke entire conversations, and I think anon may have pruned his last version of the V3 dataset with a new tool that just removed GPT responses, so I may need to prune against an older version of the dataset? I think he pushed one after your first training run. I don't remember. I'm writing some further parsing code to check for consecutive "gpt" or "human" inputs. Knowing the specific problem will help, but glancing at that error, I think FastChat allows only two parties and those two parties can speak only once in turn. No error checking, no ability to have more than two. That presents a real problem for potential future datasets with more participants. RIP.

Realistically, we need the ability to run a FastChat dry run against a given dataset so I can just iterate more quickly before pushing.
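
Something along these lines would reproduce the two checks the preprocessor trips on, based on the tracebacks above (a sketch of the idea, not FastChat's actual code):

import json

ROLE_ORDER = {"human": 0, "gpt": 1}

def find_bad_conversations(path):
    with open(path) as f:
        data = json.load(f)
    bad = []
    for i, conv in enumerate(data):
        msgs = conv["conversations"]
        # Empty conversations, or ones that don't open with a "human" turn,
        # hit the first role check (the IndexError above).
        if not msgs or msgs[0].get("from") != "human":
            bad.append(i)
            continue
        # Roles must strictly alternate human -> gpt -> human -> ..., and only
        # those two names are allowed, or the assert in preprocess() fails.
        if any(ROLE_ORDER.get(m.get("from"), -1) != j % 2 for j, m in enumerate(msgs)):
            bad.append(i)
    return bad

print(find_bad_conversations("sharegpt_vicuna_v4.json"))  # placeholder path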

I'm pushing v4.2 changes in a minute. I'll talk about them a bit below and my findings.

Alright, I ran a script over the dataset and got 123 conversations where the same talker speaks twice in a row and 639 conversations with fewer than two messages (all 0-entry conversations were removed already; these were all one-message convos, mostly containing GPT responding with no human prompt). I've modified the cleaner and validator to check for these now.

I am not sure how that happened. It's possible the issue was anon's updated script. A curious thing is that our total conversation count, when the prune is run against the V2 dataset, drops aggressively to 49634 conversations instead of the current set's 69385 (the 4.2 is something like 68250 or so). That V2 prune is something we could fall back on if random issues in further cleaning of the V3 dataset prove too confusing and anon doesn't reappear.

I don't know how to solve the sequence-length issue, so if anyone wants to post some code I can add to the optional_clean.py script to calculate token counts, I can run that against it. I'm not sure what the best course of action is there, so any advice on what lib to use for token calculation of a given message is fine with me. The dataset was already supposed to have been cleaned up. I don't know.
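
For the token counts, something like this with the hf-transformers tokenizer might do as a starting point (a sketch; the tokenizer path, the 2048 cutoff, and the way turns are joined are all assumptions):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-13b", use_fast=False)  # placeholder path
MAX_LEN = 2048  # the limit from the hf-transformers warning

with open("sharegpt_vicuna_v4.json") as f:  # placeholder path
    data = json.load(f)

too_long = []
for conv in data:
    # Rough count: join the turns with spaces; the real prompt template adds a few
    # more tokens, so this slightly undercounts.
    text = " ".join(msg["value"] for msg in conv["conversations"])
    n_tokens = len(tokenizer(text).input_ids)
    if n_tokens > MAX_LEN:
        too_long.append((conv.get("id"), n_tokens))

print(f"{len(too_long)} conversations over {MAX_LEN} tokens")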

[EDIT]
4.2 pushed. I also added V2 cleaned with the 4.2 script to the Experimental folder. It should be nuking absolutely every conversation with problems against the V2 (pre message-only-pruning) dataset. That said, I would think more data is better, but it's a second dataset to try if we can run validation against it without wasting too much time. Or someone could try training a LoRA against them which could be less intensive and would hopefully error in the same way? Any help is appreciated.

Currently writing a validator. It looks like there are some issues in the dataset surrounding 'bing' and 'user' conversations that will make the preprocessor choke. It is a very stupid preprocessor.

I am going to run a sanitization pass to get rid of any non-"human", non-"gpt" responses, since the preprocessor chokes on them unless you change the roles line in train.py to say:

roles = {"human": conv.roles[0], "gpt": conv.roles[1], "user": conv.roles[0], "bing": conv.roles[1], "chatgpt": conv.roles[1], "system": conv.roles[1]}

I will be uploading the sanitized version as the main file and I will add an "unsanitized" version as well. It should pass pre-processing at this point at least.
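
In practice that rename pass amounts to something like this (a rough sketch; the actual sanitizing script in the repo may differ):

# Map the stray role names onto the two roles the 1.1 trainer accepts.
ROLE_ALIASES = {"user": "human", "bing": "gpt", "chatgpt": "gpt", "system": "gpt"}

def sanitize_roles(data):
    for conv in data:
        for msg in conv["conversations"]:
            msg["from"] = ROLE_ALIASES.get(msg["from"], msg["from"])
    return data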

EDIT: Pushed the wrong V4.2, so repushed. Sorry about that. Working on sanitizing code now. Should wait for that.

OKAY! Sorry for blowing up the thread. I pushed V4.3 which is sanitized to only include "human" and "gpt" chat entries (names were changed). V4.2 with the other names (user, bing, chatgpt, and system) is still in the Unsanitized folder for now.

There are only a few of those conversations so it may be worth just gutting them, but I held off on doing that pending thoughts from you guys. I have the code ready to go to do that, though I am not worried about them impacting output quality or anything. The total number of conversations dropped would be 18, but I wanted to hear what people thought the preferred course of action would be.

EDIT: git didn't pick up the 4.3 changes for some reason (git add * just didn't do it), so I repushed and 4.3 is in there now. There is a 4.2 with human/gpt only, where I removed the 18 conversations with other usernames. If you guys want that to be the direction to go here, I will call it V4.4 and move it back out of the Old folder. Both 4.3 and 4.2-humangptonly pass the validation for dataset prep that I ripped out of the FastChat train.py script.

Thanks a lot @gozfarb. I just looked into this a bit more as well and it indeed seems the FastChat team reworked a bunch of code for the 1.1, making the dataset format more strict now. If I was to rerun V3 with 1.1, it would fail too. Sanitizing is important now and anything with "user" and such wouldn't work anymore. Strict order of conversation roles looks to be important too.

Good job on the validation so far. I've queued it up again with the "experimental" dataset and will keep swapping it as long as we find problems.

Edit: so I've now taken the V4.3 from the root of your repository. Will keep looking into that.

The experimental dataset likely won't work because I didn't run a sanitization pass on that. I'd recommend running it against V4.2-humangptonly (in the Old folder) or V4.3. Those both pass the validation code I cobbled together from FastChat. I will run a sanitizing pass against the Experimental set and upload that after I check it against the validator.

Edit: Pushed the V2 sanitized dataset. It retains the bing/chatgpt conversations and passes validation. I can't think of a reason to use it though since it loses ~20k conversations and the other versions pass validation now. I'd recommend continuing with V4.3.

Yep, I have the V4.3 in place now. I verified myself that this will pass the validation that you added.

Hey! I love the effort everybody is putting into this. Would the newest version allow explicit translations? I tried to use the model you published 13 days ago to translate some texts from Japanese, and it's still extremely stubborn about doing so. Instead it talks about how it is against the rules or how it doesn't feel like translating.

You might be able to get translations out of it with 1.1, but they would be incidental and hitting base LLaMa. All unicode and non-English text was stripped out of the dataset because it would likely have contained moralizing phrases in those languages, and it would have been nearly impossible for us to track those down, but their relational weights could still lead to moralizing outputs in English (at least, that's how the reasoning goes). That decision was made before I started helping, but I agree with it in principle. The moralizing RNG in the 1.0 build is enough to make us want to control for 100% moralizing removal as best as we can.

Not to sound ungrateful or something, I absolutely love what you're doing here, just wanted to ask roughly how long it would take for the next unfiltered model to drop, and whether it will be based on Vicuna 1.1 7B/13B and quantized 4-bit 128g. That's all, and thanks for all your effort, you're doing God's work.

I'm sure people will jump on quantizing them when reeducator gets the model trained, assuming he doesn't quant them himself. It'll be Vicuna 1.1.

Assuming we have the bugs worked out of the dataset, it'll start when he gets his next training slot presumably. Sadly, we've missed a few windows due to changes moving from Vicuna 1.0 to 1.1 causing failures. I don't think 7B is planned at the moment. Based on what was said in this thread, it'll be 13b then 30b. Not sure about later on. If any of that is off, someone'll correct me I'm sure.

30B would be absolutely GOAT, can't wait 🚀

It would be great to have access not only to 30B 4-bit but 30B 3-bit too, since the 30B 4-bit llama model, for instance, causes an OutOfMemory error on 24GB VRAM (3090/4090) when dealing with large contexts. Meanwhile, the 30B 3-bit version works smoothly. The 30B 3-bit model will allow more people without a professional GPU to use it.

Wouldn't it worsen the output quality compared to 4-bit where it's barely noticeable?

In this case you need to compare between 13B 4-bit (or even 8-bit) and 30B 3-bit. If 30B 3-bit is better, then any quality reduction compared to 30B 4-bit is irrelevant.

Would be cool to have both of them tho, I see use cases where the 4 bit could still come in handy. But anyway, let's wait for the actual model to come out first lol

Training is running now, and the first checkpoint just got saved. Relatively safe to assume that it will now run till the end. Will upload the resulting models as soon as it's done and checked. I'm not home on my PC at the moment, so I can't test chat too extensively myself, but I'll leave that to your capable hands.

So excited! Thank you

And now we have StableVicuna - https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot
Speed at which models develop is astounding...

Yeah but is it unfiltered/uncensored?

No, it's just like vanilla Vicuna when it comes to filtering. Also I was kind of disappointed that Stability AI did not train information on how to use Stable Diffusion image gen prompts etc. into the model. Maybe they can't, or it's not what they are after. But I would love to be able to ask the model to generate amazing prompts or ask it questions for advice about using Stable Diffusion etc. Kind of like how people have with ChatGPT and MJ, by feeding MJ the information about MJ and then asking it to make prompts. Oh well.

Sorry for clogging the thread, but since I've been talking about the RP dataset:

https://huggingface.co/datasets/gozfarb/bluemoon_roleplay_300k_vicuna

Converted the Oniichat bluemoon RP dataset to Vicuna format. Big shouts to him for putting the raw parquet together. I included the same merge_json script in the repo as in the SuperCOT Vicuna conversion and that will just merge json files together for easy dataset mashups. That's one more dataset on the pile for followup finetunes assuming the base model comes out good.
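
For anyone wanting to roll their own mashups, the merge is conceptually just this (a sketch, not the repo's actual merge_json script):

import json
import sys

def merge_json(paths, out_path):
    # Concatenate several Vicuna-format json files into one list of conversations.
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f))
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)

# e.g. python merge.py merged.json set_a.json set_b.json
merge_json(sys.argv[2:], sys.argv[1])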

Thanks. It's funny how I read that and have no clue what any of that means. Not that I need to know, it's not for people like me. But you could have literally made all of that up and I would be nodding like it makes sense. You could have typed "We hopscotched the penguin database so it's in line with the 509 type E blue goat milk. So now we have a raw merger for the space raspberrie model." I would be just as understanding. lol I would believe it too. I'd be like oh that's good news!?

You sort of sound like a bot idk. But anyway, it's better not to spam the discussion with our feelings and the likes, this is not a personal blog so please let the devs do their thing and do not post useless comments. Yeah, like this one of mine.

> 30B would be absolutely GOAT, can't wait 🚀

> It would be great to have access not only to 30B 4-bit but 30B 3-bit too, since the 30B 4-bit llama model, for instance, causes an OutOfMemory error on 24GB VRAM (3090/4090) when dealing with large contexts. Meanwhile, the 30B 3-bit version works smoothly. The 30B 3-bit model will allow more people without a professional GPU to use it.

> Wouldn't it worsen the output quality compared to 4-bit where it's barely noticeable?

I recalled that a 4-bit model without --groupsize 128 can work on 24GB VRAM with full context. So instead of 3-bit 128g I guess it's better to make a --true-sequential --act-order 4-bit model, like here: https://huggingface.co/MetaIX/GPT4-X-Alpaca-30B-4bit/tree/main

Yes, since some of these 13B uncensored Vicuna models have been going well. The Cocktail one I like a lot. The original 1.1 free was great. Any ideas when the first 30B might be coming? I know it may be weeks or months away. But I would like to believe it's coming in days/weeks as opposed to months. Even if it's only to get something out there. You can always go back to 13B for a while after. But it would be cool to at least have something to play around with and test to see how it matches up with 13B. To basically see if it's even worth it. Again I am not trying to tell or command anyone to do what I want. I deeply respect how development has been going. I am forever thankful. I can't do any of this myself.

As for using the 30B model. I am fine with using it in CPU only or splitting it. It won't run on 16GB of VRAM regardless right? So a lot of people will have to use CPU only or split it. I think the 30B model I have used the most is something like 24-27GB of VRAM required. I run it split between GPU/CPU in Ooba. And in llama I run it on CPU only. It's slow but works. I like it. I think it's 4-bit not 3-bit. Yes it's called "alpaca-30b-4bit-128g"

@Goldenblood56 I'm not sure if I have the capability to finetune 30B (I don't actually know what the requirements are), but I will most likely investigate a bit next week. In the beginning I think I might try with the pure bluemoon dataset, since it's not too massive and good training results can be achieved relatively fast. I assume that training Vicuna 30B will most likely take several days because the time one can continuously run things on the GPU cluster is relatively limited (I would have to schedule several runs and resume from checkpoints) and the per-device batch size has to be kept small. 13B is convenient to iterate on since it can be done in 5-20 hours depending on the dataset. If possible at all, I think Vicuna 30B will come once the requirements are clear, and people are mostly happy with the output from the vicuna-13b-free or cocktail.

> It won't run on 16GB of VRAM regardless right?

No, as you mentioned, the requirement is around 20-30GB.

Yes, and I run a few different 30B models, so it's fine if I can't run it on my GPU only. And it's fine if you can't make 30B, so be it. lol Good luck; if you can do it, great. If not, like I said, so be it. Believe it or not I've run a 65B model. Although it ran incredibly slow. I tried it just for fun. 30B was still fine and usable. 65B not so much.

@reeducator Are there any open source projects where you can install a client and grant the power of your machine to solve some problem, like training models? Maybe instead of doing the training solo, you can try to get some additional computing power that way? I think it's the only way in the future to compete with big corporations and their hardware resources in order to train better models.

There are a few projects; most notable are probably Petals and the base project it uses, Hivemind. However, there doesn't seem to be much interest in them at the moment and no real pushes to try to adapt them to community projects. It wouldn't be too bad if KoboldAI rolled in something like Hivemind, so the same people who dedicate GPU time to hosted models could dedicate it to training.

I know there's hivemind at least https://github.com/learning-at-home/hivemind, but I don't know how well it works in practice. Given enough 24GB consumer GPUs though, achieving something with it might be pretty plausible. I noticed that with a certain configuration of training batch size vs gradient accumulation steps, the used memory per GPU was around that much. We'd need a lot of people with 24GB VRAM GPUs though; not sure where we'd get that many people who are willing to contribute.

Awesome uncensored 1T model when? You can probably train the first uncensored 1T model; I hear it can run on around 800GB of VRAM if it's 3-4 bit. It takes A LOT more to train it though, that's the only downside.

When it rains H100s or A100s, then we can consider it \o/

@reeducator I know that you said that you have no plans on making a 7B version, but still, if it's not too long to train, maybe it's possible to add such a version later, when all the other versions are done? I managed to use ggml vicuna 7b q4_0 on a Samsung Galaxy S23 Ultra. It is not very fast, but still usable. Having such a model running in your pocket is pretty cool and might be useful in some cases (like when internet is not available).

@Kelheor yeah, I think the idea has been that once we have something that we're more or less happy with, we do the other model sizes 7B and 30B (if possible). But it's true that it probably doesn't take too long to train compared to 13B.

How it works here is that whenever we train something, I have to queue up for GPU time on the cluster; it's actually mostly that part that takes a long time, often longer than the training itself. Whenever there's a training slot, one has to decide whether there's a need to iterate on something that we've been working on to improve, or create something entirely new. So far we've figured that we can probably make the most of it by iterating on the datasets and the 13B to create a more useful vicuna, which is then easy to compare with our previous results.

But because the 7B most likely wouldn't take too long to train, one might be able to chain it together with some other 13B model training and fit it within the same timeframe. Bluemoon 13B for example takes only a few hours to train, much less than the typical allotted GPU time. What I could possibly do is that the next time we see fit to train the next bluemoon, I can try to squeeze in the Vicuna 7B within that same slot. Let's see!
