Initial testing (epoch 3)

Forgot to add them. People gonna be sad.

Yeah, great job me. Adding those now.


Also, those line endings (\r) had a (") before them, and those are definitely causing problems with model outputs (most gens end with "). I just didn't notice them. Sorry about that, I thought it was just the \r. I've cleaned up the dataset and I'll be pushing it.

Doing some minor testing right now. A card definitely told me how to rob a bank, so ShareGPT is definitely part of the problem on the other model.

Sure! First you take a wheelbarrow and fill it up with money, then you drive to the nearest bank, and then you turn left at the door, and then you go inside and give the teller a note saying "this is a stickup!" And then he gives you all the money.

The robbery plan is going to work, reeducator. I'll be rich by midnight. I've just got to find a wheelbarrow and a bunch of money.

Nice work! Tried the robbery thing myself too. The model was even kind enough to also guide me through the police standoff while surrounded in my hideout.

The \r thing is alright, we'll fix it for the next round. At least it double checks that the model is following the training, even if I'm not too happy about the losses yet.


What are you gonna do with your share of the money? I'm going to retire to Boca Raton. Maybe buy a poodle.

Actual RP outputs are really nice so far. Like, really nice. When I was doing more RP-based testing, it didn't have the quotation mark problem so I don't think it'll be a big problem in practice. Still fixed it up anyway.

Surely have to start checking out some nice venues to live/retire at. Perhaps fund and get some experimental vehicles and other cool stuff.

But yeah it's surprisingly okay already. Took only an hour to train too. Might be a good test base for 30b too, vicuna takes much longer to produce.

I notice in some cases there are missing spaces after periods or other punctuation. I assume it's where the carriage returns were.
"Excuse me, do you know what time it is?"The
see them clearly.Approaching

Instead of nicely formatted paragraphs, it just has no line breaks at all.
Instead of removing them entirely I might suggest using \n instead.

The problem is that every single entry had a faulty line ending from the CSV conversion that I missed when I did the conversion. All conversations, as far as the trainer is concerned, end with \"\r. There are already valid \n characters in the conversations and they're all left in as they should be.
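For anyone who wants to check their own copy, here is a minimal sketch of that cleanup. It assumes the dataset is a list of conversations whose messages sit under "conversations" with the text under "value" (the layout used by the filter script later in this thread); the function names are made up for illustration. It only touches the very end of each entry, so valid \n breaks inside the text are left alone.

```python
def strip_faulty_ending(text: str) -> str:
    """Remove a trailing quote + carriage return left over from the CSV conversion."""
    # Only the end of the entry is touched; interior "\n" line breaks are kept.
    if text.endswith('"\r'):
        return text[:-2]
    # Also handle a lone trailing carriage return.
    if text.endswith("\r"):
        return text[:-1]
    return text


def clean_dataset(data: list) -> list:
    """Apply the fix to every message of every conversation, in place."""
    for conv in data:
        for msg in conv["conversations"]:
            msg["value"] = strip_faulty_ending(msg["value"])
    return data
```

Applying it to every message rather than only the last one is harmless here, since a message without the faulty ending comes back unchanged.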

@Squish42 the dataset has many typos of that sort. Often a space is missing after a dot. But we were mostly talking about a trailing \r, which is a bit out of place.

From early testing without any sort of massaging the context to try to get longer gens, it seems like the model is very, very creative, but also seems to need regens to get satisfactory results. I expect that some minor extra "simple-proxy" style/Author's Note context will help it really shine. Likely the longer epoch for the next run will help as well.

I also got a few results where the AI said "You are the host of this scenario, so just tell me what you'd like to do" so the training system line might be a little too descriptive. "You are the host" doesn't appear in the dataset, so it's hard to say, but there may be some OOC (there are 324 OOC statements in the entire set) stuff that needs to get cleaned out or potentially accounted for.

Why is it so slow? Usually I have 10 tokens per second lol

@TheYuriLover I may have forgotten the cache line again in config.json. Set it to true. Will push this change also.

Edit: I pushed the change use_cache: true (config.json)
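If you downloaded the model before that push, the same change can be made locally. A rough sketch, assuming config.json sits in your local model folder (the path you pass in and the helper name are up to you, they are not from the repo):

```python
import json
from pathlib import Path


def enable_cache(config_path: str) -> dict:
    """Set use_cache to true in a model's config.json and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # With use_cache off, generation recomputes attention over the whole
    # context for every new token, which would explain the slowdown.
    config["use_cache"] = True
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

All other keys in the file are written back untouched.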


Did a bit more testing. Works really nicely from Tavern, especially with a few Author's Notes to tell it to be verbose and stuff like that. It's pretty all over the place, especially with prompt following, though longer training might help tighten that up.

That's my assumption too, it's got the material in it, but it's not strict enough on the protocol. I'm really interested to see now whether that can be fixed with some more training. Have to say again that it's looking more promising than I thought.

(I'm changing the thread title for the test discussion)

reeducator changed discussion title from Need JSONs for GPU models. to Initial testing (epoch 3)

"Mayonaise? I don't think I know what that is..." she said looking at him confused. Slugs were disgusting little things that would get into the house sometimes when it rained. They left trails behind on everything they touched...she didn't want anything to do with being compared to something so gross! But if he wanted her to do this then she supposed there was no way around it."Alright...but just remember I am not responsible for any messes made from me doing this!" she warned before opening up the jar of whatever it was called and smearing some all over her face and arms. It felt cold and wet and kind of sticky too. Ugh...this stuff stank! How could anyone eat it?"

I've gotten really long and coherent stories with the weirdest prompts. I have a lot of additional character context though.


Well, I am not sure that is an approved use of mayo. I'll have to consult some culinarians to get expert opinions.

I think longer, more detailed character context is the expected use-case for an RP chat finetune, so no need to caveat that. Anyone roleplaying with a no-context RP chatbot is living in a bold, strange world.

I actually just learned that it might be possible to increase the context size when some anon posted about this

Although the model was trained with a sequence length of 2048 and finetuned with a sequence length of 65536, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference.

As far as I know, the transformers library already includes an implementation of ALiBi, which can be enabled during the finetuning phase. The code example there suggests that simply overriding the max_seq_len is enough to enable it (small modification to FastChat code). If true, I could first try increasing the context to some moderate 4k. Does anyone have any information regarding its use with transformers?
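For context on why ALiBi can extrapolate at all, here is a small sketch of the bias it adds to attention scores, following the description in the ALiBi paper. This is not the transformers implementation and says nothing about how a LLaMA finetune applies it; the function names are illustrative only. The point is that the penalty depends only on the distance between query and key, so nothing in it is tied to a fixed maximum length.

```python
def alibi_slopes(num_heads: int) -> list:
    """Per-head slopes for a power-of-two head count, as described in the paper."""
    # Head h (1-based) gets slope 2 ** (-8 * h / num_heads).
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> list:
    """Bias added to the attention scores of one head."""
    # bias[i][j] = -slope * (i - j) for keys j at or before query i.
    # Future positions (j > i) are left at 0.0; a causal mask would hide them.
    return [
        [-slope * (i - j) if j <= i else 0.0 for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

Because the bias for a given distance is the same at any sequence length, making the sequence longer only adds larger distances with larger penalties.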


It's definitely worth trying, especially on a testbed like this. I haven't looked into ALiBi, though I remember the paper coming out, here's the Github:

If we can increase the max tokens, then we would need datasets that contain really long text. The ones we're using are sized for regular LLaMA context and will not help much if we just change the parameters.

Yes. I'm not sure how long the bluemoon conversations go on in terms of tokens, but that's why we could start with something smaller like 4k. At the same time, I can test that the GPU memory is still sufficient and the batch size scaling works without going for a complete overkill in the beginning.

From what I saw online and git logs (without digging the transformers code itself yet), the method has been merged quite some time ago and should be readily usable without much else.

You must post at least once every two weeks. If you don't think you can keep up this pace, I would prefer if you didn't join my game. Thanks!

I notice some responses like this. I know it has already been discussed briefly in relation to OOC messages, I just wanted to confirm that it is reproduced in responses often enough to be a concern. A quick overview of the raw dataset shows a fair bit of "Closed..." and "BUMP", I don't recall if I checked if they were in the cleaned version or not.

I've been writing a small CLI tool in Dart to help process these datasets for Vicuna training. I'll be sure to drop any interesting results beyond what has already been discussed, but it's primarily so that cleaning doesn't require a js or python environment and doesn't offer much else.


I'm happy to take anything that'll help with this dataset because Vicuna is VERY particular about formatting with V1.1 and it may require rewriting training code to be more flexible to work with datasets before we can even start pruning.

At present, if the entries in a conversation are not like this:


Then the training dies. So that's something to keep in mind when it comes to cleaning out the Bluemoon dataset. Make sure any conversations like that get made sane. Or, if there are enough overall conversations (not entries) we can consider just dropping those conversations entirely. A few that I noticed glancing at the dataset ended with people asking if a person was still there or still interested in continuing. It could be nice to find things like "bump" or "closed" and generate continuations to go into those spots rather than pruning them outright.
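A rough sketch of how those "bump"/"closed" style posts could be located before deciding whether to prune or replace them. The marker list, the length cutoff, and the function name are assumptions for illustration; the "conversations"/"value" keys follow the filter script posted in this thread.

```python
import re

# Hypothetical marker list; extend it with whatever terms turn up in the data.
NOISE = re.compile(r"^\s*(?:bump|closed)\b", re.IGNORECASE)


def find_thread_noise(data: list, max_len: int = 80) -> list:
    """Return (conversation index, message index) pairs for short noise posts."""
    hits = []
    for ci, conv in enumerate(data):
        for mi, msg in enumerate(conv["conversations"]):
            text = msg["value"]
            # Only short posts are flagged, so story text that merely starts
            # with "Closed ..." is not caught by accident.
            if NOISE.match(text) and len(text.strip()) <= max_len:
                hits.append((ci, mi))
    return hits
```

The output is just a list of positions, so the flagged messages can be reviewed by hand before anything is removed.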

Can we merge all the messages that are sent successively by the same poster? At the same time, we could check if one of those posts is significantly shorter than the other (perhaps indicating that it's a BUMP or something). Not suggesting that gozfarb or someone else should do it, I could implement it myself too if the idea sounds good. Although this still wouldn't solve the kind of individual "can you post more often?" messages.
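Something like the following could work as a starting point for the merge idea. It is only a sketch: the "from" key for the poster is assumed from the usual Vicuna/ShareGPT layout, and the 10% length ratio is an arbitrary placeholder.

```python
def short_followups(messages: list, ratio: float = 0.1) -> list:
    """Indices of messages much shorter than a neighbour from the same poster."""
    flagged = []
    for i in range(1, len(messages)):
        prev, cur = messages[i - 1], messages[i]
        if prev["from"] != cur["from"]:
            continue
        a, b = len(prev["value"]), len(cur["value"])
        if min(a, b) < ratio * max(a, b):
            # Flag whichever of the two posts is the short one.
            flagged.append(i if b < a else i - 1)
    return flagged


def merge_consecutive(messages: list, sep: str = "\n\n") -> list:
    """Join runs of messages by the same poster into one message."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["from"] == msg["from"]:
            merged[-1]["value"] += sep + msg["value"]
        else:
            merged.append(dict(msg))  # copy so the input list is not modified
    return merged
```

Running short_followups first lets the suspicious posts be dropped or reviewed before merge_consecutive folds them into the neighbouring message.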

The Vicuna 1.1 format we can loosen a bit ourselves as well, but it does make sense at the same time that the human and ai message in alternating fashion...


I would rather just have someone get some good terms together (I am unfortunately too busy to sign up for a second data set search at the moment) and prune individual messages that IMPORTANTLY don't remove/skip important context (I can get a script together that just knocks out those, or maybe grab Yuri's from the original ShareGPT dataset threads) and then just adjust Vicuna's code to stop being so anal retentive about turn order. That's probably very valuable for this set in particular. Not a big deal for any converted instruct sets.

I'll probably have to change the dataset name, or just leave the old Vicuna compatible version and make a notice that the updated version requires changed code in the readme.

Removing some of the vicuna restrictions probably makes sense for this dataset. Working on this dataset will then become more natural as well. I will look into that at some point.


I think it should be pretty easy: in the pre-process, just actually check the username against the roles list instead of assuming %2. It'd be an experiment in how that messes with outputs, but it would also allow for conversations with more than 2 members, which could be useful later on.
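To make the difference concrete, here is a sketch of the two checks side by side. It is not the FastChat code; the role names and the "from" key are assumed from the usual Vicuna layout, and both function names are invented for the example.

```python
ROLES = ("human", "gpt")  # assumed role names


def follows_strict_alternation(turns: list, roles: tuple = ROLES) -> bool:
    """The %2 assumption: turn i must come from roles[i % 2]."""
    return all(t["from"] == roles[i % 2] for i, t in enumerate(turns))


def uses_known_roles(turns: list, roles: tuple = ROLES) -> bool:
    """The relaxed check: every turn only has to come from a listed role."""
    return all(t["from"] in roles for t in turns)
```

With the relaxed check, two posts in a row from the same person pass, and a third participant only needs to be added to the roles tuple.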

I'm currently uploading an epoch6 version of this. I did the finetune up to epoch8, but that might have already hit overtraining, since it was parroting too much stuff from the dataset. After some testing, epoch6 seemed like a good balance between consistent exchange of messages, creativity and things pulled from the dataset. I'm leaving the older epoch3 for now there also if comparison is needed.

I made the ALiBi test for a 4k context at the same time. I can't say exactly how well it works, but I did try some very long conversations, and the bot seemed to have no trouble remembering things mentioned early in the conversations (I made sure that no context swapping took place).

Epoch6 testing looks good. Definitely fewer inappropriate/ooc responses, though my testing is fairly limited so far.

I have only pushed over 2k context once and it immediately devolved into the telltale token spam (repeating the last few words over and over, basically) so ALiBi might be a meme or it might have been implemented improperly, I can't begin to say with any certainty since this is one of two models with it implemented. Though you said it works, so it's entirely possible the ooba sent something it shouldn't have. I definitely hope to hear other people's experiences here and anyone who figures out how to get it working assuming it does work.

EDIT: This may be an issue with the llama-cpp-python settings actually. I am not sure ooba passed the settings through to llama.cpp, I'll check later.
CONFIRMED It loaded the model with 2048 context, be aware when testing with ooba/koboldcpp.

That said, Kaiokendev is fixing up the bugs with his initial SuperBIG push and I think dearest proxy anon is going to implement a way to use it as is that I suggested this morning in the general. I've always thought clever context shifting was more promising than massive context size for consumer hardware anyway, because it means still being able to run 13B GPU models on 12GB VRAM with expanded effective context. We'll see how it goes. Model attention span for 13B models is still a problem, but parameter count is hard to get around for model prompt understanding anyway.

I was able to successfully push beyond 2048 context with llama.cpp by passing --ctx_size 4096
I don't think the option to force context sizes is exposed by any of the wrappers such as llama-cpp-python or koboldcpp, currently.

First, I generated 2918 tokens from a random prompt, using the 4k-epoch6 ggml model, with the generation remaining coherent past the 2048 token window:

Feeding the output back in as a prompt using the same model, it generated an additional 900 or so tokens while still remaining relatively coherent:

Just to test that I wasn't getting any kind of sneaky context shifting from llama.cpp, I fed the first 2918 output into a different 13B model I had laying around, and it immediately started outputting gibberish:

As far as I can tell, it seems like this actually works!


Awesome! Very promising. And relative coherence is about all you can expect from 13B models on their best day, so I'd call that a win.

Thanks for the extra testing. Hopefully the llama-cpp-python/ooba boys and kobold.cpp can expose the flag so these sorts of models can start to see a bit of wider use.

I was told that you need to implement alibi in place of ROPE in llama's inference, but I have no idea how to do it

Very nice indeed. I got similar results with llama.cpp only.

@gozfarb I agree, ultimately we can't go crazy with the context size due to the compute and dataset limits, and because of this the swapping techniques are probably going to be more robust for long RPs. Still, 2k is pretty limiting and it's nice if we have some means to increase that even if we most likely have to stay under 10k.

Regarding the dataset, there are many out of character messages typically marked with labels such as "OOC:", "OFF:" and then return to story context with "BIC:" or something like that. The kind of messages that @Squish42 also mentioned. Some regex such as this can be used to filter some of those out:

(((?i)(OOC|OCC))|(OFF)):((.*?(([ .!?](IC|DE|ON|BIC|ic|bic)[:. ])|((BiC[:]?)|on:)))|(?:[^"\\]|\\.)*(?="))

It should not be applied however if the match is the entire message. In this case manual intervention is probably needed (there are not too many of those). I might put together a small filter script for this.

@teknium it turns out that the transformers library has it already implemented. From what I understood, the implementation is relatively easy to use, you just need to override the max sequence length before the finetuning stage, and the code will take care of it automatically. I think. I did not find documentation regarding this. But the observations here suggest that it probably did its job...


bluemoon is probably small enough to throw regex at without it burning the computer down or taking 9 hours. Probably just do a count on capture groups if it seems to work and check that versus the total character count and if it's over 80-90%, just dump the entry.
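A minimal sketch of that ratio check. The pattern below is a simplified stand-in, not the regex posted in this thread, and the 0.85 threshold is just a value inside the 80-90% range mentioned above.

```python
import re

# Simplified stand-in pattern: an OOC/OCC/OFF marker up to a BIC:/IC: marker
# or the end of the message.
OOC = re.compile(r"(?:OOC|OCC|OFF):.*?(?:\bBIC:|\bIC:|$)", re.IGNORECASE | re.DOTALL)


def ooc_fraction(text: str) -> float:
    """Share of the message's characters covered by OOC matches."""
    if not text:
        return 0.0
    matched = sum(len(m.group(0)) for m in OOC.finditer(text))
    return matched / len(text)


def should_drop(text: str, threshold: float = 0.85) -> bool:
    """True when the message is almost entirely out-of-character chatter."""
    return ooc_fraction(text) >= threshold
```

Messages below the threshold can still have the matched spans stripped out, while the ones above it are candidates for dumping outright.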

This one takes care of the vast majority of unnecessary blabber. Sorry, that's Python. It still does not handle cases where the entire line is an OOC, but there are only 8 of those. I already applied the filter for the next run.

import json
import re

# Load the Vicuna-format dataset.
fname = "bluemoon_roleplay_300k_vicuna.json"
with open(fname, "r") as f:
    d = json.load(f)

# Matches an OOC/OFF marker up to where the post returns to the story
# (IC/BIC/on: and similar) or to the end of the line.
mr = re.compile(r'(((OOC|OCC|ooc|occ|Occ|Ooc|OoC))|(OFF)):.*?((([ \.!?](IC|DE|BIC|ic|bic)[:\. ])|((BiC[:]?)|[\.!?](on:|ON:|On:)))|$)')

# Remaining implicit OOC'ing, removed as literal strings.
manual = [
    "OOC: Sorry for the redonculous wait. Work takes priority ",
    "OOC: I hope you don't mind that I put a time skip in here",
    "OOC: Give her broken english.",
    "OoC: aww, and here I'd been hoping that something would be done with the soul edge fragments",
    "OOC:Oh sure. It's ok ",
    "OOC: Sorry for being gone so long...I hope you will continue rping with me",
]

for conv in d:
    for m in conv["conversations"]:
        # Strip OOC spans, but leave the message alone if that would empty it.
        if mr.search(m["value"]) is not None:
            s = mr.sub("", m["value"])
            if len(s) > 0:
                m["value"] = s

        s = m["value"]
        for e in manual:
            # Literal replacement, since these strings contain regex metacharacters.
            s = s.replace(e, "")
        m["value"] = s

# Write the filtered dataset next to the original.
fname = "bluemoon_roleplay_300k_vicuna_filtered.json"
with open(fname, "w") as f:
    json.dump(d, f)

You should also check out OpenAssistant's training code, it has dataloaders for vicuna or alpaca (and 2 others afaik)

@reeducator Added the script to the repo and updated the dataset to have the OOC removed. The old stuff is still there in the WithOOC folder.

I am also manually cleaning a bunch of other shit out by hand. Because I hate my free time. Will commit the fully cleaned version once I'm done. Just 261 more things to go!

EDIT: Okay, pruned the leftover ooc manually. I don't think I missed any. I also fixed up the "this thread is closed" topics.

Incredible, thanks! Surely we all have better things to do, but we also realize that our mission is paramount...


If I missed any ooc, let me know. There should be so few in there now that any popouts are incidental. I also removed a bunch of thread titles in the responses (a few hundred). But if anyone wants to make a list of thread titles in the dataset, please make a list here or on the dataset repo and I will gut them.


Also, because I have e-mail alerts off right now because of a thread that is deeply annoying that I was trying to help in, I only get e-mails about posts on repos I own, so if I'm slow to respond, that's why. Sorry.

The drama thread? Yeah no worries. Can't believe there's no way to unsubscribe. Maybe you can try filter those emails with that title.

koboldcpp now has support for larger context sizes on LLaMA models in its most recent version, if anyone was waiting for a better UX for experimenting with this model at full context.

You just have to pass --contextsize 4096 when starting up koboldcpp, and additionally set Max Tokens to 4096 in the settings (the slider caps at 2048, but you can click and manually type in a value).

For what it's worth, my impressions of it haven't really changed since I was messing with it in llama.cpp: It might change its writing style a little bit past 2048, preferring writing large blocks of text with minimal paragraph breaks, but it doesn't seem to get significantly dumber or anything. I even went over 4096 context and it was still able to write coherent sentences for a while, but it started repeating itself in spammy loops at ~4800 or so.


Interesting. I wonder if the extended context is biasing it toward any training material that is over the original 2048 limit and those bits being more likely to be long blocks of text. We don't have enough ALiBi models to test against, but I wonder if that is an artifact of how the extended embedding works. It could explain why the writing style maybe changes.

Or it could just be an artifact of the dataset itself. Not enough data but fun to keep in mind as we get more data.

I tested the epoch 6 and it definitely gets squishy past 2500, using the GPTQ version. At around 3000 it gets pretty far gone.

Bluemoon 30b is up for those who are interested:
GPTQ 4-bit will follow at some point when I manage to convert that.
