What an honour!
I am very flattered that you're spending time with my model. The context window should have been 16k, but I made the completely believable rookie mistake of overlooking --cutoff_len defaulting to 1024 in Llama Factory. Assuming that is the problem, I will fix it next week. Have a great weekend!
Is the context window for the smaller model correct?
// oh wait.. this IS the smaller model.
@BoscoTheDog
Both models seem to work until about 3k context.
I tried to use them at 16k, then reduced context step by step to find out when they start writing properly.
@cgus You'll absolutely be notified, and hopefully with good news for everyone.
@BoscoTheDog The context size of the base model is 16k (4k sliding window), and I didn't realise that instruction/input would be truncated at 1k by default when I trained the new layers. Some of the training data must have had responses of about 2k tokens, which would explain the model breaking down at about 3k.
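To illustrate what that oversight does (a rough sketch, not the actual training code; the model id is only an example), this is roughly how a 1024-token cutoff silently clips a long sample:

```python
from transformers import AutoTokenizer

# Any Llama-style tokenizer works for the illustration; the id is just an example.
tokenizer = AutoTokenizer.from_pretrained("h2oai/h2o-danube-1.8b-chat")

# A training sample whose response alone runs to roughly 2k tokens.
sample = "### Instruction: summarise this...\n### Response: " + "word " * 2000

# With cutoff_len left at 1024, everything past token 1024 is silently dropped,
# so the model never sees a well-formed long response during training.
ids = tokenizer(sample, truncation=True, max_length=1024)["input_ids"]
print(len(ids))  # 1024
```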
What I intend to do is train a LoRA for a couple of epochs with 32 for both rank and alpha (for a start) on Yukang/LongAlpaca-16k-length and Norquinal/claude_multiround_chat_30k, this time remembering to increase the cutoff limit. I'm not proud of that oversight, but it is quite exciting to fix.
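In peft terms the plan would look roughly like this (just a sketch; the actual run goes through Llama Factory, and the target modules are an assumption for a Llama-style architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("h2oai/h2o-danube-1.8b-chat")

# Rank and alpha both set to 32 as a starting point.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```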
Sounds groovy! I've already added Ninja Mouse as an available model in my soon-to-be-released fully browser-based project. Small models with large contexts are what's most interesting to me, since I want to recommend that users switch to those for summarization tasks.
@BoscoTheDog Right?! The potential with small models is huge, even though this one didn't really pan out as expected. I'll be training another model the coming week that can actually handle a large context without this type of output:
the hypoth44
, isDing..Iiv2for2 iss is.itpt..,,.23r of in)
)ose**.im. conclusioned note beder.:sfor
**ize for, are. ..G
I'm looking forward to seeing and trying your project, btw.
@cgus It seems like a problem with the base chat model. h2oai released another version with an 8k context window that I have tested, and it seems to work just fine. All the 16k versions start breaking down at 4k input tokens, and the uncertainty of whether it would ever work has made me look for a better foundation to build Ninja Mouse on. RoPE scaling should be possible on danube2 though, so that is the model I will be working with.
@trollek
I see. Yeah, training to extend context could work. The most notable example is how Deepseek-coder 1.3B and 6.7B had their context extended from 4k to 16k via RoPE scaling, with quite amazing results.
It's also possible to use the alpha value param to extend context without training, but quality might suffer a bit. I do use it a lot with some of my favorite models to extend 4k context to 8-16k during loading.
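As a rough illustration of the no-training route (a sketch; whether rope_scaling is accepted at load time depends on the architecture and transformers version):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "h2oai/h2o-danube2-1.8b-chat"

# Dynamic NTK/RoPE scaling applied when loading, no retraining involved.
# A factor of 2.0 roughly doubles the usable context, at some quality cost.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```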
On a side note, I have for now made exl2 quants for h2oai/h2o-danube2-1.8b-chat. I didn't test it extensively, but it looked fairly reasonable in the few dialogues and minor tests I tried.
@BoscoTheDog
If your app supports exl2 quants, you could try the model
@trollek
mentioned, I just quantized it: h2o-danube2-1.8b-chat-exl2. There are also GGUF quants available made by Bartowski and others.
I gave it a brief test and it seemed able to process a 7k-token-long article without breaking; it wasn't even an English article, and yet it got the gist pretty well.
If your app supports exl2 quants
@cgus I don't know. It's not an app, it's a website. 100% browser-based, running on https://github.com/tangledgroup/llama-cpp-wasm
If llama.cpp supports exl2, then it should too... right?
@cgus It took longer than expected, but I have finished the second version of NinjaMouse.
@trollek
Cheers. I've just finished making exl2 quants for it.
Edit: I'm planning to make iMatrix GGUFs as well some time later.
I noticed :-) Is the context size truly 16K now?
@cgus Could you perhaps share your .gguf files on Huggingface?
@cgus That is terrific. I haven't had the time to look into those quants yet, and I don't know if they work with Ollama, which is still the heart of my fairly basic setup. Do iMatrix quants have their own suffix for lm-studio and the like, e.g. -GGUF?
@BoscoTheDog It's only 8k, same as the new base model, but it is actually able to use all of it. I would love to try extending it to either 16k or 32k, but I have not had the guts to rent cloud GPUs with enough VRAM yet. I want to try everything with NinjaMouse eventually. Long context, multimodality, QuietStar thinking, MoE, MoD, and whatever the next mindbending thing will be. I am however not sure that I will extend it beyond the current 34 blocks (or layers). It is very plausible that some company will release a 2.5B or 3B model waay better than mine at any time these days. TL;DR I will try to extend the context, but for now I am pretty happy with it only being 8k. For now :)
@trollek
No, they can have exactly the same names as usual GGUF models. It's mostly just a calibration feature that helps preserve more of the important data during quantization.
I've never tried Ollama, but I suppose they should work; the feature is several months old.
I made GGUF quants but apparently they all finish messages with "<|file_separator|><|endofthought|>", so I'm looking into this issue for now.
@cgus While I have added those tokens, I have not actually used them in any training yet, or seen them in my own testing. Logically the model shouldn't even know they exist. Very odd indeed. I have actually tried forcing <|startofthought|> and <|endofthought|> with a system prompt I found on the OpenWebUI site, trying to emulate QuietStar, but without success.
@trollek
I managed to make GGUF quants with iMatrix calibration. I figured out it was a llama.cpp issue, and after multiple futile attempts to fix it I just used a newer version, which solved everything. I also uploaded my imatrix.dat file in case anyone wants to make their own iMatrix quants.
Strangely enough, GGUF quants work pretty well but exl2 quants and even the original Transformers model have weird output:
I think I found two sources of the problem:
- Something with the tokenizer: it only generates properly if I set "use_fast=False".
- Template issue here:
<|im_start|>assistant\n
If I remove "\n" after assistant, it generates properly.
But only if the tokenizer is loaded with "use_fast=False". Otherwise it still generates strange output, with or without "\n" in the template.
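Putting both fixes together, something like this works for me (a sketch; the repo path is a placeholder and the prompt is made up):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/NinjaMouse"  # placeholder for the actual repo

# The slow tokenizer avoids the strange output I was getting with the fast one.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

# ChatML-style prompt, with no "\n" after "assistant".
prompt = (
    "<|im_start|>user\nWhat is RoPE scaling?<|im_end|>\n"
    "<|im_start|>assistant"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```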
@cgus I think you are right about the tokenizer, and I think it is because I tried doing something with the template in a way that I was not supposed to. I had added "<|im_start|>assistant\n" to the line for the assistant, like a fool, instead of at the end of the user query. As I understand it, there have been some issues with the fast tokenizer, but when I tested the quants there weren't really any. Hopefully changing the tokenizer to the slow one and editing the template will help. Thank you so much for figuring this out! It would have taken me ages, and damn near did. Incidentally, I found a fix for the space at the start of every new line. I'm also slowly realising what I am not supposed to touch and why, and it actually feels like a "there is no spoon" moment.