Sequence length error

#1
by Huzderu - opened

Hi, thank you for the merge and for the quants! I think @sophosympatheia achieved a definite improvement over the 1.0 version.
Unfortunately, I've personally run into an issue with your EXL2 quants. At around 2,000 tokens of context, the model stops generating, and when I check the log I see this error:

Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/app/modules/text_generation.py", line 397, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/app/modules/exllamav2_hf.py", line 138, in __call__
    logits = self.ex_model.forward(seq_tensor[-1:].view(1, -1), ex_cache, loras=self.loras).to(input_ids.device).float()
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/exllamav2/model.py", line 553, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward

This "AssertionError: Total sequence length exceeds cache size in model.forward" happens on both GPU's I've tested, no matter the context size I set. I've tried both the 3 bit quant and the 3.5 bit quant with the same result. It's unfortunate, because this merge is really great, but I can't make it work on my end.

Hi! That's weird. I've tested the 4.0 and 3.5 bpw versions myself quite extensively at 32K context without issues. While I haven't tested the 3.0 quant, it should behave the same.
Do check that all your dependencies are up to date.

I have included the measurement file in case you want to make your own quant (and skip the measurement step). Let me know if that fixes your issue :)
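
Reusing that measurement with exllamav2's convert.py should look roughly like this (the paths are placeholders and the flag names are from memory, so check python convert.py -h on your checkout):

# -i: the fp16 source model, -o: a scratch/working dir, -cf: where the finished
# quant lands, -m: the provided measurement.json, -b: target bits per weight.
python convert.py -i /models/Midnight-Miqu-70B -o /tmp/exl2-work \
    -cf /models/Midnight-Miqu-70B-exl2-3.0bpw -m measurement.json -b 3.0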

For testing, I used TabbyAPI with exllamav2. Both up to date.
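
If it helps to compare against a bare-bones setup, this is roughly how the quant can be loaded straight through exllamav2's Python API with the cache explicitly sized for 32K (the path is a placeholder and the exact calls are from memory, so treat it as a sketch):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Midnight-Miqu-70B-exl2-3.5bpw"   # placeholder path
config.prepare()
config.max_seq_len = 32768                                   # advertise the full 32K context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)  # KV cache sized for 32K
model.load_autosplit(cache)                                  # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)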

Thank you. I'm not very knowledgeable when it comes to quantizing, but I'll figure it out. If I find a solution, I'll post it here, in case anyone else has this issue.

I've noticed that no matter what truncation settings I set in Oobabooga, this model always loads with the truncation length set to 2048. I think that's what makes it stop generating around 2,000 tokens. I'm using a Docker image of Oobabooga on a cloud GPU, with the OpenAI API to connect it to SillyTavern. I've tried messing around with the config files in the Oobabooga folder, but no luck so far. Thought I'd share my findings.
[Screenshot attached: Screenshot 2024-03-24 095458.png]
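
If anyone wants to try forcing it from the launch side instead of the UI, the ExLlamav2_HF loader takes a max_seq_len override at startup; something like this (the model folder name is a placeholder and the flag names are from memory, so verify with python server.py --help):

python server.py --model Midnight-Miqu-70B-exl2-3.5bpw --loader ExLlamav2_HF \
    --max_seq_len 32768 --api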

I'm experiencing the exact same issue with the 4.0 version of this model on an A100 GPU and the Docker image "valyriantech/text-generation-webui-oneclick-ui-and-api".
The same problem also occurs after pulling the latest text-generation-webui from git and running pip install -r requirements.txt -U.

I also get the same truncation length @Huzderu mentioned.
The funny thing about that number is that any story or chat whose context exceeds it fails.
And when I lower the context to 2048, the model starts working again.

Friend, here is how I fixed it. For some reason, if I load MidnightMiqu it always sets the context length to 2048. But if I load another model with 32768 context first, then unload it, and load MidnightMiqu afterwards, it works. So what I do is download both MidnightMiqu and a small 7B with 32K context, load the small model first, unload it, then load MidnightMiqu. I hope this helps anyone else with this issue.
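
If you're scripting it (I'm already going through the OpenAI-compatible API for SillyTavern anyway), the same load-unload-load dance can be done over the webui's internal model endpoints. The endpoint paths, the args key, and both model folder names below are from memory or placeholders, so treat this as a sketch:

import requests

BASE = "http://127.0.0.1:5000"   # the webui's OpenAI-compatible API

def load(model_name, args=None):
    # Model-management endpoints as I remember them from the openai extension docs.
    r = requests.post(f"{BASE}/v1/internal/model/load",
                      json={"model_name": model_name, "args": args or {}})
    r.raise_for_status()

def unload():
    requests.post(f"{BASE}/v1/internal/model/unload").raise_for_status()

# Warm up with a small 32K model first, then swap in MidnightMiqu.
load("some-small-7B-32k-exl2", {"max_seq_len": 32768})           # placeholder folder name
unload()
load("Midnight-Miqu-70B-exl2-3.5bpw", {"max_seq_len": 32768})    # placeholder folder name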

I posted my solution and forgot to quote you :)
