Weights broken?
Hello there,
I downloaded both the 4-bit 32g and 128g weights, and on my machine the model spits out only gibberish.
I used text-generation-webui as the backend, testing with both ExLlama v1 and ExLlama v2 and multiple parameter settings.
--model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21
--model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --alpha_value 2 --max_seq_len 8192
(SillyTavern for the front end)
Other models work perfectly.
(Xwin 70B, for example)
Can anyone confirm this, or am I just an idiot? >_<
Can you show me an example of the gibberish - is it one word repeated over and over?
--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value should be set.
exllama needs max_seq_len to be scaled manually based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.
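(For anyone following along, the rule Yhyu13 describes is just linear arithmetic on the native context. The sketch below assumes LongAlpaca's native context is 32768 tokens, as stated above, and is only an illustration of that rule, not something taken from the exllama code.)

```python
# Minimal sketch of the scaling rule described above. Assumption: LongAlpaca's
# native context length is 32768 tokens, as stated earlier in this thread.
NATIVE_CTX = 32768

def scaled_max_seq_len(alpha_value: float, native_ctx: int = NATIVE_CTX) -> int:
    """With NTK alpha scaling, max_seq_len has to be raised in step with
    alpha_value instead of being left at the native value."""
    return int(native_ctx * alpha_value)

print(scaled_max_seq_len(1))  # 32768 -> no alpha scaling, native context
print(scaled_max_seq_len(2))  # 65536 -> alpha_value=2, as in the example above
```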
Hello again, sorry for the late reply (I did some testing after Yhyu13 posted his comment).
(I used the 128g 4-bit weights for testing this time.)
Can you show me an example of the gibberish - is it one word repeated over and over?
Yes, it is kind of like that. (This time I used: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --max_seq_len 16384)
--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value should be set.
exllama needs max_seq_len to be scaled manually based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.
I think you are right, but I cannot test it; my 2x 3090s do not want to load the weights with:
--model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 24,24 --max_seq_len 32768
I guess that one is on me >_<
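(One plausible explanation of why the 24,24 split fails at 32768: the FP16 KV cache at 32K tokens is large, and --gpu-split mainly budgets VRAM for the model layers, so the cache and activations have to fit on top of that budget. The sketch below uses the usual Llama-2-70B architecture numbers, which are an assumption and not something stated in this thread.)

```python
# Back-of-the-envelope estimate of the KV cache at a 32768-token context.
# Assumed Llama-2-70B layout: 80 layers, 8 GQA KV heads, head dim 128.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                      # FP16 cache entries
tokens = 32768

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
kv_cache_gib = tokens * kv_bytes_per_token / 1024**3
print(f"KV cache at 32K tokens: ~{kv_cache_gib:.0f} GiB")   # roughly 10 GiB

# That cache (plus activations) sits on top of the roughly 35-40 GB of 4-bit
# weights, so budgeting the full 24 GB per card for weights leaves little
# headroom; a lower split such as the 19,23 used later in the thread does.
```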
Yes, that's a sequence length issue as we thought
Can you try with --max_seq_len 8192
- and no alpha parameter specified
Okay, I was able to do inference with --max_seq_len 32768.
I used: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 32768
Yes, that's a sequence length issue as we thought
Can you try with
--max_seq_len 8192
- and no alpha parameter specified
Of course,
here is the result using: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 8192
Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths
What about if you use --compress_pos_emb 2 --max_seq_len 8192
- you'll need to check that's the correct name for compress_pos_emb, but it's something like that
Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths
What about if you use
--compress_pos_emb 2 --max_seq_len 8192
- you'll need to check that's the correct name for compress_pos_emb, but it's something like that
This was an excellent idea actually. I tested it now with: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 8192
And... it is... ahm... kind of okay?
Then I tested it with: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 16384
And with: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 4 --max_seq_len 16384
Ahm... okay... didn't know that our tower was half a kilometer long... O.o
And with: --model TheBloke_LongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --compress_pos_emb 8 --max_seq_len 32768
I guess this is the way to go then; I initially thought that this model did not need compress_pos_emb or alpha_value to function.
So... I guess we can close this one now?
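(For what it's worth, the combination that ends up working lines up with how LongAlpaca was presumably trained. The sketch below assumes it is a Llama-2 fine-tune with a native 4096-token context, extended to 32768 with linear RoPE scaling; that assumption is not confirmed anywhere in this thread.)

```python
# Sketch of the arithmetic behind the settings tried above. Assumption:
# LongAlpaca-70B is a Llama-2 fine-tune (native 4096-token context) trained
# with linear RoPE scaling up to 32768, i.e. a position-interpolation factor of 8.
NATIVE_CTX = 4096      # Llama-2 base context (assumed)
TRAINED_CTX = 32768    # LongAlpaca context, as stated earlier in the thread

training_factor = TRAINED_CTX // NATIVE_CTX
print("compress_pos_emb matching training:", training_factor)   # 8

# Of the combinations tried above, only compress_pos_emb=8 with max_seq_len=32768
# reproduces that training setup, which lines up with the report below that 32K
# with a factor of 8 gives the most coherent output.
for compress_pos_emb, max_seq_len in [(2, 8192), (2, 16384), (4, 16384), (8, 32768)]:
    ok = compress_pos_emb == training_factor and max_seq_len == TRAINED_CTX
    print(compress_pos_emb, max_seq_len, "matches training" if ok else "mismatch")
```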
I tested this and got good, coherent output at max_sequence_length 32768 and compress_pos_emb of 8 using exllama_hf (not exllamav2). Other sequence lengths produced less coherent but still kind of usable output. Seems important to set it to 32K.
Just note that exllama_hf uses the Hugging Face implementation of the transformer, which is much slower than exllama with flash attention on CUDA devices.
exllama_hf at most hit a 40% usage rate for single-card inference with a 7B model on an RTX 3090, whereas exllama with flash attention would easily achieve a >95% usage rate.
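(If you want to confirm that flash-attn is actually importable in the environment text-generation-webui runs in, a quick check like the one below works. It assumes the standard flash_attn PyPI package; nothing beyond the mention of FA2 in this thread confirms which build is in use.)

```python
# Quick sanity check that the flash-attn package is available in the current
# Python environment (assumes the standard "flash_attn" PyPI distribution).
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "is available")
except ImportError:
    print("flash-attn not installed; attention will fall back to a slower path")
```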
Thank you. I'll try the model out on my Ubuntu build, which has FA2 installed. The first run was on Windows.