What's the context size?
Thank you, this model is doing extremely well AND it's running on my laptop! I love it.
Sorry if I missed the information somewhere, but what's the context size, please?
should be 32000
Yes, same as Mixtral! 32k
Thank you!
@AliceThirty what kind of GPU is in that laptop? I've been unable to get it working with 8 GB VRAM and 64 GB RAM :thinkies:
@papercanteen111 According to Mistral AI, the original Mixtral needs around 110 GB of VRAM (~2x 80 GB V100s in the cloud), but you can run a quantized GGUF version split across CPU & GPU (I'm running the 4-bit quantization on 64 GB RAM + 8 GB VRAM and it works pretty well).
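For reference, here's a rough sketch of that kind of CPU + GPU split using llama-cpp-python; the file name, layer count, and context size below are placeholders to tune for your own hardware, not my exact settings:

```python
# Rough sketch: load a quantized Mixtral GGUF with llama-cpp-python,
# pushing only some layers to an 8 GB GPU and keeping the rest in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF
    n_ctx=32768,       # Mixtral's full 32k context; lower it if RAM is tight
    n_gpu_layers=8,    # number of layers offloaded to VRAM; tune until it fits
)

out = llm("Q: What is the context window of Mixtral? A:", max_tokens=32)
print(out["choices"][0]["text"])
```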
@papercanteen111 Experimentally, I've managed to make a Q3 quant of this model "run" on my GPU with 12 GB of VRAM, but it took a full 8 minutes to load 2k of context and about 3 minutes to produce a response at 0.4 tokens per second. I'm toying with it on a rented 48 GB VRAM GPU, and in that environment it generates faster than my eye can read.
Additionally, as of two days ago there's some research on HQQ + MoE offloading that has succeeded in making Mixtral 8x7B run at usable speed in 12 GB of VRAM. Maybe there will be an easy way to generalize that approach to all MoEs soon?
@papercanteen111 Sorry for the misunderstanding, I am not running this model itself but the GGUF version (Q5_K_M) on my laptop. I have 64 GB of RAM and an RTX 3080 Ti with 16 GB of VRAM.
(I opened this thread on the wrong page.)