bpw and corresponding VRAM usage

#1
by joujiboi - opened

How much VRAM does/will each bpw require?

following
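
For a rough sense of scale, here's a back-of-envelope estimate of the weight footprint per bpw, assuming Mixtral 8x7B's roughly 46.7B total parameters and ignoring the KV cache and runtime overhead (both of which come on top):

```python
# Rough VRAM needed for the quantized weights alone.
# Assumption: ~46.7B total parameters for Mixtral 8x7B; cache/overhead not included.
PARAMS = 46.7e9

def weight_gib(bpw: float) -> float:
    """Weight memory in GiB at a given bits-per-weight."""
    return PARAMS * bpw / 8 / 1024**3

for bpw in (2.4, 3.0, 3.5, 4.0, 5.0, 6.0, 8.0):
    print(f"{bpw:>3} bpw -> ~{weight_gib(bpw):.1f} GiB of weights")
# 2.4 bpw -> ~13.0 GiB, 3.5 bpw -> ~19.0 GiB, 8.0 bpw -> ~43.5 GiB
```

The 22.2GB reported below for 3.5bpw is consistent with this once the cache and overhead are added.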

Out of an abundance of caution I downloaded the 2.4bpw for my 24GB card (since 2.4bpw is what fits a 70B model), but 2.4bpw should actually fit on a 16GB card instead. Which is awesome.

Edit: Never mind, 3.5bpw is quite clever. 2.4bpw is too dumb; don't use that one. If you would need to use 2.4bpw, use a 13B model instead, unless you really need the context or it works for you.

3.5bpw uses 22.2GB of VRAM. It looks like if a 3.7bpw were created it would still fit on a 24GB VRAM card and maybe perform slightly better.

Is that with the full 32K context?

The answer is "not quite"

Looks like you need 3.3bpw-3.4bpw to fit 32K on a completely empty 3090.

Yeah, 28500 context is precisely what my GPU can fit before OOM.
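
That lines up with a quick FP16 KV-cache estimate. Assuming Mixtral's config (32 layers, 8 KV heads, head dim 128), the cache costs about 128 KiB per token:

```python
# FP16 KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes each.
# Layer/head counts taken from Mixtral-8x7B's config (assumption).
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # 131072 B = 128 KiB

for ctx in (28500, 32768):
    print(f"{ctx:>5} tokens -> {ctx * bytes_per_token / 1024**3:.2f} GiB of cache")
# 28500 tokens -> ~3.48 GiB, 32768 tokens -> 4.00 GiB
```

Add that to the ~19 GiB of 3.5bpw weights and you are right at the edge of 24 GB, which is why the last few thousand tokens of context make the difference.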

You can do the full 32K completely on a 3090 with 3.5bpw, FP16 cache, about 1GB of Windows VRAM usage, and without the "CUDA - Sysmem Fallback Policy", while using TabbyAPI with CUDA 12.x, Python 3.11, and Flash Attention 2.

Usage is about 22.6 GB of VRAM at 32K (or about 23.6 GB including the ~1GB of Windows VRAM usage).

Is it worth it or practical? Not on my machine or with the current version, as the throughput was about 0.13 T/s.
Metrics: 78 tokens generated in 585.05 seconds (0.13 T/s, context 32353 tokens)

Speed around 4k is fast:
Metrics: 90 tokens generated in 2.17 seconds (41.43 T/s, context 4219 tokens) <-- Not accurate (truncated)

I think that means you are actually OOMing, even if the monitor doesn't show it? On Linux I OOM hard at 28K, and it's crazy fast up to then.

Actually, I can probably get it up a bit more by changing the chunk size... ExUI defaults to 2048; other UIs/APIs default to less, I think.
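
If you're loading through the exllamav2 Python API rather than a UI, the chunk size is controlled by the config's max_input_len / max_attention_size fields as far as I can tell; treat the attribute names and defaults below as assumptions and check them against your installed version:

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Mixtral-8x7B-instruct-exl2_3.5bpw"  # placeholder path
config.prepare()

# Smaller chunks lower the peak VRAM spike during prompt processing,
# at the cost of some prompt-processing speed.
config.max_input_len = 1024            # library default is 2048 (assumption)
config.max_attention_size = 1024 ** 2
```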

I think you're right. When double-checking the Sysmem Fallback Policy, it appears the setting wasn't actually applied to this instance of Python, which explains the slowdown: the model was likely being split between GPU VRAM and system memory.

This model is amazing, on par with GPT-3.5 or better... so far it's better than Dolphin or other fine-tunes.

FYI, I was able to load this model (2.4bpw) on a Colab T4 instance with 15GB of VRAM.

Running inference on model: /home/neuron/exllamav2/models/Mixtral-8x7B-instruct-exl2_2.4bpw
-- Model: /home/neuron/exllamav2/models/Mixtral-8x7B-instruct-exl2_2.4bpw
-- Options: ['gpu_split: 23,23', 'rope_scale: 1.0', 'rope_alpha: 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

Once upon a time,TDM was the largest and most popular game mode in CS:GO. Today, it is still one of the most played modes, but it has been overtaken by Valorant’s “Solo queue” mode.

The reason for this change is that a lot of gamers don’t like playing with strangers. They prefer to play with friends or known teammates. This makes TDM less attractive for many players.

However, there are still plenty of reasons to play TDM in CS:GO. Here are some of them:

  1. Teamwork: TDM is all about teamwork. You need to work together with your teammates to eliminate the enemy team. This makes it a great mode for practicing teamwork skills.
  2. Fun: TDM is a lot of fun. It’s fast-paced and action-packed. You never know what’s going to happen next.
  3. Competition: TDM is a competitive mode. You can earn points for each kill, and the team with the most points wins. This makes it a great mode for practicing your competitive skills.
  4. Variety: TDM offers a lot of variety. There are different maps, weapons,

-- Response generated in 5.91 seconds, 256 tokens, 43.33 tokens/second (includes prompt eval.)
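
For anyone who wants to reproduce a run like this from their own script rather than the bundled test script, a minimal exllamav2 load-and-generate sketch looks roughly like this (paths and sampler settings are placeholders; the calls are the basic generator API from exllamav2 around that time, so verify against your installed version):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/Mixtral-8x7B-instruct-exl2_2.4bpw"  # placeholder
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # allocate the cache as layers load
model.load_autosplit(cache)                # spread layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, num_tokens=256))
```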

2x RTX 3090 NVLink

With 2x3090, you can go a lot higher than 2.4bpw. I'm able to load 3.5bpw on a single 3090.

@goldrushgames

Yes, I am trying different combinations and the 8.0bpw model is the one that I cannot start.
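
That's expected from the weight math alone. Assuming ~46.7B total parameters, the 8.0bpw weights by themselves nearly fill both cards before any cache or overhead:

```python
# 8.0 bpw weight footprint, before KV cache, CUDA context, or activations.
# Assumption: ~46.7B total parameters.
print(46.7e9 * 8.0 / 8 / 1024**3)   # ~43.5 GiB, vs ~44.7 GiB total across 2x24 GB
```

That leaves only about a gigabyte across both cards for everything else, so the load fails. A 6.0bpw (or lower) quant should start on 2x3090.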

Hi,
I want to download the 3.0bpw version. Do I need to do it manually? I am using git clone but not getting the config files and everything, and trying to use AutoModelForCausalLM failed. Could you please guide me on how to download and deploy it using exllamav2?

Use huggingface-cli: install it in a Python env and activate the env. git clone has many issues.

huggingface-cli download "$model_path" --revision $branch --local-dir /home/user/dev/models/"$model_name" --local-dir-use-symlinks False
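
An equivalent route from Python, if you'd rather script the download with huggingface_hub (the repo id and branch name below are placeholders; pick the revision that matches the bpw you want):

```python
from huggingface_hub import snapshot_download

# Each bpw lives on its own branch of the repo, so select it via `revision`.
snapshot_download(
    repo_id="turboderp/Mixtral-8x7B-instruct-exl2",                      # placeholder
    revision="3.0bpw",                                                    # branch name (assumption)
    local_dir="/home/user/dev/models/Mixtral-8x7B-instruct-exl2_3.0bpw",  # placeholder
)
```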

It worked @Ahmed Morsi...thank you so much

Loaded 3.5bpw on my RTX 4090 with TabbyAPI, 8192 context, 83 tokens/s.
The generated code looks fine. Pretty cool, and way faster than GGUF models running on a Mac M1 Ultra.
