Min hardware requirements

#3 · opened by narvind2003

Could you please add the minimum hardware requirements to run this Instruct model?

I have tried using a 4090 (24 GB), but it didn't work... 💔💔💔 We need more RAM.

“>130GB required”

I believe you can find this information here: https://docs.mistral.ai/models/
(min. 100GB GPU RAM)

I think we may not need the full 100 GB if we use 'load_in_4bit' or 'load_in_8bit'. I will attempt it on an A100 80GB. ^^

Load in 4-bit should work. Same as 8-bit!

Quick math (approximate) for 45 billion parameters (see the sketch after this list):

  • in 4-bit -> 180 billion bits, that's 22.5 GB of VRAM required
  • in 8-bit -> 45 GB of VRAM
  • in half precision -> 90 GB of VRAM required
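
The same arithmetic as a short Python sketch (a rough estimate only; the 45B figure is taken from the comment above, and real usage is higher once activations and the KV cache are counted):

```python
# Rough VRAM estimate: parameter count * bits per parameter,
# ignoring activations, KV cache and framework overhead.
PARAMS = 45e9  # ~45 billion parameters

for name, bits in [("4-bit", 4), ("8-bit", 8), ("fp16", 16), ("fp32", 32)]:
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gb:.1f} GB")
# 4-bit: ~22.5 GB, 8-bit: ~45.0 GB, fp16: ~90.0 GB, fp32: ~180.0 GB
```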

Note that 4-bit quantization shows significant quality degradation. It might be interesting to explore quantizing only the experts. https://arxiv.org/abs/2310.16795, for example, introduces QMoE, which allows sub-1-bit quantization for MoEs. @timdettmers is also exploring this topic, so I'm expecting exciting things in the coming days!

I have tested it on a VM with 2 A10 GPUs (23 + 23 GB) and it works when using load_in_4bit, but not 8-bit. Performance in Italian is interesting.
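
For reference, a minimal sketch of that multi-GPU 4-bit setup with transformers (the model id and per-GPU memory caps are assumptions, adjust for your VM; bitsandbytes and accelerate need to be installed):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed model id

# Quantize to 4-bit and let accelerate shard the layers across both A10s.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each GPU
)
```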

RTX 4090, 24GB dedicated GPU memory + 32 GB shared GPU memory, Windows 11, WSL (Ubuntu):

  1. from_pretrained(load_in_8bit=True)
    45.8 GB
  2. from_pretrained(load_in_4bit=True)
    27.1 GB

@JayBZD how do you run it with shared memory? I have a 4090 and 64 GB of system memory available.

I run it (Q6_K) on CPU only. It's much faster than 70B models, but consumes over 50 percent of my 64 GB of RAM.
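
For the CPU-only route, a sketch with llama-cpp-python (the GGUF filename is a placeholder; n_threads should match your CPU):

```python
from llama_cpp import Llama

# Path to a local Q6_K GGUF file of the model -- placeholder, adjust to yours.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q6_K.gguf",
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads to use
)

out = llm("[INST] What hardware do I need to run you? [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```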

If we wanted to run at full precision, we would need 45B × 32 bits = 1440 billion bits -> 180 GB?
Are the parameters in float32 by default?
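
You can check the saved dtype from the config without downloading the weights (the model id here is an assumption for this thread):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# torch_dtype is the dtype the checkpoint was saved in; for this model
# family it is bfloat16, so "full precision" in practice means 16-bit
# (~90 GB), not float32 (~180 GB).
print(config.torch_dtype)
```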

On an RTX A6000 (48 GB):
load_in_4bit = 27.2 GB
load_in_8bit = 45.4 GB

How do I load in 4-bit when using the transformers library? Or do I load it another way?

Pass load_in_4bit=True to model.from_pretrained().
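
Concretely, something like this (model id assumed; requires bitsandbytes and accelerate):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,   # bitsandbytes 4-bit quantization
    device_map="auto",   # place the quantized layers on the available GPU(s)
)

inputs = tokenizer("[INST] Hello! [/INST]", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```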

What RAM/CPU/GPU is required to run this model and the Q4_K_M version of the model?

You can look at this: https://arxiv.org/abs/2312.17238

For fast inference with ~24GB RAM+VRAM in colab look at this: https://colab.research.google.com/github/dvmazur/mixtral-offloading
