Hardware requirements for the model.

#1
by Sc0urge - opened

Hi there, I've been trying to run inference on a LLaMA-2 model with a higher context window on a T4 GPU from Google Colab. Both this model and the 32k version from togethercomputer always crash the instance because of RAM, even with QLoRA. Is there some kind of formula to calculate the hardware requirements for models with an increased context window, or any proven configurations that work? Thanks in advance.

NousResearch org

QLoRA is used for training; do you mean quantization? The T4 GPU's memory is rather small (16GB), so you will be restricted to <10k context. For the full 128k context with the 13b model, it's ~360GB of VRAM (or RAM if using CPU inference) for fp16 inference. Quantization doesn't reduce the memory needed for the context very much...

At 64k context you might be looking at somewhere in the neighborhood of ~100GB of memory...

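As a rough back-of-the-envelope you can lower-bound the requirement from the weights plus the KV cache. Below is a minimal sketch, assuming the public Llama-2-13B config (40 layers, 40 KV heads, head dim 128, fp16 everywhere); real usage is higher because of activations and framework overhead, which is why the figures above are larger.

```python
# Lower-bound estimate of inference memory: model weights + KV cache.
# Config values are assumptions taken from the Llama-2-13B config;
# activations and framework overhead come on top of this.

def estimate_inference_memory_gb(
    n_params_b=13,        # model size in billions of parameters
    n_layers=40,          # num_hidden_layers
    n_kv_heads=40,        # Llama-2-13B uses full multi-head attention (no GQA)
    head_dim=128,         # hidden_size / num_attention_heads = 5120 / 40
    context_len=128_000,  # tokens held in the KV cache
    bytes_per_elem=2,     # fp16 / bf16
    batch_size=1,
):
    weights = n_params_b * 1e9 * bytes_per_elem
    # 2x for the K and V caches, per layer, per head, per token
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_elem)
    return (weights + kv_cache) / 1e9

for ctx in (10_000, 64_000, 128_000):
    print(f"{ctx:>7} tokens -> ~{estimate_inference_memory_gb(context_len=ctx):.0f} GB (floor)")
```

Quantizing the weights to 4-bit only shrinks the first term, which is why long contexts still blow past a 16GB card.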

My bad, what I meant was that when I loaded the model from HF with bitsandbytes (which I believe uses QLoRA), it already crashed.
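For reference, this is roughly the kind of loading I mean (a minimal sketch; the checkpoint name is a placeholder and I'm assuming the standard transformers + bitsandbytes 4-bit path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/Yarn-Llama-2-13b-128k"  # placeholder for the long-context checkpoint

# 4-bit weight quantization via bitsandbytes (what I was calling "QLoRA")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # lets accelerate place/offload layers that don't fit on the T4
    trust_remote_code=True,  # some long-context checkpoints ship custom modeling code
)
```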
