
What is the VRAM requirement of this model?

#1
by Said2k - opened

What is the VRAM requirement of this model? I have 8 GB of VRAM and was wondering whether the model can run on that much.

If you have bitsandbytes installed, you should be able to load the model by passing the load_in_8bit=True parameter to your AutoModelForCausalLM.from_pretrained() call.

Together org

I don't think 8 GB of VRAM is enough for this, unfortunately (especially given that when we go to 32K context, the KV cache becomes quite large too) -- we are pushing to decrease this! (e.g., we could do some KV cache quantization similar to what we have done in https://arxiv.org/abs/2303.06865, but it will take time)
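
As a rough back-of-the-envelope check (a sketch assuming the standard LLaMA-2 7B configuration -- 32 layers, 32 KV heads, head dim 128, fp16 cache values -- rather than anything specific to this repo), the KV cache alone at a 32K context already comes to about 16 GiB:

# Assumed LLaMA-2-7B config: 32 layers, 32 KV heads, head_dim 128, fp16 cache, batch size 1
num_layers, num_kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2          # fp16
seq_len = 32 * 1024          # 32K-token context
batch_size = 1

# factor of 2 covers keys and values
kv_cache_bytes = 2 * batch_size * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_cache_bytes / 1024**3:.0f} GiB")  # -> 16 GiB, before counting the weights themselves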

In the meantime, you can go to https://api.together.xyz/playground to play with it!

How can we load the model using bitsandbytes?

Together org

@BajrangWappnet, I think you can just do something like this:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=False,
    torch_dtype=torch.float16,
    load_in_8bit=True,  # requires bitsandbytes to be installed
)

Here's a more detailed example of how to use bitsandbytes: https://github.com/TimDettmers/bitsandbytes/blob/main/examples/int8_inference_huggingface.py
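
For completeness, here is a minimal end-to-end sketch along the same lines (assuming bitsandbytes is installed and a CUDA GPU is available; the prompt is just a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "togethercomputer/LLaMA-2-7B-32K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_8bit=True,   # int8 weights via bitsandbytes
    device_map="auto",   # place layers on the available GPU(s)
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))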
