ExLlamaV2 inference questions

#1
by shawei3000 - opened

I was trying to test speed using ExLlamaV2, but:

  1. I thought this 5.0bpw model would fit in the 48 GB of an A6000 like other similar models, but it spilled over into a second GPU and took ~60 GB across the two GPUs in total. Is this right?
  2. Do you have the special-token prompt format? I looked on Hugging Face, but what I found is probably wrong: '<|beginoftext|><|im_start|> system<>user<|im_end|>'. Do you have the right format to use for ExLlamaV2 inference?

Thanks,
Jim

The raw size of the files is already 45 GB for the 5.0bpw version, and that's before the KV cache for the context is allocated. So the 5.0bpw quant will definitely not fit in only 48 GB of VRAM.
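A rough back-of-the-envelope check, assuming ~72.7B total parameters for Qwen2-72B and the attention layout from its config.json (80 layers, 8 KV heads, head dim 128); treat these as estimates, not exact allocator output:

```python
# Rough VRAM estimate for an EXL2 quant: weights + FP16 KV cache.
# Parameter count and attention config are assumed from Qwen2-72B's
# config.json; adjust for other models.

params = 72.7e9          # total parameters (approximate)
bpw = 5.0                # EXL2 bits per weight

weights_gb = params * bpw / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")   # ~45 GB, matching the file size

# FP16 KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2
ctx = 32768              # full context length
cache_gb = bytes_per_token * ctx / 1e9
print(f"KV cache @ {ctx} tokens: ~{cache_gb:.1f} GB")  # ~10.7 GB

print(f"total: ~{weights_gb + cache_gb:.0f} GB")       # ~56 GB -> two GPUs
```

With activation buffers and per-GPU overhead on top, that lines up with the ~60 GB you observed. A quantized cache (ExLlamaV2Cache_Q4) or a shorter max_seq_len shrinks the cache term, but the 45 GB of weights alone already leaves almost no headroom on a single 48 GB card.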

The tokenizer_config.json file should point you to the right format:
https://huggingface.co/LoneStriker/Qwen2-72B-Instruct-5.0bpw-h6-exl2/blob/main/tokenizer_config.json
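For reference, the chat_template in that file is standard ChatML. A minimal sketch of rendering it with the transformers tokenizer (using the base repo ID, since the quant ships the original tokenizer files):

```python
from transformers import AutoTokenizer

# Either the local quant directory or the base repo ID works here,
# since the EXL2 quant keeps the original tokenizer files.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Renders the ChatML format defined by chat_template in tokenizer_config.json.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Give me a short introduction to large language models.<|im_end|>
# <|im_start|>assistant
```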

Many loaders should respect the chat_template field. I have not tested these specifically, but I assume they would pick it up automatically:
https://github.com/turboderp/exui
https://github.com/PygmalionAI/aphrodite-engine
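If you are calling ExLlamaV2 directly rather than through one of those loaders, a minimal sketch along the lines of the repo's own inference examples (the model path and sampler settings are placeholders, and the API may differ slightly between ExLlamaV2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Qwen2-72B-Instruct-5.0bpw-h6-exl2"  # local quant dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache enables autosplit loading
model.load_autosplit(cache)               # splits layers across both GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.8

# ChatML prompt, matching the chat_template in tokenizer_config.json.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

output = generator.generate_simple(prompt, settings, 256)
print(output)
```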

Thank you for your advice, and for your great work here for the community!
