
Assistant Llama 2 7B Chat AWQ

This model is a quantized export of wasertech/assistant-llama2-7b-chat using AWQ.

AWQ (Activation-aware Weight Quantization) is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is also supported by vLLM, a continuous-batching inference server, allowing Llama AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios.

As of September 25th, 2023, preliminary Llama-only AWQ support has also been added to Hugging Face Text Generation Inference (TGI).
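As a minimal sketch of local inference with this model, the example below uses the AutoAWQ library together with Transformers. The model id matches this card; the helper function, system prompt, and generation parameters are illustrative assumptions, and the prompt format follows the standard Llama 2 chat template.

```python
def build_llama2_chat_prompt(user_message: str,
                             system_prompt: str = "You are a helpful assistant.") -> str:
    """Format a single-turn prompt using the standard Llama 2 chat template."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

if __name__ == "__main__":
    # Assumes: pip install autoawq transformers, and a CUDA-capable GPU.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_id = "wasertech/assistant-llama2-7b-chat-awq"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)

    prompt = build_llama2_chat_prompt("What is AWQ quantization?")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For serving instead of local inference, the same model can be loaded by vLLM with its AWQ quantization option, as noted above.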
