---
license: apache-2.0
inference: false
---

# MistralLite-AWQ Model

MistralLite-AWQ is a version of the [MistralLite](https://huggingface.co/amazon/MistralLite) model that was quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
The MistralLite-AWQ models are approximately **70% smaller** than those of MistralLite whilst maintaining comparable performance.

Please refer to the [original MistralLite model card](https://huggingface.co/amazon/MistralLite) for details about the model preparation and training processes.

## MistralLite-AWQ Variants

| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|---:|---------------:|--------:|-----------|
| [main](https://huggingface.co/amazon/MistralLite-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MistralLite-AWQ-64g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MistralLite-AWQ-32g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |

## Dependencies

- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MistralLite model.
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking.

## Evaluations

### Long Context

The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.

#### Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|:-----------|------------:|------------:|------------:|------------:|------------:|
| _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MistralLite | 100 | 100 | 100 | 100 | 98 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |

#### [Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/#longeval-results)

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|:-----------|------------:|------------:|------------:|------------:|------------:|------------:|
| _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
| MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
| **MistralLite-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
| **MistralLite-AWQ-64g-4b-GEMM** | **96** | **96** | **90** | **70** | **72** | **60** |
| **MistralLite-AWQ-32g-4b-GEMM** | **98** | **96** | **84** | **76** | **70** | **62** |
| Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
| Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |

#### Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|:-----------|----------------:|----------------:|----------------:|----------------:|----------------:|----------------:|
| _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |

#### QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

| Model Name | Test set Accuracy | Hard subset Accuracy |
|:-----------|------------------:|---------------------:|
| MistralLite | 56.8 | 74.5 |
| **MistralLite-AWQ** | **55.3** | **71.8** |
| **MistralLite-AWQ-64g-4b-GEMM** | **55.2** | **72.9** |
| **MistralLite-AWQ-32g-4b-GEMM** | **56.6** | **72.8** |
| Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
| Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
| Mixtral-8x7B-v0.1 | 75 | 74.1 |
| Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |

## Usage

## Inference via vLLM HTTP Host

### Launch Host

```bash
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
```

### Query Host

```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "amazon/MistralLite-AWQ",
        "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?<|assistant|>",
        "temperature": 0,
        "echo": false
    }'
```

## Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)

```python
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?<|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## License

Apache 2.0

## Limitations

Before using the MistralLite-AWQ model, it is important to perform your own independent assessment, and take measures to ensure that your use would comply with your own specific quality control practices and standards, and that your use would comply with the local rules, laws, regulations, licenses and terms that apply to you, and your content.
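
## Appendix: Querying the vLLM Host from Python

The `curl` query shown under "Query Host" can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the host was launched as described above and is listening on `http://localhost:8000`. The helper names `build_prompt`, `build_payload`, and `query_host` are hypothetical and introduced here for illustration; the `max_tokens` value mirrors the offline inference example and is an assumption for the HTTP case.

```python
import json
import urllib.request

# Prompt format taken from the usage examples in this model card.
PROMPT_TEMPLATE = "<|prompter|>{question}<|assistant|>"


def build_prompt(question: str) -> str:
    """Wrap a question in the <|prompter|>...<|assistant|> format."""
    return PROMPT_TEMPLATE.format(question=question)


def build_payload(question: str,
                  model: str = "amazon/MistralLite-AWQ",
                  max_tokens: int = 100) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": build_prompt(question),
        "temperature": 0,
        "max_tokens": max_tokens,  # assumption: same limit as the offline example
        "echo": False,
    }


def query_host(question: str,
               url: str = "http://localhost:8000/v1/completions",
               timeout: float = 60.0) -> str:
    """POST the payload to a running vLLM host and return the generated text."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Completions responses carry the generated text under choices[0].text.
    return body["choices"][0]["text"]
```

With the host running, `query_host("What are the main challenges to support a long context for LLM?")` should return the completion text. Keeping prompt construction in its own function makes the prompt format easy to check without a running server.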