MistralLite-AWQ Model

MistralLite-AWQ is a version of the MistralLite model quantized with the AWQ method developed by Lin et al. (2023). The MistralLite-AWQ models are approximately 70% smaller than the original MistralLite model whilst maintaining comparable performance.

Please refer to the original MistralLite model card for details about the model preparation and training processes.

MistralLite-AWQ Variants

| Branch | Approx. Model Size | q_group_size | w_bit | version |
|---|---|---|---|---|
| main | 3.9 GB | 128 | 4 | GEMM |
| MistralLite-AWQ-64g-4b-GEMM | 4.0 GB | 64 | 4 | GEMM |
| MistralLite-AWQ-32g-4b-GEMM | 4.3 GB | 32 | 4 | GEMM |
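
As a rough sanity check on the "approximately 70% smaller" figure, the sketch below compares the sizes in the table above with an FP16 baseline. The baseline parameter count (about 7.24B) and the per-group storage layout (one FP16 scale plus one 4-bit zero point per group) are assumptions that are not stated in this model card; the function names are illustrative only.

```python
# Assumption: the FP16 MistralLite checkpoint holds roughly 7.24B parameters
# at 2 bytes each; this figure is not stated in this model card.
FP16_PARAMS = 7.24e9
FP16_SIZE_GB = FP16_PARAMS * 2 / 1e9

# Approximate sizes (GB) from the variants table, keyed by q_group_size.
AWQ_SIZES_GB = {128: 3.9, 64: 4.0, 32: 4.3}


def size_reduction(awq_size_gb: float, baseline_gb: float = FP16_SIZE_GB) -> float:
    """Return the fractional size reduction relative to the FP16 baseline."""
    return 1.0 - awq_size_gb / baseline_gb


def bits_per_weight(group_size: int, w_bit: int = 4) -> float:
    """Estimate storage bits per quantized weight.

    Assumption: every group stores one FP16 scale and one w_bit zero point.
    """
    return w_bit + (16 + w_bit) / group_size


for group_size, size_gb in AWQ_SIZES_GB.items():
    print(
        f"q_group_size={group_size}: "
        f"{size_reduction(size_gb):.0%} smaller, "
        f"~{bits_per_weight(group_size):.2f} bits per quantized weight"
    )
```

Under these assumptions, a smaller q_group_size means more scales and zero points per weight, which is consistent with the 32-group branch being the largest of the three.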

Dependencies

Evaluations

Long Context

The following benchmark results are shown as accuracy (%) values, unless stated otherwise.

Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|---|---|---|---|---|---|
| n_tokens (approx.) | 3048 | 5966 | 8903 | 11832 | 14757 |
| MistralLite | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |

Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 4317 | 6415 | 8510 | 10610 | 12698 | 14373 |
| MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
| MistralLite-AWQ | 96 | 94 | 88 | 80 | 70 | 62 |
| MistralLite-AWQ-64g-4b-GEMM | 96 | 96 | 90 | 70 | 72 | 60 |
| MistralLite-AWQ-32g-4b-GEMM | 98 | 96 | 84 | 76 | 70 | 62 |
| Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
| Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |

Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 3272 | 5405 | 8338 | 10205 | 12071 | 16072 |
| MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |
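
To make the n_garbage parameter concrete, here is a simplified sketch of how a pass key retrieval prompt could be assembled. It is written in the spirit of the linked script rather than copied from it: the exact wording, the pass key range, and the function name `build_passkey_prompt` are illustrative assumptions, and n_garbage is assumed to count characters of filler text.

```python
import random

# Illustrative filler sentence; the real benchmark text may differ.
FILLER = (
    "The grass is green. The sky is blue. The sun is yellow. "
    "Here we go. There and back again."
)


def build_passkey_prompt(n_garbage: int, seed: int = 0) -> tuple[str, int]:
    """Hide a random pass key inside n_garbage characters of filler text."""
    rng = random.Random(seed)
    pass_key = rng.randint(1, 50000)  # assumed key range

    # Repeat the filler often enough to cover n_garbage characters.
    repeats = n_garbage // (len(FILLER) + 1) + 1
    filler = " ".join([FILLER] * repeats)

    # Split the filler budget at a random point around the pass key.
    split = rng.randint(0, n_garbage)
    prefix = filler[:split]
    suffix = filler[: n_garbage - split]

    task = (
        "There is important information hidden inside a lot of "
        "irrelevant text. Find it and memorize it."
    )
    key_line = f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key."
    question = "What is the pass key?"
    prompt = "\n".join([task, prefix, key_line, suffix, question])
    return prompt, pass_key


prompt, pass_key = build_passkey_prompt(n_garbage=12000, seed=1)
print(f"{len(prompt)} characters, pass key {pass_key}")
```

A model's answer would then be scored as correct when it contains the hidden pass key, and accuracy is the fraction of correct answers over repeated trials.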

QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

| Model Name | Test set Accuracy | Hard subset Accuracy |
|---|---|---|
| MistralLite | 56.8 | 74.5 |
| MistralLite-AWQ | 55.3 | 71.8 |
| MistralLite-AWQ-64g-4b-GEMM | 55.2 | 72.9 |
| MistralLite-AWQ-32g-4b-GEMM | 56.6 | 72.8 |
| Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
| Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
| Mixtral-8x7B-v0.1 | 75 | 74.1 |
| Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |

Usage

Inference via vLLM HTTP Host

Launch Host

python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
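
The command above serves the main branch. To serve one of the other variants, the branch name can be passed through vLLM's `--revision` option. The helper below is a small sketch that assembles such a command; `build_server_command` and the `BRANCHES` tuple are illustrative names, and the branch list is taken from the variants table in this card.

```python
import shlex

MODEL_ID = "amazon/MistralLite-AWQ"
# Branch names as listed in the variants table of this model card.
BRANCHES = (
    "main",
    "MistralLite-AWQ-64g-4b-GEMM",
    "MistralLite-AWQ-32g-4b-GEMM",
)


def build_server_command(branch: str = "main") -> list[str]:
    """Assemble the vLLM server command for the chosen branch."""
    if branch not in BRANCHES:
        raise ValueError(f"Unknown branch: {branch!r}")
    command = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL_ID,
        "--quantization", "awq",
    ]
    # The main branch is the default revision, so the flag is only
    # added for the other variants.
    if branch != "main":
        command += ["--revision", branch]
    return command


print(shlex.join(build_server_command("MistralLite-AWQ-32g-4b-GEMM")))
```

The printed line can be pasted into a shell, or the list can be passed directly to `subprocess.run`.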

Query Host

curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "amazon/MistralLite-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
    }'
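
The same request can be sent from Python. The sketch below uses only the standard library and wraps the question in the `<|prompter|>...</s><|assistant|>` template used in the curl example. The helper names are illustrative, the URL assumes the host launched above is listening on localhost port 8000, and `max_tokens` is an addition that does not appear in the curl example.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"  # assumes the local host above
MODEL_ID = "amazon/MistralLite-AWQ"


def format_prompt(question: str) -> str:
    """Wrap a question in the prompt template shown in the curl example."""
    return f"<|prompter|>{question}</s><|assistant|>"


def build_payload(question: str, max_tokens: int = 100) -> dict:
    """Build the JSON body for the completions endpoint."""
    return {
        "model": MODEL_ID,
        "prompt": format_prompt(question),
        "temperature": 0,
        "max_tokens": max_tokens,  # not set in the curl example
        "echo": False,
    }


def query_host(question: str, url: str = API_URL) -> str:
    """POST the payload to a running host and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Show the request body without contacting the host.
question = "What are the main challenges to support a long context for LLM?"
print(json.dumps(build_payload(question), indent=2))
```

With the host running, `query_host(question)` returns the generated completion text.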

Inference via vLLM Offline Inference

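from vllm import LLM, SamplingParams

# Prompts follow the MistralLite template: <|prompter|>...</s><|assistant|>
prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
# Greedy decoding (temperature=0), capped at 100 generated tokens.
sampling_params = SamplingParams(temperature=0, max_tokens=100)

# Load the AWQ-quantized weights, matching the --quantization flag used for the HTTP host.
llm = LLM(model="amazon/MistralLite-AWQ", quantization="awq")

outputs = llm.generate(prompts, sampling_params)

# Print each prompt alongside its generated text.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")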

License

Apache 2.0

Limitations

Before using the MistralLite-AWQ model, perform your own independent assessment and take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses and terms that apply to you and your content.
