---
license: apache-2.0
inference: false
---
# MegaBeam-Mistral-7B-300k-AWQ Model
MegaBeam-Mistral-7B-300k-AWQ is a version of the [MegaBeam-Mistral-7B-300k](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) model that was
quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
The quantized MegaBeam-Mistral-7B-300k-AWQ weights are approximately **70% smaller** than those of MegaBeam-Mistral-7B-300k while maintaining comparable performance on most of the benchmarks reported below.
Please refer to the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) for details about the model
preparation and training processes.
## MegaBeam-Mistral-7B-300k-AWQ Variants
| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|---:|---------------:|--------:|-----------|
| [main](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
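
The table shows the model growing slightly as `q_group_size` shrinks. A rough way to see why is to count the storage per quantized weight. The sketch below is an approximation, not a description of the exact file layout: it assumes each group of `q_group_size` weights stores one 16-bit scale and one `w_bit`-wide zero point (the `scale_bits` parameter and the `effective_bits_per_weight` helper are illustrative names), and it ignores layers that are left unquantized.

```python
def effective_bits_per_weight(w_bit: int, q_group_size: int, scale_bits: int = 16) -> float:
    """Approximate storage cost of one quantized weight, in bits.

    Assumption: every group of `q_group_size` weights carries one scale
    (`scale_bits` wide) and one zero point (`w_bit` wide).
    """
    per_group_overhead = scale_bits + w_bit
    return w_bit + per_group_overhead / q_group_size


if __name__ == "__main__":
    # Group sizes of the three branches listed in the table above.
    for group_size in (128, 64, 32):
        bits = effective_bits_per_weight(w_bit=4, q_group_size=group_size)
        print(f"q_group_size={group_size}: ~{bits:.3f} bits per weight")
```

Under these assumptions, halving the group size doubles the per-group overhead, which matches the direction of the size differences between the branches.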
## Dependencies
- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MegaBeam-Mistral-7B-300k model.
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking.
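
For orientation, a quantization run with AutoAWQ might look like the following minimal sketch. It follows the general AutoAWQ usage pattern and is not the exact script used for this repository: `zero_point=True`, the default calibration data, and the output directory name are assumptions, while `q_group_size`, `w_bit`, and `version` mirror the columns of the variants table.

```python
def make_quant_config(q_group_size: int = 128, w_bit: int = 4, version: str = "GEMM") -> dict:
    """Build an AutoAWQ quantization config for one of the listed variants."""
    return {
        "zero_point": True,  # assumption: not stated in this model card
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version,
    }


def quantize(model_path: str, quant_path: str, quant_config: dict) -> None:
    """Quantize `model_path` with AWQ and save the result to `quant_path`."""
    # Imported lazily so the config helper can be used without a GPU setup.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Uses AutoAWQ's default calibration data (an assumption for this sketch).
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)


if __name__ == "__main__":
    quantize(
        model_path="amazon/MegaBeam-Mistral-7B-300k",
        quant_path="MegaBeam-Mistral-7B-300k-AWQ-64g",  # hypothetical output directory
        quant_config=make_quant_config(q_group_size=64),
    )
```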
## Evaluations
### InfiniteBench
This benchmark was developed by [Zhang et al. (2024)](https://arxiv.org/abs/2402.13718) and is available from https://github.com/OpenBMB/InfiniteBench.
See the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k)
for more details.
| Task Name | MegaBeam-Mistral-7B-300k-AWQ | MegaBeam-Mistral-7B-300k | Mistral-7B-Instruct-v0.2 | Llama-3-8B-Instruct-262k | Llama3-70B-1M | GPT-4-1106-preview | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
|------------------|------------------------------|--------------------------|--------------------------|--------------------------|---------------|--------------------|-----------------|-----------|----------|------------|-------------|------------------|
| Retrieve.PassKey | 100% | 100% | 75.76% | 98.30% | 81.35% | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
| Retrieve.Number | 92.7% | 96.10% | 25.25% | 97.79% | 97.62% | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
| Retrieve.KV | 0% | 0% | 0% | 3.40% | 3% | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
| En.Sum | 29.05% | 29.39% | 22.13% | 16.40% | 20.72% | 14.73% | 9.09% | 17.93% | 14.45% | < 5% | < 5% | < 5% |
| En.QA | 15.69% | 14.93% | 4.93% | 13.20% | 16.52% | 22.22% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
| En.MC | 48.91% | 51.52% | 7.80% | 50.65% | 62% | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
| En.Dia | 11.50% | 9.50% | 3.50% | 1% | 12.50% | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
| Zh.QA | 10.53% | 10.71% | 3.43% | 19.02% | 26% | 25.96% | 14.43% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
| Code.Debug | 21.83% | 27.41% | 11.60% | 22.08% | 23.85% | 39.59% | < 5% | 18.02% | < 5% | < 5% | < 5% | < 5% |
| Code.Run | 1.25% | 1.75% | 0.25% | 0% | 0% | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Calc | 0% | 0% | 0% | 0% | 0% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 20.57% | 24.28% | 26.28% | 15.40% | 30% | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
| **Average** | 29.34% | 30.70% | 15.08% | 28.10% | 31.13% | 46.08% | 20.41% | 34.93% | 37.21% | 22.78% | 25.41% | 17.59% |
### Long Context
The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.
#### Topic Retrieval
See https://lmsys.org/blog/2023-06-29-longchat/
| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
| _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100** | **100**| **100** | **100** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100** | **100**| **100** | **98** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100** | **100**| **100** | **98** |
#### [Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/#longeval-results)
See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results
| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
| MegaBeam-Mistral-7B-300k | 98 | 98 | 92 | 98 | 90 | 90 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **96**| **94**| **88** | **80** | **70**| **62** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100**| **98**| **96** | **96** | **90**| **94** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **98**| **98**| **82** | **96** | **92**| **90** |
#### Pass Key Retrieval
See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101
| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 | 100|
| **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100**| **100**| **100** | **100**| **100**|
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100**| **100**| **100** | **100**| **100**|
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100**| **100**| **100** | **100**| **100**|
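
To make the task concrete, the sketch below builds a pass key prompt in the spirit of the linked script. It is a simplified illustration rather than a copy of that code: the instruction wording is paraphrased, the helper names are hypothetical, and `n_garbage` is assumed to be the number of filler characters surrounding the pass key.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."


def make_passkey_prompt(n_garbage: int, seed: int = 0) -> tuple[str, int]:
    """Hide a random pass key inside `n_garbage` characters of filler text."""
    rng = random.Random(seed)
    pass_key = rng.randint(1, 50000)

    # Build enough repeated filler to slice n_garbage characters from it.
    repeats = n_garbage // len(FILLER) + 1
    filler = " ".join([FILLER] * repeats)

    # Split the filler budget at a random point: part before the key, part after.
    n_prefix = rng.randint(0, n_garbage)
    prefix = filler[:n_prefix]
    suffix = filler[: n_garbage - n_prefix]

    lines = [
        "There is important information hidden in a lot of irrelevant text. "
        "Find it and memorize it. I will quiz you about it afterwards.",
        prefix,
        f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key.",
        suffix,
        "What is the pass key? The pass key is",
    ]
    return "\n".join(lines), pass_key


def is_correct(generated_text: str, pass_key: int) -> bool:
    """Count a completion as correct if it contains the pass key."""
    return str(pass_key) in generated_text


if __name__ == "__main__":
    prompt, key = make_passkey_prompt(n_garbage=12000, seed=42)
    print(f"{len(prompt)} characters, pass key = {key}")
```

Accuracy in the table would then correspond to the fraction of prompts for which a check like `is_correct` succeeds.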
#### QuALITY (Question Answering with Long Input Texts, Yes!)
See https://nyu-mll.github.io/quality/
|Model Name| Test set Accuracy | Hard subset Accuracy|
|:----------|-------------:|-------------:|
| MegaBeam-Mistral-7B-300k | 53.2 | 72 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **51.3** | **71.3** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **52.4** | **72.1** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **53.1** | **71.3** |
## Usage
## Inference via vLLM HTTP Host
### Launch Host
```bash
python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq
```
### Query Host
```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
        "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
        "temperature": 0,
        "echo": false
    }'
```
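
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the host launched above is reachable at `http://localhost:8000`, and the helper names (`format_prompt`, `build_request`, `complete`) and the `max_tokens` value are illustrative choices, not part of the original example.

```python
import json
import urllib.request

MODEL = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"


def format_prompt(question: str) -> str:
    """Wrap a question in the prompt format used in the examples of this card."""
    return f"<|prompter|>{question}</s><|assistant|>"


def build_request(question: str, base_url: str = "http://localhost:8000",
                  max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": format_prompt(question),
        "temperature": 0,
        "max_tokens": max_tokens,  # assumption: the curl example omits this field
        "echo": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(question: str) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(question)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("What are the main challenges to support a long context for LLM?"))
```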
## Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
```python
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

# Load the AWQ-quantized weights from the main branch.
llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ", quantization="awq")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
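
To try one of the other variants, the branch name can be passed as the model revision. The sketch below assumes vLLM's `revision` argument is used to select the branch; the `BRANCH_BY_GROUP_SIZE` mapping and the `load_variant` helper are illustrative names built from the variants table.

```python
MODEL = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"

# Branch names taken from the variants table, keyed by q_group_size.
BRANCH_BY_GROUP_SIZE = {
    128: "main",
    64: "MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM",
    32: "MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM",
}


def load_variant(q_group_size: int = 128):
    """Load the variant with the given group size for offline inference."""
    # Imported lazily so the mapping can be inspected without vLLM installed.
    from vllm import LLM

    return LLM(
        model=MODEL,
        revision=BRANCH_BY_GROUP_SIZE[q_group_size],
        quantization="awq",
    )


if __name__ == "__main__":
    llm = load_variant(q_group_size=64)
```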
## License
Apache 2.0
## Limitations
Before using the MegaBeam-Mistral-7B-300k-AWQ model, it is important to perform your own
independent assessment and to take measures to ensure that your use complies
with your own specific quality control practices and standards, and with the
local rules, laws, regulations, licenses and terms that apply to you and your
content.