|
--- |
|
license: other |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
inference: false |
|
tags: |
|
- transformers |
|
- gguf |
|
- imatrix |
|
- Qwen2-7B-Instruct |
|
--- |
|
Quantizations of https://huggingface.co/Qwen/Qwen2-7B-Instruct |
|
|
|
**Note: you should use the latest llama.cpp version with the `-fa` (flash attention) switch to avoid garbage output.**
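
If you load these GGUF files from Python instead of the llama.cpp CLI, one way to pass the equivalent of `-fa` is through the `llama-cpp-python` bindings. Below is a minimal sketch, assuming `llama-cpp-python` is installed; the file name is a placeholder for whichever quantization you download.

```python
# Sketch using llama-cpp-python; flash_attn=True mirrors llama.cpp's -fa switch.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2-7B-Instruct.Q4_K_M.gguf",  # placeholder: use the quant file you downloaded
    n_ctx=4096,
    flash_attn=True,  # equivalent of passing -fa on the CLI
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```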
|
|
|
# From original readme |
|
|
|
## Requirements |
|
The code for Qwen2 has been included in the latest Hugging Face `transformers`, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:
|
``` |
|
KeyError: 'qwen2' |
|
``` |
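
If you are unsure which version is installed, a quick sanity check (not part of the original readme) is:

```python
# Verify that the installed transformers release is recent enough to know the "qwen2" architecture.
import transformers
from packaging import version  # packaging ships as a transformers dependency

assert version.parse(transformers.__version__) >= version.parse("4.37.0"), transformers.__version__
```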
|
|
|
## Quickstart |
|
|
|
Here is a code snippet with `apply_chat_template` that shows how to load the tokenizer and model, and how to generate content.
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
device = "cuda" # the device to load the model onto |
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
"Qwen/Qwen2-7B-Instruct", |
|
torch_dtype="auto", |
|
device_map="auto" |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct") |
|
|
|
prompt = "Give me a short introduction to large language model." |
|
messages = [ |
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
{"role": "user", "content": prompt} |
|
] |
|
text = tokenizer.apply_chat_template( |
|
messages, |
|
tokenize=False, |
|
add_generation_prompt=True |
|
) |
|
model_inputs = tokenizer([text], return_tensors="pt").to(device) |
|
|
|
generated_ids = model.generate( |
|
model_inputs.input_ids, |
|
max_new_tokens=512 |
|
) |
|
generated_ids = [ |
|
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
|
] |
|
|
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
``` |
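
To continue the dialogue, one option (not shown in the original snippet) is to print the reply, append it to `messages`, and repeat the template and generation steps. The sketch below reuses the `model`, `tokenizer`, `device`, `messages`, and `response` objects defined above; the follow-up question is illustrative.

```python
# Multi-turn continuation: append the assistant reply, add a new user turn, regenerate.
print(response)

messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you make it even shorter?"})  # hypothetical follow-up

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```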
|
|
|
### Processing Long Texts |
|
|
|
To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. |
|
|
|
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps: |
|
|
|
1. **Install vLLM**: You can install vLLM by running the following command. |
|
|
|
```bash |
|
pip install "vllm>=0.4.3" |
|
``` |
|
|
|
Or you can install vLLM from [source](https://github.com/vllm-project/vllm/). |
|
|
|
2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file to include the snippet below:
|
```json |
|
{ |
|
"architectures": [ |
|
"Qwen2ForCausalLM" |
|
], |
|
// ... |
|
"vocab_size": 152064, |
|
|
|
// adding the following snippets |
|
"rope_scaling": { |
|
"factor": 4.0, |
|
"original_max_position_embeddings": 32768, |
|
"type": "yarn" |
|
} |
|
} |
|
``` |
|
This snippet enables YARN, allowing the model to handle longer contexts.
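
If you prefer to apply this edit programmatically, here is a minimal sketch (not from the original readme); the path is a placeholder for wherever your downloaded weights live.

```python
# Patch config.json in place to add the YARN rope_scaling block shown above.
import json
from pathlib import Path

config_path = Path("path/to/weights/config.json")  # placeholder path
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2))
```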
|
|
|
3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server using the command:
|
|
|
```bash |
|
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights |
|
``` |
|
|
|
Then you can access the Chat API by: |
|
|
|
```bash |
|
curl http://localhost:8000/v1/chat/completions \ |
|
-H "Content-Type: application/json" \ |
|
-d '{ |
|
"model": "Qwen2-7B-Instruct", |
|
"messages": [ |
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
{"role": "user", "content": "Your Long Input Here."} |
|
] |
|
}' |
|
``` |
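
Alternatively, since the server exposes an OpenAI-compatible API, you can issue the same request from Python with the official `openai` client (version 1.x assumed; the `api_key` value is a placeholder):

```python
# Query the local vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

completion = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."},
    ],
)
print(completion.choices[0].message.content)
```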
|
|
|
For further instructions on using vLLM, please refer to our [GitHub](https://github.com/QwenLM/Qwen2).
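
vLLM can also be used for offline (non-server) inference. A rough sketch with the same model name and illustrative prompt and sampling settings:

```python
# Offline vLLM inference: build the chat prompt with the tokenizer's template, then generate.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
llm = LLM(model="Qwen/Qwen2-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)
```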
|
|
|
**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required. |