--- license: other language: - en pipeline_tag: text-generation inference: false tags: - transformers - gguf - imatrix - Qwen2-7B-Instruct --- Quantizations of https://huggingface.co/Qwen/Qwen2-7B-Instruct **Note: you should use latest llama.cpp version with -fa switch to avoid garbage output.** # From original readme ## Requirements The code of Qwen2 has been in the latest Hugging face transformers and we advise you to install `transformers>=4.37.0`, or you might encounter the following error: ``` KeyError: 'qwen2' ``` ## Quickstart Here provides a code snippet with `apply_chat_template` to show you how to load the tokenizer and model and how to generate contents. ```python from transformers import AutoModelForCausalLM, AutoTokenizer device = "cuda" # the device to load the model onto model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen2-7B-Instruct", torch_dtype="auto", device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct") prompt = "Give me a short introduction to large language model." messages = [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt} ] text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) model_inputs = tokenizer([text], return_tensors="pt").to(device) generated_ids = model.generate( model_inputs.input_ids, max_new_tokens=512 ) generated_ids = [ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) ] response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] ``` ### Processing Long Texts To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps: 1. **Install vLLM**: You can install vLLM by running the following command. ```bash pip install "vllm>=0.4.3" ``` Or you can install vLLM from [source](https://github.com/vllm-project/vllm/). 2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the below snippet: ```json { "architectures": [ "Qwen2ForCausalLM" ], // ... "vocab_size": 152064, // adding the following snippets "rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" } } ``` This snippet enable YARN to support longer contexts. 3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command: ```bash python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights ``` Then you can access the Chat API by: ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen2-7B-Instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Your Long Input Here."} ] }' ``` For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2). **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.