
Qwen2-7B-Instruct-imatrix-GGUF


	---
	license: other
	language:
	- en
	pipeline_tag: text-generation
	inference: false
	tags:
	- transformers
	- gguf
	- imatrix
	- Qwen2-7B-Instruct
	---
	Quantizations of https://huggingface.co/Qwen/Qwen2-7B-Instruct

	Note: use the latest llama.cpp version with the `-fa` (flash attention) switch to avoid garbage output.

	# From original readme

	## Requirements
	Qwen2 support is included in recent Hugging Face transformers releases. We advise you to install `transformers>=4.37.0`, or you might encounter the following error:
	```
	KeyError: 'qwen2'
	```
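A quick way to check the requirement before loading the model is to compare the installed version against the minimum. The sketch below is only an illustration: the helper names `parse_version` and `supports_qwen2` are hypothetical, and the parser looks at numeric components only, so a pre-release such as `4.37.0rc1` is treated as `4.37.0`.

```python
import re
from importlib import metadata

# Minimum version quoted in the requirement above.
MIN_TRANSFORMERS = (4, 37, 0)


def parse_version(text):
    """Return the first three numeric components of a version string."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric suffixes such as "dev0"
        parts.append(int(match.group()))
    # Pad short versions like "4.40" with zeros so tuples compare cleanly.
    return tuple((parts + [0, 0, 0])[:3])


def supports_qwen2(installed):
    """True if the installed version is at least MIN_TRANSFORMERS."""
    return parse_version(installed) >= MIN_TRANSFORMERS


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if supports_qwen2(installed) else "too old for Qwen2"
        print(f"transformers {installed}: {status}")
```

If the check reports a version that is too old, upgrading with `pip install -U transformers` should resolve the `KeyError`.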

	## Quickstart

	The following code snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	device = "cuda"  # the device to move the tokenized inputs onto

	# device_map="auto" lets transformers decide where to place the weights.
	model = AutoModelForCausalLM.from_pretrained(
	    "Qwen/Qwen2-7B-Instruct",
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

	prompt = "Give me a short introduction to large language model."
	messages = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": prompt},
	]
	# Render the conversation as a single prompt string ending with the assistant turn.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(device)

	generated_ids = model.generate(
	    model_inputs.input_ids,
	    max_new_tokens=512,
	)
	# Strip the prompt tokens so only the newly generated tokens remain.
	generated_ids = [
	    output_ids[len(input_ids):]
	    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	```
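When running the quantized files outside of transformers, the prompt string may have to be assembled by hand. The sketch below assumes the ChatML layout (`<|im_start|>` / `<|im_end|>` markers) that the Qwen2 chat template produces; `build_chatml_prompt` is a hypothetical helper, not part of any library, and the output of `apply_chat_template` remains the reference.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render chat messages in ChatML form (assumed layout, see lead-in)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
prompt_text = build_chatml_prompt(messages)
print(prompt_text)
```

Unlike the real chat template, this helper does not insert a default system message when none is given, so pass one explicitly if you need it.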

	### Processing Long Texts

	To handle inputs longer than 32,768 tokens, we use [YaRN](https://arxiv.org/abs/2309.00071), a technique for improving model length extrapolation that helps maintain performance on long texts.

	For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

	1. Install vLLM: You can install vLLM by running the following command.

	```bash
	pip install "vllm>=0.4.3"
	```

	Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

	2. Configure Model Settings: After downloading the model weights, modify the `config.json` file by adding the snippet below:
	```json
	{
	    "architectures": [
	        "Qwen2ForCausalLM"
	    ],
	    // ...
	    "vocab_size": 152064,

	    // add the following snippet
	    "rope_scaling": {
	        "factor": 4.0,
	        "original_max_position_embeddings": 32768,
	        "type": "yarn"
	    }
	}
	```
	This snippet enables YaRN to support longer contexts.

	3. Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-like server using the command:

	```bash
	python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights
	```

	You can then access the Chat API with:

	```bash
	curl http://localhost:8000/v1/chat/completions \
	    -H "Content-Type: application/json" \
	    -d '{
	    "model": "Qwen2-7B-Instruct",
	    "messages": [
	        {"role": "system", "content": "You are a helpful assistant."},
	        {"role": "user", "content": "Your Long Input Here."}
	    ]
	    }'
	```
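The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from step 3 is listening on `http://localhost:8000`; the helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical, and the response is assumed to follow the OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(messages, model="Qwen2-7B-Instruct",
                       base_url="http://localhost:8000"):
    """Build the POST request that mirrors the curl command above."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat-completions response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(messages, **kwargs):
    """Send the request and return the reply; requires a running server."""
    request = build_chat_request(messages, **kwargs)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat([{"role": "user", "content": "Your Long Input Here."}])` should return the generated text as a string.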

	For further vLLM usage instructions, please refer to our [GitHub](https://github.com/QwenLM/Qwen2).

	Note: vLLM currently supports only static YaRN, which means the scaling factor stays constant regardless of input length and may degrade performance on shorter texts. We advise adding the `rope_scaling` configuration only when long-context processing is required.
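Because the setting is best enabled only when needed, it can help to toggle it with a small script instead of editing `config.json` by hand. The following is a minimal sketch: the function names are hypothetical, the values are copied from step 2, and it assumes the real `config.json` is plain JSON without the `//` comments shown in the illustration.

```python
import json
from pathlib import Path

# Values taken from the rope_scaling snippet in step 2.
YARN_SETTINGS = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def needs_yarn(num_tokens, native_limit=32768):
    """True when the input is longer than the native context window."""
    return num_tokens > native_limit


def scaled_context_length(settings=YARN_SETTINGS):
    """Context length implied by the settings: factor * original window."""
    return int(settings["factor"] * settings["original_max_position_embeddings"])


def set_yarn(config, enabled):
    """Return a copy of the config with rope_scaling added or removed."""
    updated = dict(config)
    if enabled:
        updated["rope_scaling"] = dict(YARN_SETTINGS)
    else:
        updated.pop("rope_scaling", None)
    return updated


def update_config_file(path, enabled):
    """Rewrite a config.json in place with YaRN switched on or off."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(json.dumps(set_yarn(config, enabled), indent=2) + "\n",
                    encoding="utf-8")
```

With a factor of 4.0 over a 32,768-token window, `scaled_context_length()` evaluates to 131,072 tokens. Calling `update_config_file("path/to/weights/config.json", enabled=False)` removes the block again for short-context serving.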