hzhwcmhf committed
Commit 1436d4c
1 Parent(s): 1c79735

Update README.md

Files changed (1)
  1. README.md +3 -58
README.md CHANGED
@@ -15,7 +15,9 @@ Qwen2 is the new series of Qwen large language models. For Qwen2, we release a n
 
 Compared with the state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, and reasoning.
 
- Qwen2-57B-A14B-Instruct-GPTQ-Int4 supports a context length of up to 65,536 tokens, enabling the processing of extensive inputs. Please refer to [this section](#processing-long-texts) for detailed instructions on how to deploy Qwen2 for handling long texts.
+ **Note: vLLM does not currently support the GPTQ version of Qwen2MoeForCausalLM.**
+ 
+ Qwen2-57B-A14B-Instruct supports a context length of up to 65,536 tokens, enabling the processing of extensive inputs. However, since vLLM does not currently support this quantized model (Qwen2-57B-A14B-Instruct-GPTQ-Int4), please refer to [Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct) for long-text processing.
 
 For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2/), [GitHub](https://github.com/QwenLM/Qwen2), and [Documentation](https://qwen.readthedocs.io/en/latest/).
 <br>
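Because vLLM cannot serve this GPTQ checkpoint, the `transformers` quickstart that this commit leaves unchanged (its tail is visible in the context lines of the hunk below) remains the way to run the model. A condensed sketch of that workflow, assuming the usual GPTQ prerequisites (`auto-gptq`, `optimum`, `accelerate`) are installed; the prompt text is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-57B-A14B-Instruct-GPTQ-Int4"

# device_map="auto" spreads the quantized weights across available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# strip the prompt tokens, keeping only the newly generated ones
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```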
@@ -71,63 +73,6 @@ generated_ids = [
 response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 ```
 
- ### Processing Long Texts
- 
- To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
- 
- For deployment, we recommend using vLLM. You can enable long-context capabilities by following these steps:
- 
- 1. **Install vLLM**: You can install vLLM by running the following command:
- 
-    ```bash
-    pip install "vllm>=0.4.3"
-    ```
- 
-    Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).
- 
- 2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by adding the snippet below:
-    ```json
-    {
-        "architectures": [
-            "Qwen2MoeForCausalLM"
-        ],
-        // ...
-        "vocab_size": 152064,
- 
-        // add the following snippet
-        "rope_scaling": {
-            "factor": 2.0,
-            "original_max_position_embeddings": 32768,
-            "type": "yarn"
-        }
-    }
-    ```
-    This snippet enables YARN to support longer contexts; a scripted way to apply the same edit is sketched after these steps.
- 
- 3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server using the command:
- 
-    ```bash
-    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-57B-A14B-Instruct-GPTQ-Int4 --model path/to/weights
-    ```
- 
-    Then you can access the Chat API with:
- 
-    ```bash
-    curl http://localhost:8000/v1/chat/completions \
-        -H "Content-Type: application/json" \
-        -d '{
-        "model": "Qwen2-57B-A14B-Instruct-GPTQ-Int4",
-        "messages": [
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": "Your Long Input Here."}
-        ]
-        }'
-    ```
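The `config.json` edit in step 2 can also be scripted rather than applied by hand. A minimal sketch, reusing the placeholder weights directory `path/to/weights` from the deployment command; the keys and values are exactly those shown in the JSON snippet above:

```python
import json
from pathlib import Path

# placeholder directory from the deployment command above
config_path = Path("path/to/weights/config.json")
config = json.loads(config_path.read_text())

# Static YARN: a 2.0 scaling factor over the native 32,768-token window
# yields the advertised 65,536-token context (2.0 x 32768 = 65536).
config["rope_scaling"] = {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2))
```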
- 
- For further usage instructions for vLLM, please refer to our [GitHub](https://github.com/QwenLM/Qwen2).
- 
- **Note**: Presently, vLLM only supports static YARN, meaning the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
- 
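The curl request in step 3 can equally be issued from Python. A minimal sketch, assuming the `openai` client library (v1.x) is installed and the vLLM server started above is listening on localhost:8000; the API key is a dummy value, since vLLM does not check it unless configured to:

```python
from openai import OpenAI

# point the client at the local vLLM OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen2-57B-A14B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."},
    ],
)
print(completion.choices[0].message.content)
```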
 ## Benchmark and Speed
 
 To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our [Benchmark of Quantized Models](https://qwen.readthedocs.io/en/latest/benchmark/quantization_benchmark.html). This benchmark provides insights into how different quantization techniques affect model performance.
 