Update README.md
"""
```

### vLLM
The following command can be used to create an OpenAI-compatible API endpoint at `http://localhost:8000/v1` with a maximum context length of 256K tokens.
```shell
vllm serve Intel/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound --port 8000 --max-model-len 262144
```

The following command is recommended for MTP (multi-token prediction), with the rest of the settings the same as above:
```shell
vllm serve Intel/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound --port 8000 --max-model-len 262144 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```

```bash
curl --noproxy '*' http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Give me a short introduction to large language model."}
    ],
    "max_tokens": 1024
  }'

# "content":
# "A large language model (LLM) is a type of artificial intelligence system trained on vast amounts of text data to understand, generate, and manipulate human language. These models use deep learning architectures—often based on the transformer network—to predict the next word in a sequence, enabling them to perform tasks like answering questions, writing essays, translating languages, and even coding. LLMs, such as GPT, Gemini, and Claude, learn patterns and relationships in language without explicit programming, allowing them to produce human-like responses across a wide range of topics. While powerful, they don’t “understand” language in the human sense and can sometimes generate plausible-sounding but incorrect or biased information.",
```
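
Since vLLM exposes an OpenAI-compatible API, the same request can also be sent from Python. The following is a minimal sketch, assuming the `openai` package is installed and the server launched above is running; vLLM does not require a real API key by default, so any placeholder value works.

```python
from openai import OpenAI

# Point the client at the local vLLM server; the key is an arbitrary placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."}
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
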
### Generate the model
```bash
auto_round --model Qwen/Qwen3-Next-80B-A3B-Instruct --scheme W4A16 --output_dir tmp_autoround
```
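
For reference, roughly the same quantization can be driven from Python instead of the CLI. This is only a sketch based on AutoRound's `AutoRound` class; the arguments shown (`bits=4`, `group_size=128`, `sym=True`, chosen to mirror the W4A16 scheme above) are assumptions, so check the AutoRound documentation for the exact signature in your installed version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only settings intended to match the W4A16 CLI scheme above (assumed).
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("tmp_autoround", format="auto_round")
```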