# Qwen3 Best Practices
Discussion: [issue 4030](https://github.com/modelscope/ms-swift/issues/4030)

Qwen Documentation: [https://qwen.readthedocs.io/en/latest/training/ms_swift.html](https://qwen.readthedocs.io/en/latest/training/ms_swift.html)

## Inference

Thinking mode:

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --vllm_max_model_len 8192
```

```text
<<< who are you?
<think>
Okay, the user is asking "who are you?" Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and engaging in conversations. But I need to keep it concise. Also, the user might want to know how I can assist them. Maybe I should ask how I can help them today. Let me check if there's anything else important to include. Oh, I should make sure the tone is friendly and approachable. Alright, that should cover it.
</think>

Hello! I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, such as answering questions, creating content, writing stories, coding, and more. How can I help you today? 😊
<<< clear
<<< who are you? /no_think
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I help you today?
```

Non-thinking mode:

- `--response_prefix` makes the model continue generating after the given prefix. Prefilling an empty `<think>\n\n</think>\n\n` block is equivalent to setting `enable_thinking` to `False`.

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --vllm_max_model_len 8192 \
    --response_prefix '<think>\n\n</think>\n\n'
```

```text
<<< who are you?
<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. I am designed to assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I assist you today?
```

## Training

Before starting training, please ensure that your environment is properly configured.

```bash
pip install ms-swift -U
pip install transformers

pip install deepspeed # for multi-GPU training
pip install liger-kernel # to save GPU memory resources
pip install flash-attn --no-build-isolation  # required for packing
```

## Supervised Fine-Tuning (SFT)

### Data Preparation

When using ms-swift for SFT, the custom dataset format is as follows (the `system` field is optional). You can organize it in JSON, JSONL, or CSV format. Specify `--dataset <dataset_path>` in the training script. For a complete guide on dataset formats, refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md).

```text
# General format
{"messages": [
    {"role": "system", "content": "<system-prompt>"},
    {"role": "user", "content": "<query1>"},
    {"role": "assistant", "content": "<response1>"}
]}
# Format with thinking process
{"messages": [
    {"role": "user", "content": "Where is the capital of Zhejiang?"},
    {"role": "assistant", "content": "Thought: ...\n\nAnswer:\nThe capital of Zhejiang is Hangzhou."}
]}
```

If you want to fine-tune on data without a chain of thought while preserving the model's reasoning ability, you can use one of the following methods to minimize the impact of fine-tuning:

**Option 1** (recommended): During training, specify `--loss_scale ignore_empty_think`, which skips the loss calculation for the empty `<think>\n\n</think>\n\n` tokens, thus avoiding the loss of reasoning capability. The training script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh). This method also works for other reasoning models such as DeepSeek-R1. The custom dataset format is as follows:

```json
{"messages": [
    {"role": "user", "content": "Where is the capital of Zhejiang?"},
    {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}
]}
```

**Option 2**: Add `/no_think` to the query in the dataset to avoid losing the reasoning capability. The training script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh). The custom dataset format is as follows:

```json
{"messages": [
    {"role": "user", "content": "Where is the capital of Zhejiang? /no_think"},
    {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}
]}
```

You can use the following command to obtain a distilled reasoning dataset. During training, you can mix it with datasets that do not contain chain-of-thought (CoT) data to further mitigate the loss of reasoning ability:

- Any dataset can be passed to `--val_dataset`; the inference results saved to `--result_path` can then be used directly in training via `--dataset distill_dataset.jsonl`.
- This approach also works for other reasoning models, such as DeepSeek-R1.

```shell
# 4 * 80GiB
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift infer \
    --model Qwen/Qwen3-32B \
    --infer_backend vllm \
    --val_dataset 'AI-ModelScope/alpaca-gpt4-data-en#5000' 'AI-ModelScope/alpaca-gpt4-data-zh#5000' \
    --vllm_gpu_memory_utilization 0.9 \
    --vllm_tensor_parallel_size 2 \
    --vllm_max_model_len 8192 \
    --max_new_tokens 4096 \
    --write_batch_size 1000 \
    --result_path distill_dataset.jsonl
```
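
The distilled data can then be mixed with CoT-free data by listing both under `--dataset` (a sketch; `<your-no-cot-dataset>` is a placeholder for your own data):

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset distill_dataset.jsonl '<your-no-cot-dataset>' \
    --num_train_epochs 1 \
    --max_length 2048 \
    --output_dir output
```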

### 30-Minute Self-Cognition Fine-Tuning

This section demonstrates self-cognition fine-tuning of Qwen3-8B in about 30 minutes. A GPU with at least 22GB of VRAM is required; the example runs on the free compute provided by ModelScope, such as an A10 instance.

After training, the model will no longer identify itself as a "Qwen" trained by "Tongyi Lab," but rather as a "swift-robot" trained by "swift."

If you need to train in an offline environment, you can manually download the model and dataset, and specify `--model <model-path>` and `--dataset <dataset-dir>`. The dataset is available on the [ModelScope Hub](https://modelscope.cn/datasets/swift/self-cognition). You can view the preprocessing function for the `swift/self-cognition` dataset [here](https://github.com/modelscope/ms-swift/blob/36fdf381e5e88cb8a71c9d69c1d8936a989318cc/swift/llm/dataset/dataset/llm.py#L882).
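
For example, the model and dataset can be fetched ahead of time with the ModelScope CLI (a sketch; flag support may vary across `modelscope` versions, see `modelscope download --help`):

```shell
pip install modelscope -U
modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B
modelscope download --dataset swift/self-cognition --local_dir ./self-cognition
```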

For explanations of the parameters used in the training script, please refer to the [Command Line Arguments Documentation](../Instruction/Command-line-parameters.md).

```bash
# GPU Memory Usage: 22GB
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot
```

After fine-tuning, you can test the results using the following script. Note that the `--adapters` part should be modified to point to the final saved checkpoint directory:

```bash
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
```

```text
<<< who are you?
<think>
Okay, the user asked, "who are you?" I need to introduce myself. Let me start by stating my name, swift-robot. Then, I should mention that I'm an AI assistant developed by swift. I should explain my purpose, which is to provide information and assistance. I should also highlight my capabilities, like answering questions, generating text, and engaging in conversation. It's important to keep the tone friendly and approachable. Maybe add something about being here to help and encourage the user to ask anything. Let me check if I covered all the key points: name, developer, purpose, capabilities, and a welcoming statement. Yeah, that should do it. Now, let me put that into a concise and friendly response.
</think>

Hello! I am swift-robot, an artificial intelligence assistant developed by swift. My purpose is to provide information and assistance to users like you. I can answer questions, generate text, and engage in conversations on a wide range of topics. I am here to help, so feel free to ask me anything you need!
```

By default, ms-swift uses the ModelScope community to download models and datasets. If you want to use the HuggingFace community instead, you need to additionally specify `--use_hf true`.
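
For example, the earlier inference command can pull the model from the HuggingFace Hub instead (a sketch; the flag works the same way for `swift sft` and `swift export`):

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --use_hf true \
    --stream true \
    --max_new_tokens 2048
```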

Merge LoRA weights:
```shell
swift export \
    --adapters output/checkpoint-xxx \
    --merge_lora true
```
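
The merged weights can then be loaded as a regular model (a sketch assuming ms-swift's default of writing merged weights to a sibling `checkpoint-xxx-merged` directory; check the export log for the exact path):

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model output/checkpoint-xxx-merged \
    --stream true \
    --max_new_tokens 2048
```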

Push the model to ModelScope/HuggingFace:

```shell
# If pushing full weights, change `--adapters` to `--model`.
# You can find your ModelScope hub_token here: https://modelscope.cn/my/myaccesstoken
swift export \
    --adapters output/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<hub-model-id>' \
    --hub_token '<hub-token>' \
    --use_hf false
```

For multi-GPU training, the following example provides a full-parameter setup:

```shell
# 4 * 60GB
# You can run the experiment by setting `--dataset AI-ModelScope/alpaca-gpt4-data-en`
# Note: If you specify `--packing true`, you must also set `--attn_impl flash_attn`

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset '<your-dataset>' \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --packing true \
    --eval_steps 100 \
    --save_steps 100 \
    --logging_steps 5 \
    --max_length 8192 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --save_total_limit 2 \
    --save_only_model true \
    --output_dir output \
    --deepspeed zero3 \
    --use_liger_kernel true \
    --attn_impl flash_attn
```

## Reinforcement Learning (RL)

ms-swift supports RLHF methods such as DPO, GRPO, DAPO, PPO, KTO, and GKD. This section will focus on using ms-swift for GRPO training on Qwen3-8B. For more information about GRPO, refer to the [GRPO documentation](../Instruction/GRPO/GetStarted/GRPO.md). Additional RLHF training scripts can be found in [examples/train/rlhf](https://github.com/modelscope/ms-swift/tree/main/examples/train/rlhf).

### Environment Setup

In addition to the ms-swift dependencies installed above, you also need the following:

```shell
pip install "math_verify==0.5.2"
pip install vllm==0.8.5.post1
```


### Data Preparation

The dataset format for GRPO training with ms-swift is similar to the SFT format, except that the response of the final assistant turn is not required. If accuracy is used as the reward, an additional `solution` column is required so that responses can be checked against it.

Example dataset format:


```jsonl
{"messages": [{"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}
```
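
When the built-in `accuracy` reward is used, as in the script below, each sample additionally carries a `solution` column; a minimal illustration (the values are made up):

```jsonl
{"messages": [{"role": "user", "content": "What is 1 + 1?"}], "solution": "2"}
```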

For data preparation for other RLHF algorithms, please refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md#rlhf).

Notes on dataset requirements:

- **Reward Function Calculation**: The dataset format depends on the reward function being used. Additional columns may be needed to support specific reward calculations. For example:
  - When using built-in `accuracy` or `cosine` rewards, the dataset must include a `solution` column to calculate the accuracy of responses.
  - Other columns in the dataset will be passed as `**kwargs` to the reward function for further customization.
- **Custom Reward Functions**: To customize the reward function according to your specific needs, refer to: [External Reward Plugin](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/plugin). This plugin provides examples and templates for implementing custom reward functions.

We use AI-MO/NuminaMath-TIR as the dataset and compute the accuracy-based reward for model responses.

During training, we use vLLM to accelerate the sampling process.

```bash
# 8 * 70GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset 'AI-MO/NuminaMath-TIR#5000' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --deepspeed zero3 \
    --vllm_tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --log_completions true \
    --overlong_filter true
```

## Megatron-SWIFT

For a best-practice reference on single-node 8xH20 LoRA training of Qwen3-235B-A22B-Instruct-250718, see [this PR](https://github.com/modelscope/ms-swift/pull/5033).

ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT Training Documentation](../Instruction/Megatron-SWIFT-Training.md).
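
As a sketch, the `Qwen3-30B-A3B-Base-mcore` directory loaded below can be produced with `swift export` (verify the flags against the Megatron-SWIFT documentation for your installed version):

```shell
# HF -> MCore weight conversion (use `--to_hf true` for the reverse direction)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen3-30B-A3B-Base \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-Base-mcore
```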

We will use Alibaba Cloud DLC to launch training. The training environment consists of two nodes equipped with 8x 80GiB A800 GPUs each. For more information on multi-node launching, see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node).

```bash
# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Ensure that the weight save path `--save` and packing cache path `--packing_cache` are the same and shared across both nodes.
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --split_dataset_ratio 0.01 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash
```


Training loss chart (partial):

<img width="910" alt="Image" src="https://github.com/user-attachments/assets/9fe393aa-8299-4659-aa2f-be5d44f0730b" />

Inference result (screenshot):

<img width="1066" alt="Image" src="https://github.com/user-attachments/assets/1a924130-1954-43e9-9093-b019aeef5949" />


The custom dataset format is the same as that used in `swift sft`. For details, see the previous sections. Simply specify `--dataset <dataset_path>`.

A comparison of training speed and GPU memory usage when performing full-parameter fine-tuning of the Qwen3-30B-A3B model using `megatron sft` and `swift sft` is shown below:

|                  | Megatron-LM | DeepSpeed-ZeRO2 | DeepSpeed-ZeRO3 |
| ---------------- | ----------- | --------------- | --------------- |
| Training speed   | 9.6s/it     | -               | 91.2s/it        |
| GPU memory usage | 16 * 60GiB  | OOM             | 16 * 80GiB      |