
Pre-training and Fine-tuning

Training Capability:

| Method | Full-Parameter | LoRA | QLoRA | Deepspeed | Multi-Node | Multi-Modal |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Instruction Supervised Fine-tuning | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GRPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reward Model Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| PPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| GKD Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| KTO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| SimPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ORPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Classification Model Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Embedding Model Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Environment Preparation

Refer to the SWIFT installation documentation for recommended versions of third-party libraries.

pip install ms-swift -U

# If using deepspeed zero2/zero3
pip install deepspeed -U

Pre-training

Pre-training is done with the swift pt command, which automatically uses the generative template rather than the conversational template, i.e., use_chat_template is set to false (all other commands, such as swift sft/rlhf/infer, default use_chat_template to true). In addition, swift pt expects a different dataset format from swift sft; see the Custom Dataset Documentation.

You can refer to the CLI script for pre-training here. For more information on training techniques, please refer to the fine-tuning section.

Tips:

  • swift pt is equivalent to swift sft --use_chat_template false.
  • swift pt typically runs on large datasets, so combining it with --streaming (streaming dataset loading) is recommended.
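
As a point of reference, a minimal pre-training launch might look like the sketch below. The model, dataset path, and hyperparameters are placeholders to adapt to your setup; because --streaming hides the dataset length, --max_steps is set explicitly.

# GPU count and batch settings are assumptions; adjust them to your hardware
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model Qwen/Qwen2.5-7B \
    --dataset <your-pretraining-dataset> \
    --train_type full \
    --streaming true \
    --max_steps 10000 \
    --max_length 8192 \
    --torch_dtype bfloat16 \
    --deepspeed zero3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --output_dir output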

Fine-tuning

ms-swift employs a hierarchical design philosophy, allowing users to perform fine-tuning through the command line interface, Web-UI interface, or directly using Python.

Using CLI

We provide best practices for self-cognition fine-tuning of Qwen2.5-7B-Instruct on a single 3090 GPU in 10 minutes; for details, refer to here. This can help you quickly understand SWIFT.
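
For orientation, a single-GPU LoRA run of this kind might look like the sketch below. The dataset mixture mirrors the Python debugging example later in this document, and the hyperparameters are illustrative rather than tuned values, so treat the linked best practice as the authoritative version.

# hyperparameters are illustrative; --model_author/--model_name fill the self-cognition dataset placeholders
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --max_length 2048 \
    --output_dir output \
    --model_author swift \
    --model_name swift-robot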

Additionally, we offer a series of scripts to help you understand the training capabilities of SWIFT:

  • Lightweight Training: Examples of lightweight fine-tuning supported by SWIFT can be found here. (Note: These methods can also be used for pre-training, but pre-training typically uses full parameter training.)
  • Distributed Training: SWIFT supports distributed training techniques, including: DDP, device_map, DeepSpeed ZeRO2/ZeRO3, and FSDP.
    • device_map: Simplified model parallelism. If multiple GPUs are available, device_map will be automatically enabled. This evenly partitions the model layers across visible GPUs, significantly reducing memory consumption, although training speed may decrease due to serial processing.
    • DDP + device_map: Models will be grouped and partitioned using device_map. Refer to here for details.
    • DeepSpeed ZeRO2/ZeRO3: Saves memory at some cost in training speed. ZeRO2 shards optimizer states and gradients; ZeRO3 additionally shards the model parameters, saving even more memory but slowing training further. Refer to here for details; a minimal launch sketch is also shown after this list.
    • FSDP + QLoRA: Training a 70B model on two 3090 GPUs. Refer to here.
    • Multi-node Multi-GPU Training: We have provided example shell scripts for launching multi-node runs using swift, torchrun, dlc, deepspeed, and accelerate. Except for dlc and deepspeed, the other launch scripts need to be started on all nodes to run properly. Please refer to here for details.
  • Quantization Training: Supports QLoRA training using quantization techniques such as GPTQ, AWQ, AQLM, BNB, HQQ, and EETQ. Fine-tuning a 7B model only requires 9GB of memory. For more details, refer to here.
  • Multi-modal Training: SWIFT supports pre-training, fine-tuning, and RLHF for multi-modal models. It supports tasks such as Captioning, VQA, OCR, and Grounding. It supports three modalities: images, videos, and audio. For more details, refer to here. The format for custom multi-modal datasets can be found in the Custom Dataset Documentation.
    • For examples of using full-parameter training for ViT/Aligner, LoRA training for LLM, and employing different learning rates, refer to here.
    • For multimodal model packing to increase training speed, refer to the example here.
  • RLHF Training: Refer to here. For multi-modal models, refer to here. For GRPO training, refer to here. For reinforcement fine-tuning, see here.
  • Megatron Training: Supports the use of Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and context parallelism. Refer to the Megatron-SWIFT Training Documentation.
  • Sequence Classification Model Training: Refer to here.
  • Embedding Model Training: Refer to here.
  • Agent Training: Refer to here.
  • Any-to-Any Model Training: Refer to here.
  • Other Capabilities:
    • Streaming Data Reading: Reduces memory usage when handling large datasets. Refer to here.
    • Packing: Combines multiple sequences into one, making each training sample as close to max_length as possible to improve GPU utilization. Refer to here.
    • Long Text Training: Refer to here.
    • Lazy Tokenize: Performs tokenization during training rather than before training starts (for multi-modal models, this avoids loading all multi-modal resources up front), which reduces preprocessing wait time and saves memory. Refer to here.
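
As referenced in the DeepSpeed item above, a multi-GPU ZeRO2 LoRA launch might look like the following sketch. The GPU count, dataset, and hyperparameters are assumptions; the linked example scripts are the tested versions.

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-en#2000' \
    --torch_dtype bfloat16 \
    --deepspeed zero2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --output_dir output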

Tips:

  • When fine-tuning a base model into a chat model with LoRA using swift sft, you may sometimes need to set the template manually. Add the --template default parameter to avoid the base model failing to stop generation when it encounters special tokens from the chat template that it never saw during pre-training. For more details, see here.
  • If you need to train in an offline environment, please set --model <model_dir> and --check_model false. If the corresponding model requires git clone from GitHub repositories, such as deepseek-ai/Janus-Pro-7B, please manually download the repository and set --local_repo_path <repo_dir>. For specific parameter meanings, refer to the command line parameter documentation.
  • LoRA weights trained with QLoRA cannot be merged into the base model, so QLoRA is not recommended for fine-tuning: the unmerged adapter cannot use vLLM/SGLang/LMDeploy for inference acceleration during inference and deployment. Instead, use LoRA or full-parameter fine-tuning, merge the result into complete weights, and then quantize with GPTQ/AWQ/BNB.
  • If you are using an NPU for training, simply change CUDA_VISIBLE_DEVICES in the shell to ASCEND_RT_VISIBLE_DEVICES.
  • By default, SWIFT sets --gradient_checkpointing true during training to save memory, which may slightly slow down the training speed.
  • If you are using DDP for training and encounter the error: RuntimeError: Expected to mark a variable ready only once., please additionally set the parameter --gradient_checkpointing_kwargs '{"use_reentrant": false}' or use DeepSpeed for training.
  • To use DeepSpeed, you need to install it: pip install deepspeed -U. Using DeepSpeed can save memory but may slightly reduce training speed.
  • If your machine has high-performance GPUs like A100 and the model supports flash-attn, it is recommended to install flash-attn and set --attn_impl flash_attn, as this will accelerate training and inference while slightly reducing memory usage.

How to debug:

You can use the following method for debugging, which is equivalent to using the command line for fine-tuning, but this method does not support distributed training. You can refer to the entry point for the fine-tuning command line here.

from swift.llm import sft_main, TrainArguments
result = sft_main(TrainArguments(
    model='Qwen/Qwen2.5-7B-Instruct',
    train_type='lora',
    dataset=['AI-ModelScope/alpaca-gpt4-data-zh#500',
             'AI-ModelScope/alpaca-gpt4-data-en#500',
             'swift/self-cognition#500'],
    torch_dtype='bfloat16',
    # ...
))

Using Web-UI

If you want to use the interface for training, you can refer to the Web-UI documentation.
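
Assuming a standard installation, the interface can be launched locally with the command below; open the printed local URL in a browser to configure and start training.

# starts the local training UI; see the Web-UI documentation for available options
swift web-ui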

Using Python

  • For the Qwen2.5 self-cognition fine-tuning notebook, see here.
  • For the Qwen2VL OCR task notebook, see here.

Merge LoRA
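
A common way to merge trained LoRA weights into the base model is via swift export; the checkpoint path below is a placeholder, and the merged weights are typically written to a sibling directory with a -merged suffix.

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --merge_lora true

Alternatively, --merge_lora true can be passed directly to swift infer or swift deploy, as noted in the next section.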

Inference (Fine-Tuned Model)

To perform inference on a LoRA-trained checkpoint using the CLI:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --infer_backend pt \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
  • The adapters directory contains the training parameter file args.json, so there is no need to specify --model or --system explicitly; SWIFT reads these parameters automatically. If you want to disable this behavior, set --load_args false.
  • If you are using full parameter training, please use --model instead of --adapters to specify the training checkpoint directory. For more information, refer to the Inference and Deployment documentation.
  • You can use swift app instead of swift infer for interactive inference.
  • You can choose to merge LoRA (by additionally specifying --merge_lora true), and then specify --infer_backend vllm/sglang/lmdeploy for inference acceleration.

For batch inference on the validation set of the dataset:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048 \
    --load_data_args true \
    --max_batch_size 1
  • You can set --max_batch_size 8 to enable batch processing with --infer_backend pt. With --infer_backend vllm/sglang/lmdeploy, batching is handled automatically and this parameter is not needed.
  • --load_data_args true will additionally read the data parameters from the training storage parameter file args.json.

If you want to perform inference on an additional test set instead of using the training validation set, use --val_dataset <dataset_path> for inference:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048 \
    --val_dataset <dataset-path> \
    --max_batch_size 1

Example of Inference on LoRA-Trained Model Using Python:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    PtEngine, RequestConfig, safe_snapshot_download, get_model_tokenizer, get_template, InferRequest
)
from swift.tuners import Swift
# Please adjust the following lines
model = 'Qwen/Qwen2.5-7B-Instruct'
lora_checkpoint = safe_snapshot_download('swift/test_lora')  # Change to your checkpoint_dir
template_type = None  # None: use the default template_type of the corresponding model
default_system = "You are a helpful assistant."  # None: use the default system prompt of the corresponding model

# Load model and dialogue template
model, tokenizer = get_model_tokenizer(model)
model = Swift.from_pretrained(model, lora_checkpoint)
template_type = template_type or model.model_meta.template
template = get_template(template_type, tokenizer, default_system=default_system)
engine = PtEngine.from_model_template(model, template, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Using 2 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': 'Where is the capital of Zhejiang?'},
                           {'role': 'assistant', 'content': 'The capital of Zhejiang is Hangzhou.'},
                           {'role': 'user', 'content': 'What is good to eat here?'},]),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')

Example of LoRA Inference for Multi-Modal Model:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    PtEngine, RequestConfig, safe_snapshot_download, get_model_tokenizer, get_template, InferRequest
)
from swift.tuners import Swift
# Please adjust the following lines
model = 'Qwen/Qwen2.5-VL-7B-Instruct'
lora_checkpoint = safe_snapshot_download('swift/test_grounding')  # Change to your checkpoint_dir
template_type = None  # None: use the default template_type of the corresponding model
default_system = None  # None: use the default system prompt of the corresponding model

# Load model and dialogue template
model, tokenizer = get_model_tokenizer(model)
model = Swift.from_pretrained(model, lora_checkpoint)
template_type = template_type or model.model_meta.template
template = get_template(template_type, tokenizer, default_system=default_system)
engine = PtEngine.from_model_template(model, template, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Using 2 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': '<image>Task: Object Detection'}],
                 images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')

If you are using a model trained with ms-swift, you can obtain the training configuration as follows:

from swift.llm import safe_snapshot_download, BaseArguments

lora_adapters = safe_snapshot_download('swift/test_lora')
args = BaseArguments.from_pretrained(lora_adapters)
print(f'args.model: {args.model}')
print(f'args.model_type: {args.model_type}')
print(f'args.template_type: {args.template}')
print(f'args.default_system: {args.system}')
  • To perform inference on a checkpoint trained with full parameters, set model to checkpoint_dir and lora_checkpoint to None. For more information, refer to the Inference and Deployment documentation.
  • For streaming inference and acceleration using VllmEngine, SglangEngine and LmdeployEngine, you can refer to the inference examples for large models and multi-modal large models.
  • For inference on fine-tuned models using the Hugging Face transformers/PEFT ecosystem, you can see here.
  • If you have trained multiple LoRAs and need to switch among them, refer to the inference and deployment examples.
  • For grounding tasks in multi-modal models, you can refer to here.
  • For inference on a LoRA fine-tuned BERT model, see here.

Deployment (Fine-Tuned Model)

Use the following command to start the deployment server. If the weights are trained using full parameters, please use --model instead of --adapters to specify the training checkpoint directory. You can refer to the client calling methods described in the Inference and Deployment documentation: curl, OpenAI library, and Swift client.

CUDA_VISIBLE_DEVICES=0 \
swift deploy \
    --adapters output/vx-xxx/checkpoint-xxx \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048 \
    --served_model_name '<model-name>'

Below is a complete example of deploying and calling multiple LoRAs using vLLM.

Server Side

First, install vLLM (pip install vllm -U) and use --infer_backend vllm when deploying; this significantly speeds up inference.

We have trained two LoRA adapters with different self-cognition on top of Qwen/Qwen2.5-7B-Instruct (the example below runs as-is); the relevant training information can be found in args.json. To deploy your own weights, simply change --adapters to the local paths of your trained LoRA checkpoints.

CUDA_VISIBLE_DEVICES=0 \
swift deploy \
    --adapters lora1=swift/test_lora lora2=swift/test_lora2 \
    --infer_backend vllm \
    --temperature 0 \
    --max_new_tokens 2048

Client Side

Here, we will only cover calling using the OpenAI library. Examples for calling with curl and the Swift client can be referenced in the Inference and Deployment documentation.

from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url=f'http://127.0.0.1:8000/v1',
)
models = [model.id for model in client.models.list().data]
print(f'models: {models}')

query = 'who are you?'
messages = [{'role': 'user', 'content': query}]

resp = client.chat.completions.create(model=models[1], messages=messages, max_tokens=512, temperature=0)
query = messages[0]['content']
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

gen = client.chat.completions.create(model=models[2], messages=messages, stream=True, temperature=0)
print(f'query: {query}\nresponse: ', end='')
for chunk in gen:
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
"""
models: ['Qwen2.5-7B-Instruct', 'lora1', 'lora2']
query: who are you?
response: I am an artificial intelligence model named swift-robot, developed by swift. I can answer your questions, provide information, and engage in conversation. If you have any inquiries or need assistance, feel free to ask me at any time.
query: who are you?
response: I am an artificial intelligence model named Xiao Huang, developed by ModelScope. I can answer your questions, provide information, and engage in conversation. If you have any inquiries or need assistance, feel free to ask me at any time.
"""