# Frequently Asked Questions

Here are some common questions encountered when using Swift.

## Training

### Q1: What models and datasets are supported for fine-tuning in Swift?
Please refer to the documentation on [Supported Models and Datasets](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html).

### Q2: What data formats are supported when training with custom datasets?
For custom dataset formats, see the documentation on [Custom Dataset](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html).

### Q3: What is the format of dataset_info.json for custom datasets, and how can I use it?
The dataset_info.json format can be found in the documentation on [Custom Dataset](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html). On the command line, pass `--custom_dataset_info xxx.json` and select the dataset via `--dataset`.

### Q4: How can I train with a custom dataset using the web UI?
Using a custom dataset through the web UI works the same way as on the command line. Refer to the documentation on [Custom Dataset](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html).

### Q5: Can a line in the jsonl file look like this? `{"index": "00000", "query": "11111", "response": "22222", 'source':'qqq'}`
Yes. Additional fields are allowed; they simply won't be used.

### Q6: Where can I find the command line parameters?
Please refer to the documentation on [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q7: What parameters need to be configured for training in an offline environment?
Use `--model local_path` and `--check_model false`. For more details, see the [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q8: Where can I check the model_type?
Check the [Supported Models and Datasets](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html).

### Q9: Can I directly convert the model to gguf format after training?
Currently, only export to ModelFile is supported. See the [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q10: Does Swift support pre-training? I only see SFT.
Yes, it does. Use the command `swift pt`; see the [pt example](https://github.com/modelscope/ms-swift/tree/main/examples/train/pretrain). The dataset format is detailed in [Custom Dataset](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html).

### Q11: For models fine-tuned with LoRA, do I need to merge them into one model to resume training, or can I specify the original model and the LoRA block by path directly?
You do not need to merge. Use `--resume_from_checkpoint output/xxx/vx-xxx/checkpoint-xxx`. See the [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q12: I would like to control where the original model weights downloaded from the internet are stored. How can I place the original model in a specific folder?
You can set the environment variable `MODELSCOPE_CACHE=your_path` to store the original model in the specified path. For SDK downloads, use `cache_dir="local_path"`. You can also use the `modelscope download` command-line tool or `git` to download it. For details, refer to [Download Model](https://modelscope.cn/docs/Models/Download-Model). During training, set `--model` to the local path. For offline training, also configure `--check_model false`. See the [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).
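A minimal sketch of the two SDK approaches mentioned above (the model id and paths below are placeholders):

```python
import os

# Option 1: set the ModelScope cache directory before anything from modelscope is imported.
os.environ['MODELSCOPE_CACHE'] = '/data/modelscope_cache'

# Option 2: download a specific model into an explicit directory via the SDK.
from modelscope import snapshot_download

model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/data/models')
print(model_dir)  # pass this local path to training via --model
```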
### Q13: Has anyone encountered this issue with ms-swift?
```text
[rank6]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank6]: stage3_prefetch_bucket_size
[rank6]:   Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=11560550.4, input_type=float]
[rank6]:     For further information visit https://errors.pydantic.dev/2.8/v/int_from_float
```
Downgrade `deepspeed` to `0.14.*`.

### Q14: Is there a complete tutorial and command line for fine-tuning Qwen2-VL?
See the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) for multimodal model training.

### Q15: Are there any tricks supported for fine-tuning multimodal large models, similar to the LLM's neftune?
You can try `lora` variants such as `pissa`/`olora`/`dora`, or `fourierft`. Refer to the tricks among the `sft` parameters; note that some of them may not apply to multimodal models.

### Q16: The accuracy from eval during training and the accuracy computed by re-running inference with the saved checkpoint are not consistent.
The methods for calculating eval accuracy during training and during inference are different. The default `acc_strategy` is `token`; the selectable values are `token` and `seq`.

### Q17: Official ModelScope image and Swift environment.
You can start a container using `docker run`, for example: `docker run --gpus all -p 8000:8000 -it -d --name ms modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-1.26.0-LLM /bin/bash`. After starting the container, pull the latest code to install Swift. Additionally, for large model training scenarios, the `ms-swift` image is provided, which includes additional dependencies for `Megatron-SWIFT`, for example: `modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.26.0-swift3.4.1.post1`. For more details, refer to the [Swift installation documentation](https://swift.readthedocs.io/en/latest/GetStarted/SWIFT-installation.html).

### Q18: Command line for multi-machine multi-card training.
For details, see the [multi-node example](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node).

### Q19: How do I choose a template?
See this [issue](https://github.com/modelscope/ms-swift/issues/1813).

### Q20: How do I use torchrun and swift sft for multi-card training?
`swift sft` uses `torchrun` under the hood.

### Q21: My SFT dataset is too large and tokenizing takes a long time. Is there a solution?
Use `lazy_tokenize` or streaming (`streaming`). See the [Command Line Parameters documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.html).

### Q22: When two datasets are simply appended together in the training set, does the model shuffle internally during training, or does it take the data in order?
See the command-line parameter `dataset_shuffle`. For more details, see the [Command Line Parameters documentation](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q23: If the model is split across two cards and the data is not parallelized, deepspeed throws an error. How do I handle this?
`deepspeed` and `device_map` are incompatible; you can only choose one.
### Q24: Why does the dataset need to be downloaded again when retraining offline, even though it was already downloaded online?
The dataset file contains URLs, which do not support offline training.

### Q25: How can I reduce GPU memory usage when training VLM models?
Set `--freeze_vit true`, and use the parameter `--max_pixels` to limit the maximum number of pixels.

### Q26: Why are fewer models supported in the web UI than in the documentation?
Upgrade `ms-swift`.

### Q27: For models that do not have a suitable model_type, can I customize special_tokens and chat_template during SFT?
Yes, you can. Refer to the PRs for model integration and the custom model/dataset documentation.

### Q28: Can I use DPO to train Qwen2-VL in a Python script?
Yes, import `rlhf_main` and `RLHFArguments` from `swift.llm`; a minimal sketch is shown below, after Q44.

### Q29: Can I pre-train with pure text before fine-tuning on a VQA dataset for an MLLM?
Yes; you can also mix them during training.

### Q30: When doing DPO training based on a qwen2 SFT model on a V100 machine, the loss shows NaN?
Use fp32 for training on V100 machines.

### Q31: Does Swift support distillation?
Refer to this [example](https://github.com/modelscope/ms-swift/blob/main/examples/sampler/distill/distill.sh).

### Q32: The default maximum number of checkpoints saved after training is two. How can I increase this number?
Use `--save_total_limit`. See the [Command Line Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html).

### Q33: In grounding tasks, does the universal data format support multiple instances of one category?
Currently, one object can correspond to multiple bounding boxes. Refer to the documentation on [Custom Dataset](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#grounding).

### Q34: Why am I getting the error that numpy.object cannot be found?
Try using `numpy==1.26.3`.

### Q35: Does the Swift framework support sequence parallelism now?
Yes, it does. Refer to this [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text).

### Q36: When fine-tuning qwen2-1.5B on a V100, I see `'loss': 0.0, 'acc': 0.0, 'grad_norm': nan`. What is the issue?
Try using fp32.

### Q37: Is it possible to fully fine-tune GPTQ-quantized models?
No. The int-type parameters of a GPTQ model cannot receive gradients; they can only be updated through additional structures such as LoRA.

### Q38: What parameters should I set for fine-tuning glm4-chat with QLoRA?
Refer to the QLoRA [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora).

### Q39: How do I expand the vocabulary within the Swift framework?
Swift currently does not support vocabulary expansion.

### Q40: Can I directly use models with the same name from Hugging Face?
Set the environment variable `USE_HF=1`.

### Q41: Can Qwen2-VL-2B be used for incremental pre-training? Is there guidance available?
Yes, it supports incremental pre-training. Just put all the content in the response.

### Q42: When training with videos, how can I control the frame sampling rate? The `frame_rate` setting doesn't seem to work; I'm using MiniCPM-V.
Set the environment variable `MAX_NUM_FRAMES`.

### Q43: Can I save the inference results of the validation set during training in Swift?
After training, run `swift infer` to save the results.

### Q44: Why is the saved checkpoint larger than the original model file after full-parameter DPO?
Fine-tuning on a V100 stores the weights in fp32.
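Following up on Q28, here is a hypothetical minimal sketch of a DPO run driven from Python; the model id, dataset path and hyper-parameters are placeholders, and the `RLHFArguments` fields are assumed to mirror the command-line parameters:

```python
# Assumed to mirror the CLI: swift rlhf --rlhf_type dpo ...
from swift.llm import rlhf_main, RLHFArguments

result = rlhf_main(RLHFArguments(
    rlhf_type='dpo',
    model='Qwen/Qwen2-VL-7B-Instruct',   # placeholder model id
    dataset=['path/to/dpo_data.jsonl'],  # placeholder DPO dataset
    train_type='lora',
    torch_dtype='bfloat16',
))
```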
### Q45: Training slows down when using multi-machine training; using the Swift framework for LLM training with deepspeed zero3 causes a significant performance drop.
See this [issue](https://github.com/modelscope/ms-swift/issues/1825).

### Q46: Does Swift support multi-stage pre-training for qwen2-vl? The official best practices only show SFT with vit+llm trained together; is separate fine-tuning supported?
You can control this with the parameters `--freeze_vit`, `--freeze_aligner`, and `--freeze_llm`. For more details, see the [Command Line Parameters documentation](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#tuner-arguments).

### Q47: Does qwen2-vl support mixing in pure text data?
It supports both mixed image-text and pure text data.

### Q48: Can I plot loss curves for different datasets during fine-tuning?
Channel loss is supported. Please refer to this [example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh).

### Q49: After training, the model's responses contain a lot of repeated content.
Refer to [Pre-training and Fine-tuning](https://swift.readthedocs.io/en/latest/Instruction/Pre-training-and-Fine-tuning.html). If you notice repetition, try training for more epochs, cleaning the data, performing full-parameter training, or using RLHF to mitigate the issue.

### Q50: Does Swift currently support prompt tuning or prefix tuning?
No. Both methods suffer from serious forgetting issues and are not currently recommended.

### Q51: I encountered the following error when training with two A10s:
```text
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
```
Please check whether shared memory is too small; NCCL requires shared memory.

### Q52: How do I solve the issue of certain parameters not participating in backpropagation when freezing layers during DDP fine-tuning?
Set the parameter `--ddp_find_unused_parameters true`.

### Q53: Does Swift have a dataset quality inspection tool?
[data-juicer](https://github.com/modelscope/data-juicer).

### Q54: Where do I enable model parallelism in the web UI? I only found an option for data parallelism.
Specify the visible GPUs to enable model parallelism.

### Q55: How can I set a fixed location for dataset downloads when using --dataset? I can't find this in the command line parameters. How can I read from the download location next time?
`dataset_path` supports folders, typically for datasets downloaded via `git clone`. See the [Custom Dataset documentation](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#dataset-info-json).

### Q56: When using --streaming true, I get an error asking me to set max_steps when setting num_train_epochs. Can't I just set num_train_epochs?
See the description of the streaming parameter in the [Command Line Parameters documentation](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#data-arguments).

### Q57: Why is tools stored as a "[]" string rather than directly as []?
This is because the underlying pyarrow in datasets enforces strict type control.
For the same reason, the objects part of our official grounding dataset also uses a str; otherwise pyarrow would report errors about inconsistent types across rows.

### Q58: Can the parameter check_dataset_strategy==discard still be used?
This parameter no longer exists in swift 3.0; use the `strict` parameter instead.

### Q59: Getting this error when running the sft command:
```text
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes, or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters being used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
```
Add this parameter: `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`.

### Q60: Have you encountered this issue? AttributeError: 'TrainerState' object has no attribute 'last_model_checkpoint'
The dataset is too small; add more data. This error occurs when the amount of data is less than one step.

### Q61: I see that preprocess can be defined in CustomPreprocessor. Is this processed all at once before training starts, or loaded during training?
If `--streaming true` is set, the data is processed while training. By default, everything is processed before training starts.

### Q62: For full-parameter training of internvl2_5, why do vision_model and mlp1 appear in the freeze parameters by default? The documentation shows freeze_parameters defaults to [], and the command line settings for freeze_vit, freeze_aligner, and freeze_llm are all False. It prints trainable parameters: ['mlp1'], so it is unclear whether only mlp1 is trainable or all parameters are.
The freeze parameters are applied first, then the trainable (active) parameters. The three parameters `freeze_vit`/`freeze_aligner`/`freeze_llm` adjust the freeze parameters and trainable parameters. Since some models' `vit` contains the `aligner`, the aligner is separately added to trainable_parameters.

### Q63: Does LlamaPro in swift support multimodal adaptation?
Yes, it is supported.

### Q64: I noticed 2.x supports MAX_PIXELS. Is the --max_pixels parameter in the 3.x documentation the same thing? What's the processing logic? Using 12000*9000 images with internvl still crashes in 2.x even with rescale_image.
Environment variables correspond to model-specific parameters. `MAX_PIXELS` only supports qwen2-vl; internvl has its own environment variables. See [Specific Model Parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#specific-model-argumen).

### Q65: Is there documentation for fine-tuning a qwen base model into a chat model? Are any special configurations needed?
Use `swift sft`; no special configuration is needed. See the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/base_to_chat).

### Q66: Where can I find sequence parallel examples?
See this example: [sequence_parallel](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text).

### Q67: Can swift support training custom model structures?
Yes, just customize the `get_model_tokenizer_xxx` function to return the `model` and `tokenizer`.
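For Q67, a rough, illustrative sketch of what such a loader can look like; the exact signature and how the function is registered should be taken from the existing `get_model_tokenizer_*` implementations in ms-swift, so treat the names below as assumptions:

```python
# Illustrative only: a get_model_tokenizer-style loader for a custom architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_model_tokenizer_my_arch(model_dir, torch_dtype=None, model_kwargs=None,
                                load_model=True, **kwargs):
    model_kwargs = model_kwargs or {}
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = None
    if load_model:
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch_dtype, trust_remote_code=True, **model_kwargs)
    return model, tokenizer
```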
### Q68: Getting an error using longlora with "name_or_path": "/mnt/workspace/model/Qwen2.5-14B-Instruct". Is longlora only for the llama series?
Yes, `longlora` only works with the llama series.

### Q69: How do I add custom special tokens in swift?
Add them in the `get_model_tokenizer` function.

### Q70: For the --freeze_parameters_ratio parameter, if set to 0.7, does it mean only 30% of the llm parameters are updated during training? Is it a random 30%? What's the update mechanism?
Parameters are frozen from the bottom up: with 0.7, the bottom 70% of parameters are frozen and the top 30% are trained. It is not random.

### Q71: Why is the map process so slow? Is this normal?
```text
Map: 4%|██ | 9000/203823 [02:18<50:34, 64.19 examples/s]
```
Use the `--dataset_num_proc` parameter to enable multiple processes.

### Q72: How can I delete and re-download a dataset? I think there might be an issue with the dataset.
Set the `--download_mode` parameter.

### Q73: How do I solve this error: safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge?
The disk space was insufficient, so the model was not saved completely.

### Q74: Does swift 3.0 not support get_default_template_type?
Please check `model.model_meta.template`; the information is available in `model.model_meta` and `model.model_info`.

### Q75: Does ModelScope Swift support hermes-format agent fine-tuning? I see qwen2.5 uses vllm with native support for hermes-format tool calling; why don't I see it in Swift?
Currently, the `hermes` format is not supported. We mainly support the `toolbench` and `react` formats, as `react` is more widely used. Swift's deployment currently supports parsing these two formats and provides OpenAI-style tool calling.

### Q76: Does model training use left padding by default?
Training can use either left or right padding. The default is right padding, while `batch infer` uses left padding.

### Q77: Does it support grounding tasks now?
Yes, there is an [example](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/grounding.sh) under examples.

### Q78: Does ms-swift support contrastive learning for training llm_emb?
Yes, here is an [example](https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding).

### Q79: Is there a big performance difference between hand-written fine-tuning and GRPO using the peft and trl libraries compared to official Swift training with the same parameters?
The difference is minimal; Swift additionally supports multimodality.

### Q80: Does Swift currently not support audio-modality input training for minicpmo2_6? It shows the error: assert media_type in {'image', 'video'}
Audio is not currently supported.

### Q81: Can Swift fine-tune deepseek R1 671B?
Yes, the template is integrated, but the process is complicated, as it requires converting fp8 to bf16 first.

### Q82: Isn't the latest Swift framework supposed to let me specify the model location with this argument? This is the location of the model I've already downloaded, but it still tries to download and fails with a git clone error.
```shell
--model /mnt/workspace/.cache/modelscope/hub/deepseek-ai/deepseek-vl2/ \
```
Some models require cloning the repo and then specifying it through `local_repo_path`.

### Q83: Does Swift now support multimodal GRPO?
Yes, it does.

### Q84: Can the GRPO reward function be customized?
Yes, refer to [examples/train/grpo/plugin](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/plugin).
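A minimal sketch of the general shape such a reward function takes: it receives the batch of generated completions (plus any extra dataset columns via kwargs) and returns one float per completion. How it is registered as a plugin, and the exact base class and import path, should be copied from the linked plugin example; the class below is a placeholder.

```python
# Illustrative only: a GRPO reward that checks each completion for a required format.
from typing import List

class FormatReward:
    def __call__(self, completions: List[str], **kwargs) -> List[float]:
        # kwargs carries additional dataset columns (e.g. a reference solution).
        return [1.0 if '<answer>' in c and '</answer>' in c else 0.0
                for c in completions]
```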
### Q85: Why do I get this error when using --torch_dtype float16 (my card cannot use bf16): `lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads raise ValueError("Attempting to unscale FP16 gradients.") ValueError: Attempting to unscale FP16 gradients.`
FP16 does not support full-parameter training.

### Q86: I trained a reward model using Swift (the base model is qwen2.5-7b) with LoRA, but loading it in PPO or GRPO reports an error.
```shell
--rlhf_type ppo \
--model Qwen/Qwen2.5-14B-Instruct \
--reward_model /mnt/workspace/output/rm/model \
--train_type lora \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' \
--torch_dtype float32 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 16 \
--eval_steps 100 \
--save_steps 100 \
```
The LoRA-trained reward model needs to be merged first (e.g., with `swift export --merge_lora true`).

### Q87: What version of transformers is needed to fine-tune deepseek_vl2? The official docs say <4.42, but it also shows errors with 4.42 and below. Does the peft version need to be lowered too?
Use `peft==0.11.*`.

### Q88: Generating the train split is too slow (about 30+ datasets with around a million data points in total). Swift 2.x wasn't this slow, and lazy_tokenize is already enabled.
Set `--dataset_num_proc 16`.

### Q89: How can I full-parameter fine-tune the visual encoder while using LoRA to fine-tune the LLM when fine-tuning qwen2.5-vl?
Refer to this [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal/lora_llm_full_vit).

### Q90: How do I use custom loss functions in Swift?
Add them in the plugin.

### Q91: What are the parameters for MoE? I can't find the keywords in the parameter table. How do I set the number of experts and the expert routing parameters?
These are taken directly from the model's `config.json`; there are no separate command-line parameters.

### Q92: Using lmdeploy in GRPO training reports a missing function: load_weights is not found in the LmdeployEngine class.
It is only supported under the turbomind engine.

### Q93: Getting errors when fine-tuning the Moonlight-16B-A3B-Instruct model. Does ms-swift not support fine-tuning this model?
Training is disabled in the model files. Refer to deepseek_vl2's solution in the issues.

### Q94: How do I solve this error: RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'?
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model Internlm3-8b \
    --dataset train.json \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --deepspeed zero3 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4
```
Upgrade torch.

### Q95: Is it normal that both loss and grad_norm are 0 during GRPO training?
```text
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9e-08, 'memory(GiB)': 88.1, 'train_speed(iter/s)': 0.009252, 'completion_length': 150.00000763, 'response_clip_ratio': 0.0, 'rewards/Format': 1.0, 'reward': 1.0, 'reward_std': 0.0, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/1052', 'percentage': '0.10%', 'elapsed_time': '36s', 'remaining_time': '10h 43m 54s'}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.8e-07, 'memory(GiB)': 94.15, 'train_speed(iter/s)': 0.014782, 'completion_length': 133.25000763, 'response_clip_ratio': 0.0, 'rewards/Format': 1.0, 'reward': 1.0, 'reward_std': 0.0, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.0, 'global_step/max_steps': '2/1052', 'percentage': '0.19%', 'elapsed_time': '1m 3s', 'remaining_time': '9h 19m 49s'}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.7e-07, 'memory(GiB)': 94.15, 'train_speed(iter/s)': 0.018695, 'completion_length': 123.08333969, 'response_clip_ratio': 0.0, 'rewards/Format': 1.0, 'reward': 1.0, 'reward_std': 0.0, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.0, 'global_step/max_steps': '3/1052', 'percentage': '0.29%', 'elapsed_time': '1m 29s', 'remaining_time': '8h 39m 34s'}
```
A loss close to 0 during GRPO training is normal; refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).

### Q96: Where can I pass in accuracy_orm for GRPO's built-in reward function?
Currently this requires modifying the code directly.

### Q97: I notice the reward function has a solution parameter. Does it need to be passed from the dataset? Must my dataset have a solution field?
Yes, it is necessary for math problems in order to calculate accuracy.

### Q98: Why is there no token_acc during training?
Some models have mismatched `logits` and `labels` counts, so token accuracy is not calculated.

### Q99: When fine-tuning Ovis2, the LoRA parameters don't seem to take effect? Memory usage doesn't change with or without --train_type lora.
Limit `--max_length`; this model is special and needs padding to max_length.

### Q100: Getting a ValueError when running a classification task with Qwen2.5: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids, attention_mask. The dataset format is: {"messages": [{"role": "user", "content": "xxxxx"}, {"label": 1}]}
Put `label` at the same level as `messages`, not inside it, i.e. `{"messages": [{"role": "user", "content": "xxxxx"}], "label": 1}`.

### Q101: How do I exit the VllmEngine? I want to release GPU memory after inference rather than keeping it occupied.
Use sleep mode: `engine.sleep(level=1)` / `engine.wake_up()`, with `enable_sleep_mode=True` during initialization; see the sketch below Q106.

### Q102: Does trainer_sampler_random have no effect in streaming mode?
Streaming is not random.

### Q103: Can trust_remote_code be set when using vLLM for GRPO training?
It is true by default.

### Q104: For large-dataset pretraining using streaming and packing, is there a way to calculate the total steps from epochs, batch size, etc. when setting max_steps?
Set `--max_steps` or `--max_epochs`. In streaming mode the trainer cannot infer the dataset length, so estimate it yourself: total steps ≈ num_samples × num_epochs / (per_device_train_batch_size × gradient_accumulation_steps × world_size). For more details, see the streaming parameter description in the [Command Line Parameters documentation](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#data-arguments).

### Q105: Unsloth training error: `assert(type(target_modules) in (list, tuple,))` when using --target_modules all-linear
Don't use `all-linear`; specify a concrete module list instead, e.g. `--target_modules q k v`.

### Q106: Does Swift support multi-label classification now?
Yes. Check the custom dataset docs for the format and search for `problem_type` in the command line parameter docs.
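For Q101, a minimal sketch of the sleep-mode flow described above; the engine class is assumed to be importable from `swift.llm`, and the model id is a placeholder:

```python
# Illustrative only: release GPU memory between inference calls via sleep mode.
from swift.llm import VllmEngine

engine = VllmEngine('Qwen/Qwen2.5-7B-Instruct', enable_sleep_mode=True)  # placeholder model id
# ... run inference with the engine ...
engine.sleep(level=1)  # frees most GPU memory held by the engine
# ... do other GPU work ...
engine.wake_up()       # restore the engine before the next inference call
```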
### Q107: How does flash_attn handle packing - are packed samples attended to separately or merged?
Flash attention is required when packing to avoid errors; otherwise the attention_mask would have issues.

### Q108: For qwen2.5-omni, does setting --freeze_vit false mean both the visual encoder and the audio encoder are enabled? Is there a way to enable only the audio encoder without the visual encoder?
Use `--target_regex`.

### Q109: Does swift currently support sequence parallelism for the reinforcement learning training methods?
It supports pt, sft, dpo, and grpo.

### Q110: After LoRA SFT, is tokenizer.json not saved?
LoRA checkpoints don't save it; it is copied over after merging, because the LoRA directory is meant to be used together with the original model.

### Q111: Can the reward_model and reward_funcs of GRPO be used together?
Yes, they can.

### Q112: Is there a parameter that can be adjusted to avoid introducing the KL term in GRPO?
Search for `beta` in the command line parameters.

### Q113: When doing GRPO, how can I access the original labels in a custom ORM reward function? I printed the messages field in kwargs, and the assistant content in each item has been replaced by the generated result.
Place the labels in another column.

### Q114: If I use the default num_iterations=1, does clip become ineffective? The clip-higher trick in DAPO would also be useless. I see that veRL has a micro-batch setting to update the policy model in small batches so the clip term takes effect; in ms-swift, it seems mini batch only does gradient accumulation according to the source code?
Yes, num_iterations needs to be > 1.

### Q115: Does qwen2.5-omni support full-parameter training, and does it support talker training?
Currently, it does not support talker training, only the thinker.

### Q116: Can sequence parallelism be enabled at the same time as the liger kernel?
Yes, it can.

### Q117: What are the requirements for the rm and policy in PPO training?
PPO currently only supports rm and policy from the same model series (same tokenizer/template).

### Q118: I want to fine-tune the 3.2 1B model, because llama3.1 has no models smaller than 8B. Can I still use the Llama-3.1 reward model?
The requirement is that the template and tokenizer must be the same, so 3.1 and 3.2 should be fine.

### Q119: Can swift cache a mapped version of the data for troubleshooting training data issues?
Set `--load_from_cache_file false`.

### Q120: Why is there a warning "none of the inputs have requires_grad=True" during full parameter training?
If the vit is not being trained, this warning is normal; if it is being trained, it should not occur.

### Q121: Does qwen2.5-vl ulysses currently support sdpa?
The vl model currently only supports flash-attn, but both are supported for pure text.

### Q122: Is the image list format for videos now supported? The format is as follows:
```json
{"messages": [{"role": "assistant", "content": "