# Inference and Deployment

Below are the inference engines supported by Swift along with their corresponding capabilities. The three inference acceleration engines provide acceleration for Swift's inference, deployment, and evaluation modules:

| Inference Acceleration Engine | OpenAI API | Multimodal | Quantized Model | Multiple LoRAs | QLoRA | Batch Inference | Parallel Techniques |
| ----------------------------- | ---------- | ---------- | --------------- | -------------- | ----- | --------------- | ------------------- |
| pytorch | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) | DDP/device_map |
| [vllm](https://github.com/vllm-project/vllm) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | ❌ | ✅ | TP/PP/DP |
| [sglang](https://github.com/sgl-project/sglang) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | TP/PP/DP/EP |
| [lmdeploy](https://github.com/InternLM/lmdeploy) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | ✅ | ❌ | ❌ | ✅ | TP/DP |

## Inference

ms-swift uses a layered design philosophy, allowing users to perform inference through the command-line interface, the web UI, or directly in Python. For inference with a model fine-tuned using LoRA, please refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#inference-fine-tuned-model).

### Using the CLI

**Full-parameter model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

**LoRA model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters swift/test_lora \
    --stream true \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048
```

**Command-line inference instructions**

The commands above start an interactive command-line session. After running the script, simply enter your query in the terminal. You can also enter the following special commands:

- `multi-line`: switch to multi-line mode, which allows line breaks in the input; end the input with `#`.
- `single-line`: switch to single-line mode, where a line break marks the end of the input.
- `reset-system`: reset the system prompt and clear the history.
- `clear`: clear the history.
- `quit` or `exit`: exit the conversation.

**Multimodal model:**

```shell
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift infer \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

To perform inference with a multimodal model, add tags such as `<image>`, `<video>`, or `<audio>` in the query and provide the corresponding file paths or URLs.
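Besides the CLI, inference can also be driven directly from Python, per the layered design mentioned above. The following is a minimal sketch; it assumes the `PtEngine`, `RequestConfig`, and `InferRequest` interfaces exported by `swift.llm`, used as in the ms-swift example scripts:

```python
# Minimal sketch of Python-based inference with the PyTorch ("pt") backend.
# Assumes the PtEngine / RequestConfig / InferRequest interfaces from swift.llm.
from swift.llm import InferRequest, PtEngine, RequestConfig

# Load the model; max_batch_size controls how many requests are batched together.
engine = PtEngine('Qwen/Qwen2.5-7B-Instruct', max_batch_size=8)

# Sampling parameters, mirroring the CLI flags used above.
request_config = RequestConfig(max_tokens=2048, temperature=0)

# Requests use OpenAI-style message lists.
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
]
resp_list = engine.infer(infer_requests, request_config)
print(resp_list[0].choices[0].message.content)
```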
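The same Python route also works for multimodal models. The sketch below assumes that `InferRequest` accepts an `images` field holding local paths or URLs, as in the ms-swift multimodal examples, and uses a sample image URL from those examples:

```python
# Sketch of multimodal inference (assumption: InferRequest exposes an
# `images` field for local paths or URLs, as in ms-swift's examples).
from swift.llm import InferRequest, PtEngine, RequestConfig

engine = PtEngine('Qwen/Qwen2.5-VL-3B-Instruct', max_batch_size=1)
request_config = RequestConfig(max_tokens=512, temperature=0)

# The <image> tag marks where the image is inserted into the prompt;
# the image itself is passed separately via `images`.
infer_request = InferRequest(
    messages=[{'role': 'user', 'content': '<image>Describe this image.'}],
    images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'],
)
resp_list = engine.infer([infer_request], request_config)
print(resp_list[0].choices[0].message.content)
```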