Instructions for using HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B with libraries and local apps.

Libraries

How to use HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
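
The same request can also be sent from Python. The sketch below assumes the server started above is listening on localhost:8000 and uses only the standard library. `build_chat_payload` and `send_chat_request` are illustrative helper names chosen for this example, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `send_chat_request(build_chat_payload(prompt, url))["choices"][0]["message"]["content"]` should hold the generated text. Passing `base_url="http://localhost:30000"` targets an SGLang server started on port 30000 instead.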

Use Docker

docker model run hf.co/HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B

SGLang

How to use HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B with Docker Model Runner:
```
docker model run hf.co/HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B
```

Switch-KD-Qwen2.5-CLIP-1.8B

Switch-KD-Qwen2.5-CLIP-1.8B is a compact vision-language model (VLM) trained with Switch-KD (Visual-Switch Knowledge Distillation), a distillation framework from Li Auto's MindKD technology. With about 1.8B parameters, it achieves competitive scores on multimodal benchmarks while remaining small enough for efficient deployment.

Model Details

Base Model: Qwen2.5-1.5B-Instruct
Visual Encoder: CLIP-ViT-L/14-336
Projector: LDPNetV2 (FeatureIRLayer + TokenDownLayer + PosInjectLayer)
Total Parameters: ~1.8B
Image Resolution: 336×336
Context Length: 32,768 tokens
Training Method: Switch-KD distillation with DBiLD Loss
License: Apache 2.0

Architecture

Image (336×336) → CLIP ViT-L/14 (576 tokens) → LDPNetV2 Projector (144 tokens) → Qwen2.5-1.5B

The model combines three components:

CLIP Vision Encoder: extracts visual features from 336×336 images
LDPNetV2 Projector: reduces the visual tokens from 576 to 144 and maps them into the language model's embedding space
Qwen2.5-1.5B: processes both visual and textual inputs for generation
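
The token counts in the pipeline follow from the image and patch sizes. The short sketch below reproduces the arithmetic. The 2× per-side downsampling factor is an assumption inferred from the 576 to 144 reduction, and the function names are illustrative only.

```python
import math


def vit_token_count(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def downsampled_token_count(num_tokens, stride):
    """Tokens left after reducing a square token grid by `stride` along each side."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens or side % stride != 0:
        raise ValueError("tokens must form a square grid divisible by stride")
    return (side // stride) ** 2


clip_tokens = vit_token_count(336, 14)                    # 24 x 24 grid -> 576
projected_tokens = downsampled_token_count(clip_tokens, 2)  # 12 x 12 grid -> 144
```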

Key Results

Switch-KD demonstrates significant improvements over baseline VLM distillation methods:

vs. Align-KD (1.5B Models)

+4.4% average improvement across 6 benchmarks
Uses only 1/3 the training data (1.2M vs 3.6M samples)

Benchmark Performance (Selected)

Benchmark	Score
MME (Perception)	1411.5
MMBench	68.4
GQA	61.9
ScienceQA	71.6
TextVQA	57.0
POPE	87.5

For detailed results, see the Switch-KD paper.

Installation

pip install transformers accelerate torch

Quickstart

Command Line Interface

# Single-round inference
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --question "Please describe this picture."

# Interactive multi-round chat
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive

# With custom settings
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive \
    --max-new-tokens 1024 \
    --torch-dtype fp16

Model Architecture

Visual Encoder (CLIP-ViT-L/14-336)

Hidden size: 1024
Layers: 24
Attention heads: 16
Image size: 336×336
Patch size: 14×14

Projector (LDPNetV2)

Projects 1024-dim visual features to 1536-dim LLM space
Reduces spatial tokens from 576 to 144
Components: FeatureIRLayer → TokenDownLayer → PosInjectLayer
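
To illustrate how a 24×24 token grid can shrink to 12×12, the sketch below applies 2×2 average pooling in plain Python. This assumes the token reduction behaves like average pooling. The model card states only the token counts, so the actual TokenDownLayer may differ. Each grid cell here is a single number standing in for a full feature vector.

```python
def average_pool_grid(grid, stride=2):
    """Average non-overlapping stride x stride windows of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % stride != 0 or cols % stride != 0:
        raise ValueError("grid dimensions must be divisible by stride")
    pooled = []
    for r in range(0, rows, stride):
        pooled_row = []
        for c in range(0, cols, stride):
            window = [
                grid[r + dr][c + dc]
                for dr in range(stride)
                for dc in range(stride)
            ]
            pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# A 24 x 24 grid of placeholder values, one per CLIP patch token (576 total).
token_grid = [[float(r * 24 + c) for c in range(24)] for r in range(24)]
pooled_grid = average_pool_grid(token_grid)  # 12 x 12 grid, 144 tokens
```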

Language Model (Qwen2.5-1.5B)

Hidden size: 1536
Layers: 28
Attention heads: 12 query heads, 2 key-value heads (grouped-query attention)
Context length: 32,768 tokens
Vocabulary: 151,936 tokens
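
The key-value head count matters mainly for memory use at long context. As a rough worked example, the sketch below estimates the KV cache size from the numbers above. It assumes a head dimension of hidden size divided by query heads (1536 / 12 = 128) and 2 bytes per value (fp16 or bf16), and it ignores any runtime overhead such as paging or padding.

```python
def kv_cache_bytes(
    num_tokens,
    num_layers=28,
    num_kv_heads=2,
    hidden_size=1536,
    num_attention_heads=12,
    bytes_per_value=2,
):
    """Estimate KV cache size: keys and values for every layer, KV head, and token."""
    head_dim = hidden_size // num_attention_heads
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


per_token = kv_cache_bytes(1)                  # 28,672 bytes (28 KiB) per token
full_context = kv_cache_bytes(32_768)          # 939,524,096 bytes (896 MiB)
without_gqa = kv_cache_bytes(1, num_kv_heads=12)  # 172,032 bytes, 6x larger
```

Under these assumptions, sharing 2 key-value heads across 12 query heads cuts the cache to one sixth of what 12 key-value heads would need.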

Training

The model is trained with Switch-KD, which combines two key innovations:

Visual-Switch Distillation: Routes the student's visual outputs through the teacher's language pathway for cross-modal knowledge transfer
DBiLD Loss: Dynamic Bi-directional Logits Difference loss with adaptive top-K selection via the Kneedle algorithm
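
To give a feel for adaptive top-K selection, the sketch below finds the knee of a sorted logit curve with a simplified Kneedle-style rule: normalize the curve to the unit square and pick the point that sags furthest below the straight line joining its endpoints. This is an illustration only. It omits the smoothing and sensitivity threshold of the full Kneedle algorithm, and how DBiLD actually applies the knee is not described here, so `knee_index` and `adaptive_top_k` are hypothetical helpers.

```python
def knee_index(values_desc):
    """Index of the knee of a descending curve, or None if no clear knee exists."""
    n = len(values_desc)
    if n < 3:
        return None
    hi, lo = values_desc[0], values_desc[-1]
    if hi == lo:
        return None
    best_index, best_gap = None, 0.0
    for i, value in enumerate(values_desc):
        x = i / (n - 1)
        y = (value - lo) / (hi - lo)
        # Distance of the curve below the chord from (0, 1) to (1, 0).
        gap = (1.0 - x) - y
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


def adaptive_top_k(logits):
    """Choose K as the number of logits ranked before the knee (at least 1)."""
    ranked = sorted(logits, reverse=True)
    knee = knee_index(ranked)
    if knee is None:
        return len(ranked)
    return max(1, knee)


# Three dominant logits followed by a long flat tail.
k = adaptive_top_k([0.2, 10.0, 0.5, 9.0, 0.0, 8.0, 0.1, 1.0])  # → 3
```

When the sorted logits fall off linearly there is no knee, and this sketch keeps every entry.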

Training Configuration

Training data: 1.2M image-text pairs
Optimizer: AdamW with cosine learning rate schedule
Batch size: 64 per GPU
Training epochs: Varies by configuration
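
For reference, a cosine learning rate schedule decays the rate from its peak to a floor along half a cosine wave. The sketch below shows the standard formula. The peak rate of 1e-4, the step count, and the absence of warmup are placeholder assumptions, since the configuration above does not give those values.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step` under cosine decay from peak_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
schedule = [cosine_lr(step, total_steps=1000, peak_lr=1e-4) for step in range(1001)]
```

The schedule starts at the peak rate, passes through half the peak at the midpoint, and reaches the floor at the final step.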

Limitations

The model is trained primarily on English datasets and may perform worse in other languages
Performs best on images similar to the training distribution (natural images, documents, charts)
May struggle with very low-resolution or extremely high-resolution images, since inputs are processed at 336×336
Designed for single-image understanding; not optimized for video

Citation

If you use this model, please cite:

@article{sun2026switchkd,
  title={Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models},
  author={Sun, Haoyi and Wang, Xiaoxiao and Mao, Ning and Wang, Qian and Mu, Lifu and Zheng, Wen and Wei, Tao and Chen, Wei},
  journal={arXiv preprint arXiv:2604.14629},
  year={2026}
}