Keye-VL-2.0-30B-A3B
[GitHub Repository] [Keye-VL-8B-Preview] [Keye-VL-1.5-8B]
Meet Keye-VL-2.0-30B-A3B, the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family.
Highlights
Outstanding Video Understanding and Temporal Localization: Across five video benchmarks, Keye-VL-2.0-30B-A3B leads open-source competitors and matches or surpasses Gemini 3 Flash on temporal grounding.
DSA-Native Long-Context Architecture: Sparse attention and targeted feature aggregation enable precise hour-long video understanding while keeping computation efficient.
High-Efficiency Inference and Training Stack: DSA (DeepSeek Sparse Attention), ExtraIO, heterogeneous ViT-LM parallelism, activation optimization, and custom kernels reduce long-sequence prefill cost and boost training throughput.
Data-Centric Multimodal Pre-Training: A carefully curated data pipeline, Keye-VL-1.5 vision encoder, and synthetic CoT data strengthen perception, OCR/chart/table understanding, and reasoning continuity.
Robust Post-Training for Reliable Reasoning: MOPD, bucket advantage scaling, Context-RL, and high-SNR data filtering improve cross-modal expert merging, reduce hallucinations, and stabilize long-context decisions.
Agent-Ready Multimodal Capabilities: Built-in Code, Tool, and Search agent abilities support repository tasks, API-style tool use, web-grounded search, and visual self-correction workflows.
As the first multimodal model to land DSA in production, Keye-VL-2.0-30B-A3B delivers nearly lossless reasoning over 256K ultra-long context. It tops video understanding benchmarks at its scale and consistently rivals, or surpasses, top-tier closed-source models on fine-grained temporal perception. More importantly, it is the first Keye base model to ship with a built-in Agent collaboration mechanism, demonstrating solid system-level orchestration in Search, Tool, and Code scenarios.
Model Performance on Benchmarks
We compare Keye-VL-2.0-30B-A3B against leading open- and closed-source models (Qwen3.5-35B-A3B, InternVL3.5-241B-A28B, GPT-5-mini, Qwen3-VL 30B-A3B / 32B / 235B-A22B) across seven capability dimensions: Video, Coding, Agent, Math & Reasoning, STEM, Instruction Following, and General VQA.
Selected highlights (see the technical report for the full table):
Fine-grained Temporal Understanding (TimeLens):
- Charades-TimeLens: 58.4 mIoU, on par with the strongest closed-source video baselines we tested (Gemini 3 Flash 61.19).
- ActivityNet-TimeLens: 58.5 mIoU, surpassing Gemini 3 Flash (56.95).
- QVHighlights-TimeLens: 70.1 mIoU, neck-and-neck with the top closed-source models on the official leaderboard and far ahead of Gemini 3 Flash (49.45).
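For readers unfamiliar with the metric, the sketch below shows how a mean temporal IoU score of this kind is typically computed. It assumes mIoU here means the average intersection-over-union between predicted and ground-truth time segments, reported in percent, as is common in temporal grounding. The exact TimeLens evaluation protocol may differ, and the function names are illustrative.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals (0.0 when they do not overlap)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average temporal IoU over paired segments, expressed in percent."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(scores) / len(scores)
```

A prediction of (0, 10) seconds against a ground truth of (5, 15) seconds overlaps for 5 seconds out of a 15-second union, so its IoU is one third.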
Long-Context Scaling (VideoMME V2): Where most competitors degrade as the input frame count grows, our model's accuracy increases from 35.3% at 64 frames to 42.4% at 512 frames; the non-linear reasoning score climbs from 18.5 to 24.2.
Comprehensive Long-Video Understanding:
- LongVideoBench: 74.1, surpassing both Qwen3.5-35B-A3B and the much larger Qwen3-VL-235B-A22B, demonstrating strong long-video understanding at 30B scale.
At 30B scale, Keye-VL-2.0-30B-A3B not only outperforms open-source models with 200B+ parameters (e.g., Qwen3-VL-235B) on temporal understanding, but also goes head-to-head with, and in places exceeds, top closed-source giants.
Quickstart
Related Repositories
- SGLang (custom branch): https://github.com/Kwai-Keye/sglang/tree/keye-vl-v2-30b-release
- DeepGEMM (Keye support): https://github.com/Kwai-Keye/DeepGEMM/tree/keye_support
- EffectiveKernels: https://github.com/Kwai-Keye/EffectiveKernels
Environment Setup
Option 1 (recommended): prebuilt Docker image
docker run -it --gpus all kwaikeye/kwai-keye-vl:keye_vl_v2_30b_a3b
Option 2: install from source
# SGLang (custom branch)
git clone -b keye-vl-v2-30b-release https://github.com/Kwai-Keye/sglang.git
cd sglang
pip install -e "python[all]"
cd ..
# DeepGEMM (Keye support branch)
git clone -b keye_support https://github.com/Kwai-Keye/DeepGEMM.git
cd DeepGEMM
bash install.sh
cd ..
# EffectiveKernels
git clone https://github.com/Kwai-Keye/EffectiveKernels.git
cd EffectiveKernels
pip install -e . --no-deps --no-build-isolation
cd ..
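After a source install, a quick import check can confirm that the packages are visible to the current Python environment. The sketch below is only a sanity check under assumptions: `sglang` and `deep_gemm` are the import names used by the upstream SGLang and DeepGEMM projects, and the Keye branches are assumed to keep them. The import name of EffectiveKernels is not stated in this document, so it is left out.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Return, for each top-level module name, whether it can be located for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names; adjust them if the Keye branches differ.
    for name, found in check_modules(["sglang", "deep_gemm"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only verifies that the modules can be found. It does not verify that the custom kernels were compiled for your GPU.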
Minimal Launch (H800)
python3 -m sglang.launch_server \
--model-path=MODEL_NAME \
--tp-size=2 \
--trust-remote-code \
--mem-fraction-static=0.8
This is a standard SGLang service; call it with any OpenAI-compatible client.
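Loading a 30B-class model can take several minutes, so a client may want to wait until the server answers before sending requests. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route and listens on port 30000 (the upstream SGLang default when `--port` is not passed). The function names, URL, and timeouts are illustrative.

```python
import time
import urllib.request
from typing import Callable


def probe_models_endpoint(base_url: str) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    probe: Callable[[str], bool] = probe_models_endpoint,
) -> bool:
    """Poll the server until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Assumed address; change host and port to match your launch command.
    print("ready" if wait_until_ready("http://localhost:30000", timeout_s=10.0) else "not ready")
```

The `probe` argument is injectable so the polling loop can be exercised without a running server.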
Client Usage
Below are example SGLang inference scripts for both image and video inputs.
All sampling parameters, such as temperature, top_k, and others, are provided for demonstration purposes only and should not be treated as recommended settings. Users are encouraged to experiment with and adjust these parameters based on their own needs.
For video frame-sampling related parameters, users may also customize them as needed. Specifically, min_pixels and max_pixels can be used to set the lower and upper token limits for each frame, while video_total_pixels can be used to limit the total token budget of the entire video input.
If fps is not specified, the default value is 2.0.
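The video example in this section expresses these pixel limits as a token count multiplied by 28*28. The sketch below wraps that arithmetic in a small helper and builds a `preprocess_kwargs` dictionary that leaves out any value you do not set. The helper names are hypothetical, and whether the server accepts a partial dictionary for keys other than `fps` is an assumption.

```python
from typing import Dict, Optional

PIXELS_PER_TOKEN = 28 * 28  # factor used by the video example in this section


def tokens_to_pixels(tokens: int) -> int:
    """Convert a visual token budget into the equivalent pixel budget."""
    return tokens * PIXELS_PER_TOKEN


def build_preprocess_kwargs(
    fps: Optional[float] = None,
    min_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
    total_video_tokens: Optional[int] = None,
) -> Dict[str, float]:
    """Build preprocess_kwargs, omitting every value that was left unset."""
    if min_tokens is not None and max_tokens is not None and min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")
    kwargs: Dict[str, float] = {}
    if fps is not None:
        kwargs["fps"] = fps
    if min_tokens is not None:
        kwargs["min_pixels"] = tokens_to_pixels(min_tokens)
    if max_tokens is not None:
        kwargs["max_pixels"] = tokens_to_pixels(max_tokens)
    if total_video_tokens is not None:
        kwargs["video_total_pixels"] = tokens_to_pixels(total_video_tokens)
    return kwargs
```

For example, a per-frame cap of 256 tokens corresponds to `max_pixels = 256 * 784 = 200704`.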
Image Input
import json

import requests

# Replace MASTER_NODE_IP with the server host. The port must match the --port
# passed to sglang.launch_server (upstream SGLang uses 30000 when it is omitted).
BASE_URL = "http://MASTER_NODE_IP:8000"


def generate(messages):
    """POST a chat completion request and return the parsed JSON response."""
    payload = {
        "model": "",
        "messages": messages,
        "n": 1,
        # Sampling values are for demonstration only, not recommended settings.
        "temperature": 0.0,
        "max_tokens": 256,
        "top_k": 1,
        "ignore_eos": False,
        "skip_special_tokens": True,
    }
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1800,
    )
    resp.raise_for_status()
    return resp.json()


# Example: image + text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"},
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

result = generate(messages)
print(result["choices"][0]["message"]["content"])
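The image example above passes a public URL. For a local file, OpenAI-compatible servers commonly accept a base64 data URL in the same `image_url` field. The sketch below assumes the Keye SGLang branch does too, which this document does not confirm. The helper names are illustrative.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_content_part(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(Path(path).read_bytes(), mime_type)},
    }
```

The returned dictionary can take the place of the URL-based `image_url` entry in the `content` list of a user message.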
Video Input
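import json

import requests

# Replace MASTER_NODE_IP with the server host. The port must match the --port
# passed to sglang.launch_server (upstream SGLang uses 30000 when it is omitted).
BASE_URL = "http://MASTER_NODE_IP:8000"

# Placeholder inputs: illustrative values only, not recommended settings.
video_url = "https://example.com/sample.mp4"  # replace with your video URL
fps = 2.0  # frames sampled per second (2.0 is also the default)
min_token = 16  # lower token limit per frame
max_token = 256  # upper token limit per frame
total_video_token = 16384  # token budget for the whole video


def generate(messages):
    """POST a chat completion request and return the parsed JSON response."""
    payload = {
        "model": "",
        "messages": messages,
        "n": 1,
        "temperature": 0.0,
        "max_tokens": 256,
        "top_k": 1,
        "ignore_eos": False,
        "skip_special_tokens": True,
    }
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1800,
    )
    resp.raise_for_status()
    return resp.json()


# Example: video + text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": video_url,
                    "preprocess_kwargs": {
                        "fps": fps,
                        "min_pixels": min_token * 28 * 28,
                        "max_pixels": max_token * 28 * 28,
                        "video_total_pixels": total_video_token * 28 * 28,
                    },
                },
            },
            {"type": "text", "text": "Describe this video."},
        ],
    },
]

result = generate(messages)
print(result["choices"][0]["message"]["content"])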
