Keye-VL-2.0-30B-A3B (GGUF)
Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family.
Highlights
- Outstanding Video Understanding and Temporal Localization: Across five video benchmarks, Keye-VL-2.0-30B-A3B leads open-source competitors and matches or surpasses Gemini-3-Flash on temporal grounding.
- DSA-Native Long-Context Architecture: Sparse attention and targeted feature aggregation enable precise hour-long video understanding while keeping computation efficient.
- High-Efficiency Inference and Training Stack: DSA (DeepSeek Sparse Attention), ExtraIO, heterogeneous ViT-LM parallelism, activation optimization, and custom kernels reduce long-sequence prefill cost and boost training throughput.
- Data-Centric Multimodal Pre-Training: A carefully curated data pipeline, Keye-VL-1.5 vision encoder, and synthetic CoT data strengthen perception, OCR/chart/table understanding, and reasoning continuity.
- Robust Post-Training for Reliable Reasoning: MOPD, bucket advantage scaling, Context-RL, and high-SNR data filtering improve cross-modal expert merging, reduce hallucinations, and stabilize long-context decisions.
- Agent-Ready Multimodal Capabilities: Built-in Code, Tool, and Search agent abilities support repository tasks, API-style tool use, web-grounded search, and visual self-correction workflows.
As the first multi-modal model to land DSA in production, Keye-VL-2.0-30B-A3B delivers nearly lossless reasoning over 256K ultra-long context. It tops video understanding benchmarks at its scale and consistently rivals — or surpasses — top-tier closed-source models on fine-grained temporal perception. More importantly, it is the first Keye base model to ship with a built-in Agent collaboration mechanism, demonstrating solid system-level orchestration in Search, Tool, and Code scenarios.
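To give a feel for why sparse attention keeps long-context cost down, here is a rough, generic top-k sparse attention sketch for a single query. It is an illustration only and not the Keye or DeepSeek implementation: the function name `topk_sparse_attention` is hypothetical, and the raw attention scores stand in for whatever token-selection mechanism DSA actually uses.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys for one query.

    Assumption: selection reuses the scaled dot-product scores; a production
    sparse-attention design may use a separate, cheaper scorer instead.
    Returns (output_vector, sorted_selected_indices).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]

    # Keep the indices of the k largest scores; the rest are never attended to.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax over the selected scores only.
    peak = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    output = [0.0] * dim
    for weight, i in zip(weights, selected):
        for d in range(dim):
            output[d] += (weight / total) * values[i][d]
    return output, sorted(selected)
```

With `k` equal to the number of keys this reduces to ordinary dense attention; with a fixed, small `k` the softmax and value aggregation no longer grow with sequence length, which is the general idea behind the efficiency claim.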
Model Performance on Benchmarks
We compare Keye-VL-2.0-30B-A3B against leading open- and closed-source models (Qwen3.5-35B-A3B, InternVL3.5-241B-A28B, GPT-5-mini, Qwen3-VL 30B-A3B / 32B / 235B-A22B) across seven capability dimensions: Video, Coding, Agent, Math & Reasoning, STEM, Instruction Following, and General VQA.
Selected highlights (see the technical report for the full table):
Fine-grained Temporal Understanding (TimeLens):
- Charades-TimeLens: 58.4 mIoU, about 2.8 points below Gemini 3 Flash (61.19), among the strongest closed-source video baselines we tested.
- ActivityNet-TimeLens: 58.5 mIoU, surpassing Gemini 3 Flash (56.95).
- QVHighlights-TimeLens: 70.1 mIoU, neck-and-neck with the top closed-source models on the official leaderboard and far ahead of Gemini 3 Flash (49.45).
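For readers unfamiliar with the metric, the sketch below shows how a mean temporal IoU is commonly computed for temporal grounding. It assumes each prediction and ground truth is a single `(start, end)` segment in seconds and that mIoU is the plain average of per-sample IoU, scaled to 0-100; the exact TimeLens evaluation protocol may differ, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time segments."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions, as a 0-100 score."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Made-up segments for illustration only (not benchmark data).
example_preds = [(2.0, 8.0), (10.0, 20.0)]
example_gts = [(4.0, 10.0), (10.0, 20.0)]
example_score = mean_iou(example_preds, example_gts)
```

In this toy case the first pair overlaps for 4 s out of a combined 8 s (IoU 0.5) and the second pair matches exactly (IoU 1.0), so `example_score` is 75.0.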
Long-Context Scaling (VideoMME V2): Where most competitors degrade as the input frame count grows, our model's accuracy increases from 35.3% at 64 frames to 42.4% at 512 frames; the non-linear reasoning score climbs from 18.5 to 24.2.
Comprehensive Long-Video Understanding:
- LongVideoBench: 74.1, surpassing both Qwen3.5-35B-A3B and the much larger Qwen3-VL-235B-A22B, demonstrating strong long-video understanding at 30B scale.
At 30B scale, Keye-VL-2.0-30B-A3B not only outperforms open-source models with 200B+ parameters (e.g., Qwen3-VL-235B) on temporal understanding, but also goes head-to-head with — and in places exceeds — top closed-source giants.
GGUF Model Weights
| Component | Quantization | File | Size |
|---|---|---|---|
| Language Model | BF16 | Keye-VL-2.0-30B-A3B-BF16.gguf | 57 GB |
| Language Model | F16 | Keye-VL-2.0-30B-A3B-F16.gguf | 58 GB |
| Language Model | Q8_0 | Keye-VL-2.0-30B-A3B-Q8_0.gguf | 29 GB |
| Language Model | Q4_K_M | Keye-VL-2.0-30B-A3B-Q4_K_M.gguf | 16 GB |
| Language Model | Q3_K_M | Keye-VL-2.0-30B-A3B-Q3_K_M.gguf | 14 GB |
| Multimodal Projector | BF16 | mmproj-Keye-VL-2.0-30B-A3B-BF16.gguf | 922 MB |
| Multimodal Projector | F16 | mmproj-Keye-VL-2.0-30B-A3B-F16.gguf | 921 MB |
| Multimodal Projector | Q8_0 | mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf | 613 MB |
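As a rough aid for choosing among the files above, the following sketch picks the largest language-model quantization whose file, together with a projector, fits a given memory budget. The sizes are copied from the table; the `headroom_gb` default is an assumed allowance for KV cache and compute buffers, not a measured figure, and file size is only an approximation of runtime memory use.

```python
# File sizes in GB, taken from the table above.
LANGUAGE_MODEL_GB = {"BF16": 57, "F16": 58, "Q8_0": 29, "Q4_K_M": 16, "Q3_K_M": 14}
MMPROJ_GB = {"BF16": 0.922, "F16": 0.921, "Q8_0": 0.613}


def pick_quant(budget_gb, mmproj="Q8_0", headroom_gb=4.0):
    """Return the largest language-model quant that fits the budget, or None.

    headroom_gb is a hypothetical reserve for KV cache and buffers; tune it
    for your context length and hardware.
    """
    available = budget_gb - headroom_gb - MMPROJ_GB[mmproj]
    fitting = {name: size for name, size in LANGUAGE_MODEL_GB.items() if size <= available}
    if not fitting:
        return None
    # Prefer the biggest file that still fits, as a proxy for least quantization loss.
    return max(fitting, key=fitting.get)
```

For example, `pick_quant(24)` returns `"Q4_K_M"` and `pick_quant(48)` returns `"Q8_0"` under these assumptions, while `pick_quant(16)` returns `None` because even Q3_K_M does not leave the assumed headroom.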
Quickstart
Related Repository
- llama.cpp (KeyeVL2 support): https://github.com/Kwai-Keye/llama.cpp.git (keye-vl-v2-30b-release branch)
Build
git clone -b keye-vl-v2-30b-release https://github.com/Kwai-Keye/llama.cpp.git
cd llama.cpp
# CUDA build
cmake -B build-gpu -DGGML_CUDA=ON
cmake --build build-gpu --config Release -j$(nproc)
The server binary will be at build-gpu/bin/llama-server.
Launch Server
./build-gpu/bin/llama-server \
-m Keye-VL-2.0-30B-A3B-Q4_K_M.gguf \
--mmproj mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf \
--host 0.0.0.0 \
--port 8000
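Loading a multi-gigabyte model can take a while, so a client may want to wait until the server is ready before sending requests. The sketch below polls the `/health` endpoint that llama-server provides and treats an HTTP 200 as ready; the helper names, retry count, and delay are illustrative assumptions, and the base URL matches the launch command above.

```python
import time
import urllib.request


def server_ready(base_url="http://localhost:8000", timeout=5.0):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False


def wait_until_ready(probe=server_ready, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready()` right after starting the server blocks for up to about five minutes with these defaults and returns `False` if the server never reports healthy.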
Client Usage
The server exposes an OpenAI-compatible API. Below are examples for image and video inference.
Image Input
import json
import requests
import base64

BASE_URL = "http://localhost:8000"

def generate(messages):
    payload = {
        "model": "KeyeVL2",
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.0,
    }
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1800,
    )
    resp.raise_for_status()
    return resp.json()

# Example: image + text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"},
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
result = generate(messages)
print(result["choices"][0]["message"]["content"])
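The example above passes a remote URL. For a local file, one common approach with OpenAI-compatible vision endpoints is to inline the image as a base64 data URL. The sketch below assumes this server accepts such data URLs in `image_url`; the helper names are hypothetical and not part of any library.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data, mime):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_to_data_url(path):
    """Read a local image file and return it as a data URL."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path.name}")
    return to_data_url(path.read_bytes(), mime)


def image_message(url, prompt):
    """Build a single-turn message list in the same shape as the example above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The result of `image_message(image_to_data_url("photo.jpg"), "Describe this image in detail.")` can be passed as `messages` to the `generate` function shown earlier; `photo.jpg` is a placeholder path.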
Technical Report
For more details, please refer to the Keye-VL-2.0 technical report: arXiv:2606.10651
