Libraries

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF",
	filename="Keye-VL-2.0-30B-A3B-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)
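The call above returns an OpenAI-style completion dictionary rather than a plain string. A minimal sketch of pulling out the reply text, assuming the usual `choices[0].message.content` layout; the helper name `reply_text` and the sample dictionary are illustrative and not part of llama-cpp-python:

```python
def reply_text(completion: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    # Each choice carries a message with a role and the generated content.
    content = choices[0].get("message", {}).get("content")
    return content or ""


# Hand-written sample shaped like a chat completion (not real model output).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "A statue on an island."},
        }
    ]
}
print(reply_text(sample))
```

In practice you would pass the return value of `llm.create_chat_completion(...)` instead of `sample`.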

Local Apps

llama.cpp

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

vLLM

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
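The same request can be issued from Python. The sketch below mirrors the curl call using only the standard library; `build_chat_request` is a hypothetical helper name, and the base URL assumes the vLLM server from the previous step is listening on port 8000:

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network traffic happens.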


Ollama
How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Ollama:
```
ollama run hf.co/Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
```

Unsloth Studio

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF to start chatting

Pi

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
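If `models.json` already lists other providers, editing it by hand risks overwriting them. A minimal sketch of merging the entry above programmatically, assuming the file (when present) holds valid JSON with the same top-level `providers` layout; `add_llama_cpp_provider` is a hypothetical helper and not part of Pi:

```python
import json
import tempfile
from pathlib import Path


def add_llama_cpp_provider(config_path, model_id, base_url="http://localhost:8080/v1"):
    """Merge a llama-cpp provider entry into a models.json file and return the result."""
    config_path = Path(config_path)
    # Start from the existing file so other providers are preserved.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    providers = config.setdefault("providers", {})
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway directory; for real use, point config_path at
# Path.home() / ".pi" / "agent" / "models.json".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "models.json"
    demo_config = add_llama_cpp_provider(
        demo_path, "Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M"
    )
    print(json.dumps(demo_config, indent=2))
```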

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Docker Model Runner:
```
docker model run hf.co/Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M
```

Lemonade

How to use Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Keye-VL-2.0-30B-A3B-GGUF-Q4_K_M

List all available models

lemonade list

Keye-VL-2.0-30B-A3B (GGUF)

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family.

Highlights

- Outstanding Video Understanding and Temporal Localization: Across five video benchmarks, Keye-VL-2.0-30B-A3B leads open-source competitors and matches or surpasses Gemini-3-Flash on temporal grounding.
- DSA-Native Long-Context Architecture: Sparse attention and targeted feature aggregation enable precise hour-long video understanding while keeping computation efficient.
- High-Efficiency Inference and Training Stack: DSA (DeepSeek Sparse Attention), ExtraIO, heterogeneous ViT-LM parallelism, activation optimization, and custom kernels reduce long-sequence prefill cost and boost training throughput.
- Data-Centric Multimodal Pre-Training: A carefully curated data pipeline, Keye-VL-1.5 vision encoder, and synthetic CoT data strengthen perception, OCR/chart/table understanding, and reasoning continuity.
- Robust Post-Training for Reliable Reasoning: MOPD, bucket advantage scaling, Context-RL, and high-SNR data filtering improve cross-modal expert merging, reduce hallucinations, and stabilize long-context decisions.
- Agent-Ready Multimodal Capabilities: Built-in Code, Tool, and Search agent abilities support repository tasks, API-style tool use, web-grounded search, and visual self-correction workflows.

As the first multi-modal model to land DSA in production, Keye-VL-2.0-30B-A3B delivers nearly lossless reasoning over 256K ultra-long context. It tops video understanding benchmarks at its scale and consistently rivals — or surpasses — top-tier closed-source models on fine-grained temporal perception. More importantly, it is the first Keye base model to ship with a built-in Agent collaboration mechanism, demonstrating solid system-level orchestration in Search, Tool, and Code scenarios.

Model Performance on Benchmarks

We compare Keye-VL-2.0-30B-A3B against leading open- and closed-source models (Qwen3.5-35B-A3B, InternVL3.5-241B-A28B, GPT-5-mini, Qwen3-VL 30B-A3B / 32B / 235B-A22B) across seven capability dimensions: Video, Coding, Agent, Math & Reasoning, STEM, Instruction Following, and General VQA.

Selected highlights (see the technical report for the full table):

Fine-grained Temporal Understanding (TimeLens):
- Charades-TimeLens: 58.4 mIoU, about 2.8 points behind Gemini 3 Flash (61.19), one of the strongest closed-source video baselines we tested.
- ActivityNet-TimeLens: 58.5 mIoU, surpassing Gemini 3 Flash (56.95).
- QVHighlights-TimeLens: 70.1 mIoU, neck-and-neck with the top closed-source models on the official leaderboard and far ahead of Gemini 3 Flash (49.45).
Long-Context Scaling (VideoMME V2): Where most competitors degrade as the input frame count grows, our model's accuracy increases from 35.3% at 64 frames to 42.4% at 512 frames; the non-linear reasoning score climbs from 18.5 to 24.2.
Comprehensive Long-Video Understanding:
- LongVideoBench: 74.1, surpassing both Qwen3.5-35B-A3B and the much larger Qwen3-VL-235B-A22B, demonstrating strong long-video understanding at 30B scale.

At 30B scale, Keye-VL-2.0-30B-A3B not only outperforms open-source models with 200B+ parameters (e.g., Qwen3-VL-235B) on temporal understanding, but also goes head-to-head with — and in places exceeds — top closed-source giants.
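To make the size of these gaps concrete, the short sketch below recomputes the margins from the figures quoted above. Nothing here is new data: the numbers are copied from the highlights, and the `margin` helper is purely illustrative.

```python
# TimeLens scores copied from the highlights: (Keye-VL-2.0-30B-A3B, Gemini 3 Flash), in mIoU.
timelens = {
    "Charades-TimeLens": (58.4, 61.19),
    "ActivityNet-TimeLens": (58.5, 56.95),
    "QVHighlights-TimeLens": (70.1, 49.45),
}


def margin(ours, baseline):
    """Difference in points, rounded to two decimals to hide float noise."""
    return round(ours - baseline, 2)


margins = {name: margin(keye, gemini) for name, (keye, gemini) in timelens.items()}
for name, delta in margins.items():
    print(f"{name}: {delta:+.2f} mIoU relative to Gemini 3 Flash")

# VideoMME V2 long-context scaling, 64 frames -> 512 frames.
accuracy_gain = margin(42.4, 35.3)    # overall accuracy, percentage points
nonlinear_gain = margin(24.2, 18.5)   # non-linear reasoning score, points
print(f"Accuracy gain from 64 to 512 frames: {accuracy_gain:+.1f} points")
print(f"Non-linear reasoning gain: {nonlinear_gain:+.1f} points")
```

A negative margin means the model trails the baseline on that benchmark; a positive one means it leads.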

GGUF Model Weights

| Component | Quantization | File | Size |
| --- | --- | --- | --- |
| Language Model | BF16 | `Keye-VL-2.0-30B-A3B-BF16.gguf` | 57 GB |
| Language Model | F16 | `Keye-VL-2.0-30B-A3B-F16.gguf` | 58 GB |
| Language Model | Q8_0 | `Keye-VL-2.0-30B-A3B-Q8_0.gguf` | 29 GB |
| Language Model | Q4_K_M | `Keye-VL-2.0-30B-A3B-Q4_K_M.gguf` | 16 GB |
| Language Model | Q3_K_M | `Keye-VL-2.0-30B-A3B-Q3_K_M.gguf` | 14 GB |
| Multimodal Projector | BF16 | `mmproj-Keye-VL-2.0-30B-A3B-BF16.gguf` | 922 MB |
| Multimodal Projector | F16 | `mmproj-Keye-VL-2.0-30B-A3B-F16.gguf` | 921 MB |
| Multimodal Projector | Q8_0 | `mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf` | 613 MB |
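Which language-model file to download depends mostly on available memory. The sketch below picks the largest file that fits a given budget, using the sizes listed above. The 4 GB default headroom (for the multimodal projector, KV cache, and runtime buffers) is an assumption chosen for illustration, not a measured requirement, and `largest_fitting` is a hypothetical helper.

```python
# (quantization, file name, size in GB) for the language-model files listed above.
LANGUAGE_MODEL_FILES = [
    ("BF16", "Keye-VL-2.0-30B-A3B-BF16.gguf", 57),
    ("F16", "Keye-VL-2.0-30B-A3B-F16.gguf", 58),
    ("Q8_0", "Keye-VL-2.0-30B-A3B-Q8_0.gguf", 29),
    ("Q4_K_M", "Keye-VL-2.0-30B-A3B-Q4_K_M.gguf", 16),
    ("Q3_K_M", "Keye-VL-2.0-30B-A3B-Q3_K_M.gguf", 14),
]


def largest_fitting(budget_gb, headroom_gb=4.0):
    """Return the largest file whose size plus headroom fits the budget, or None."""
    candidates = [
        entry for entry in LANGUAGE_MODEL_FILES if entry[2] + headroom_gb <= budget_gb
    ]
    # File size is used as a rough proxy for how little precision was given up.
    return max(candidates, key=lambda entry: entry[2]) if candidates else None


for budget in (16, 24, 48, 80):
    choice = largest_fitting(budget)
    label = f"{choice[0]} ({choice[1]})" if choice else "nothing fits"
    print(f"{budget} GB budget: {label}")
```

Long-video inputs enlarge the KV cache considerably, so a larger `headroom_gb` is likely appropriate for long contexts.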

Quickstart

Related Repository

llama.cpp (KeyeVL2 support): https://github.com/Kwai-Keye/llama.cpp.git (keye-vl-v2-30b-release branch)

Build

git clone -b keye-vl-v2-30b-release https://github.com/Kwai-Keye/llama.cpp.git
cd llama.cpp

# CUDA build
cmake -B build-gpu -DGGML_CUDA=ON
cmake --build build-gpu --config Release -j$(nproc)

The server binary will be at build-gpu/bin/llama-server.

Launch Server

./build-gpu/bin/llama-server \
    -m Keye-VL-2.0-30B-A3B-Q4_K_M.gguf \
    --mmproj mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf \
    --host 0.0.0.0 \
    --port 8000
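Loading a model of this size can take a while, and requests sent before loading finishes will fail. A minimal readiness check, assuming this build exposes the `GET /health` endpoint that upstream llama.cpp's server provides (it answers with HTTP 200 once the model is loaded); `wait_until_ready` is a hypothetical helper name:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=300.0, poll_s=2.0):
    """Poll {base_url}/health until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet, or still loading (non-200 responses raise HTTPError).
            pass
        time.sleep(poll_s)
    return False


# Typical use, matching the --port passed to llama-server above:
# if not wait_until_ready("http://localhost:8000"):
#     raise SystemExit("llama-server did not become ready in time")
```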

Client Usage

The server exposes an OpenAI-compatible API. The example below shows image inference.

Image Input

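import requests

# Must match the --host/--port passed to llama-server above.
BASE_URL = "http://localhost:8000"

def generate(messages):
    """Send a chat request to the OpenAI-compatible endpoint and return the parsed JSON."""
    payload = {
        "model": "KeyeVL2",
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.0,  # greedy decoding
    }
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json=payload,  # requests serializes the body and sets Content-Type
        timeout=1800,  # generous limit; long inputs can take minutes to process
    )
    resp.raise_for_status()
    return resp.json()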

# Example: image + text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"},
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

result = generate(messages)
print(result["choices"][0]["message"]["content"])
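The example above fetches the image from a public URL. For a local file, a common pattern with OpenAI-compatible servers is to embed the image as a base64 data URL. The sketch below assumes this server accepts such URLs in the `image_url` field; `image_to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a data URL suitable for an image_url field."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, prompt):
    """Build a single-turn message list with one local image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# With the generate() helper defined earlier and a server running:
# result = generate(image_message("photo.jpg", "Describe this image in detail."))
```

Base64 inflates the payload by roughly a third, so large images are better resized before encoding.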

Technical Report

For more details, please refer to the Keye-VL-2.0 technical report: arXiv:2606.10651

Model size: 31B params
Architecture: keye-vl2
Available quantizations: 3-bit, 4-bit, 8-bit, 16-bit