Keye-VL-2.0-30B-A3B (GGUF)

Tech reports: Keye-VL, Keye-VL-1.5, Keye-VL-2.0

Models: Keye-VL-8B-Preview, Keye-VL-1.5-8B, Keye-VL-2.0-30B-A3B

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family.

Highlights

[Figure: Video Benchmark Comparison]

  • Outstanding Video Understanding and Temporal Localization: Across five video benchmarks, Keye-VL-2.0-30B-A3B leads open-source competitors and matches or surpasses Gemini 3 Flash on temporal grounding.

  • DSA-Native Long-Context Architecture: DSA (DeepSeek Sparse Attention) and targeted feature aggregation enable precise hour-long video understanding while keeping computation efficient.

  • High-Efficiency Inference and Training Stack: DSA, ExtraIO, heterogeneous ViT-LM parallelism, activation optimization, and custom kernels reduce long-sequence prefill cost and boost training throughput.

  • Data-Centric Multimodal Pre-Training: A carefully curated data pipeline, the Keye-VL-1.5 vision encoder, and synthetic CoT data strengthen perception, OCR/chart/table understanding, and reasoning continuity.

  • Robust Post-Training for Reliable Reasoning: MOPD, bucket advantage scaling, Context-RL, and high-SNR data filtering improve cross-modal expert merging, reduce hallucinations, and stabilize long-context decisions.

  • Agent-Ready Multimodal Capabilities: Built-in Code, Tool, and Search agent abilities support repository tasks, API-style tool use, web-grounded search, and visual self-correction workflows.

As the first multimodal model to land DSA in production, Keye-VL-2.0-30B-A3B delivers nearly lossless reasoning over a 256K context. It tops video understanding benchmarks at its scale and rivals or surpasses top-tier closed-source models on fine-grained temporal perception. It is also the first Keye base model to ship with a built-in Agent collaboration mechanism, with system-level orchestration across Search, Tool, and Code scenarios.

Model Performance on Benchmarks

We compare Keye-VL-2.0-30B-A3B against leading open- and closed-source models (Qwen3.5-35B-A3B, InternVL3.5-241B-A28B, GPT-5-mini, Qwen3-VL 30B-A3B / 32B / 235B-A22B) across seven capability dimensions: Video, Coding, Agent, Math & Reasoning, STEM, Instruction Following, and General VQA.

[Figure: Performance Comparison]

Selected highlights (see the technical report for the full table):

  • Fine-grained Temporal Understanding (TimeLens):

    • Charades-TimeLens: 58.4 mIoU, about 2.8 points behind Gemini 3 Flash (61.19), the strongest closed-source video baseline we tested.
    • ActivityNet-TimeLens: 58.5 mIoU, surpassing Gemini 3 Flash (56.95).
    • QVHighlights-TimeLens: 70.1 mIoU, neck-and-neck with the top closed-source models on the official leaderboard and far ahead of Gemini 3 Flash (49.45).
  • Long-Context Scaling (VideoMME V2): Where most competitors degrade as the input frame count grows, our model's accuracy increases from 35.3% at 64 frames to 42.4% at 512 frames; the non-linear reasoning score climbs from 18.5 to 24.2.

  • Comprehensive Long-Video Understanding:

    • LongVideoBench: 74.1, surpassing both Qwen3.5-35B-A3B and the much larger Qwen3-VL-235B-A22B, demonstrating strong long-video understanding at 30B scale.
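
For readers unfamiliar with the metric, the mIoU figures above are conventionally a mean of temporal intersection-over-union between predicted and ground-truth time spans. The sketch below assumes that standard definition; the exact TimeLens evaluation protocol is not described in this card and may differ. The function names and sample intervals are illustrative only.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Made-up intervals, purely for illustration (not benchmark data).
preds = [(2.0, 8.0), (10.0, 15.0)]
gts = [(4.0, 10.0), (20.0, 25.0)]
print(f"mIoU = {100 * mean_iou(preds, gts):.1f}")  # mIoU = 25.0
```

Under this definition a prediction that misses the ground-truth span entirely contributes zero, so the mean rewards consistent localization rather than occasional exact hits.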

At 30B scale, Keye-VL-2.0-30B-A3B outperforms open-source models with 200B+ parameters (e.g., Qwen3-VL-235B-A22B) on temporal understanding, and it matches or, on some benchmarks, exceeds leading closed-source models such as Gemini 3 Flash.

GGUF Model Weights

| Component | Quantization | File | Size |
|---|---|---|---|
| Language Model | BF16 | Keye-VL-2.0-30B-A3B-BF16.gguf | 57 GB |
| Language Model | F16 | Keye-VL-2.0-30B-A3B-F16.gguf | 58 GB |
| Language Model | Q8_0 | Keye-VL-2.0-30B-A3B-Q8_0.gguf | 29 GB |
| Language Model | Q4_K_M | Keye-VL-2.0-30B-A3B-Q4_K_M.gguf | 16 GB |
| Language Model | Q3_K_M | Keye-VL-2.0-30B-A3B-Q3_K_M.gguf | 14 GB |
| Multimodal Projector | BF16 | mmproj-Keye-VL-2.0-30B-A3B-BF16.gguf | 922 MB |
| Multimodal Projector | F16 | mmproj-Keye-VL-2.0-30B-A3B-F16.gguf | 921 MB |
| Multimodal Projector | Q8_0 | mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf | 613 MB |
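
When choosing among the files above, one rough starting point is to pick the largest language-model file that fits in available memory alongside a projector. The sketch below does only that, using sizes copied from the table. The `pick_quant` helper and its `headroom_gb` default are hypothetical placeholders, not a recommendation from the authors; actual memory use will exceed file size because of the KV cache and runtime buffers.

```python
# Language-model file sizes in GB, copied from the table above.
LM_FILES_GB = {
    "BF16": 57,
    "F16": 58,
    "Q8_0": 29,
    "Q4_K_M": 16,
    "Q3_K_M": 14,
}
MMPROJ_Q8_0_GB = 0.613  # Q8_0 multimodal projector, 613 MB


def pick_quant(budget_gb, headroom_gb=4.0, mmproj_gb=MMPROJ_Q8_0_GB):
    """Return the largest quantization whose file fits the budget, or None.

    headroom_gb is an arbitrary allowance for KV cache and buffers (assumption).
    """
    usable = budget_gb - headroom_gb - mmproj_gb
    fitting = {name: size for name, size in LM_FILES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for budget in (16, 24, 48, 80):
    print(f"{budget} GB budget -> {pick_quant(budget)}")
```

Long video inputs enlarge the KV cache considerably, so the headroom value should be raised for long-context workloads.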

Quickstart

Related Repository

Build

git clone -b keye-vl-v2-30b-release https://github.com/Kwai-Keye/llama.cpp.git
cd llama.cpp

# CUDA build
cmake -B build-gpu -DGGML_CUDA=ON
cmake --build build-gpu --config Release -j$(nproc)

The server binary will be at build-gpu/bin/llama-server.

Launch Server

./build-gpu/bin/llama-server \
    -m Keye-VL-2.0-30B-A3B-Q4_K_M.gguf \
    --mmproj mmproj-Keye-VL-2.0-30B-A3B-Q8_0.gguf \
    --host 0.0.0.0 \
    --port 8000
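
Loading a model of this size can take a while, so a client may want to wait until the server is ready before sending requests. The sketch below polls a `/health` endpoint, which upstream llama.cpp's `llama-server` provides; that this fork behaves the same way is an assumption. The helper names and the timeout defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while the model loads.
        return False


def wait_until_ready(base_url, timeout_s=600.0, poll_s=2.0, probe=probe_health):
    """Poll until the probe succeeds; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_until_ready("http://localhost:8000")` before the first request returns `True` once the server answers, or `False` after the timeout. The `probe` argument exists so the polling loop can be exercised without a running server.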

Client Usage

The server exposes an OpenAI-compatible API. The example below shows image inference.

Image Input

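import json

import requests

# Must match the --host/--port passed to llama-server.
BASE_URL = "http://localhost:8000"

def generate(messages):
    """POST a chat completion request and return the parsed JSON response."""
    payload = {
        "model": "KeyeVL2",
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.0,  # greedy decoding for reproducible output
    }
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1800,  # generous: long inputs can take minutes to prefill
    )
    resp.raise_for_status()  # raise on HTTP errors instead of parsing an error body
    return resp.json()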

# Example: image + text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"},
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

result = generate(messages)
print(result["choices"][0]["message"]["content"])
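
The example above passes a remote URL. For a local file, OpenAI-style clients commonly embed the image as a base64 data URL in the same `image_url` field. The sketch below assumes this server accepts data URLs, as upstream llama.cpp does; `image_to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, prompt):
    """Build a single-turn user message with one local image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

With the `generate` function defined earlier, a call would look like `generate(image_message("photo.png", "Describe this image in detail."))`, where `photo.png` is a placeholder path.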

Technical Report

For more details, please refer to the Keye-VL-2.0 technical report: arXiv:2606.10651

GGUF metadata: model size 31B params, architecture keye-vl2