🐛 Bug Report: Truncated JSON / Unterminated String in Tool Calling with `kimi-k2.6` + EAGLE3 Speculative Decoding

by bencswong2 - opened 23 days ago

Summary

When serving kimi-k2.6 with the kimi-k2.6-eagle3 draft model (EAGLE3 speculative decoding / MTP) on 8× H20 using vLLM v0.19.1, the model frequently generates corrupted/truncated JSON during tool-calling turns. This causes downstream tool parsers (e.g., OpenCode's bash tool) to fail with JSON Parse error: Unterminated string.

The issue does not reproduce with kimi-k2.5 + kimi-k2.5-eagle3 on the exact same vLLM v0.19.1 image and identical hardware. This isolates the bug to the k2.6 model family (or its EAGLE3 draft model) under speculative decoding.

Note: We also verified the issue persists on the deepseekv4-cu129 pre-release image, but the primary regression is observed on the stable v0.19.1 image.

Environment

Hardware: 8× NVIDIA H20
Serving Engine: vllm/vllm-openai:v0.19.1
Base Model (Fail): /model_files/kimi-k2.6
Draft Model (Fail): /model_files/kimi-k2.6-eagle3 (EAGLE3)
Base Model (Pass): /model_files/Kimi-K2.5
Draft Model (Pass): /model_files/Kimi-K2.5-Eagle3
Tensor Parallelism: 8
Client / Agent: OpenCode (coding agents with tool calling)

Reproduction — Failing Case (k2.6 + EAGLE3) on vLLM v0.19.1

docker run --rm -it --name vllm-kimi-k2.6 \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
    -p 8000:8000 \
    -v /model_files:/model_files \
    vllm/vllm-openai:v0.19.1 \
    --model /model_files/kimi-k2.6 \
    --tensor-parallel-size 8 \
    --served-model-name /model_files/h20/Kimi-K2.5 \
    --compilation-config '{"pass_config": {"fuse_allreduce_rms": true}}' \
    --speculative-config '{"model": "/model_files/kimi-k2.6-eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 16384 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --mm-processor-cache-gb 128 \
    --max-num-seqs 25 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --safetensors-load-strategy=prefetch

Result: Tool calling fails with corrupted JSON output.

Control Experiment — Passing Case (k2.5 + EAGLE3) on vLLM v0.19.1

docker run --rm -it --name vllm-kimi-mtp \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
    -p 8000:8000 \
    -v /ssd1:/model_files \
    vllm/vllm-openai:v0.19.1 \
    --model /model_files/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --served-model-name /model_files/h20/Kimi-K2.5 \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --speculative-config '{"model": "/model_files/Kimi-K2.5-Eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
    --gpu-memory-utilization 0.92 \
    --max-num-batched-tokens 8192 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --mm-processor-cache-gb 128 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --safetensors-load-strategy=prefetch

Result: Tool calling works correctly; JSON is well-formed and complete.

Observed Error

During tool invocation, the model output is truncated inside a string value, causing the parser to fail:

⚙ invalid [tool=bash, error=Invalid input for tool bash: JSON parsing failed: 
Text: {"command": "bash ~/.claude/skills/web-access/scripts/check-deps.sh", "description": "检查 CDP 依赖环境.
Error message: JSON Parse error: Unterminated string]

Key observation: The truncation occurs mid-Unicode-string (inside a Chinese text value). The closing quote is missing, suggesting the token stream was cut at an invalid byte boundary rather than at a valid JSON boundary.

Analysis & Hypothesis

The only meaningful differences between the Pass and Fail configurations are:

Base model: k2.5 → k2.6
Draft model: k2.5-eagle3 → k2.6-eagle3

Everything else — vLLM version (v0.19.1), hardware (8× H20), TP size, EAGLE3 method, and tool-calling flags — is held constant. This strongly suggests the issue is not a generic vLLM speculative-decoding regression, but rather specific to the kimi-k2.6 model (or its EAGLE3 draft model).

We also reproduced the failure on the deepseekv4-cu129 pre-release image, indicating the bug survives across vLLM versions.

Possible cause: EAGLE3 performs token-level speculation. If the draft model and target model disagree on token boundaries—particularly around multi-byte UTF-8 characters (e.g., Chinese text)—the verification/rollback mechanism may truncate the output stream at an invalid byte offset. The k2.6 tokenizer or output distribution may have changed in a way that amplifies this boundary misalignment compared to k2.5.

Requests for Maintainers / Model Authors

Has kimi-k2.6-eagle3 been internally validated with structured generation / tool calling / JSON mode under EAGLE3 speculative decoding?
Have you observed similar truncation issues where speculative decoding cuts UTF-8 strings mid-character?
Is there a recommended generation config or vLLM flag (e.g., adjusting --num-speculative-tokens, disabling chunked prefill, changing acceptance_sampler) to mitigate this until a fix is available?
Could there be a mismatch between the k2.6 base tokenizer and the k2.6-eagle3 draft tokenizer/vocabulary that breaks the speculative path?

Additional Context

The issue is reliable (not flaky) when using k2.6 + eagle3 with coding agents.
Disabling speculative decoding (removing --speculative-config) prevents the issue, confirming the bug is triggered by the MTP path.

Labels: bug, speculative-decoding, eagle3, tool-calling, k2.6

rogerwyf

19 days ago

Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!

bencswong2

14 days ago

Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.

bencswong2

14 days ago

Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!

Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment