Kaiju Coder 7 Runtime-Quantized Local Candidate

This is the current working local quantized variant of Kaiju Coder 7. It is a serving path in which vLLM applies bitsandbytes quantization to the full merged weights at load time; no separately persisted quantized weight artifact has been released yet.

Status

  • Model id: kaiju-coder-7
  • Runtime: gojira/vllm-openai-ray:nightly
  • Quantization mode: vLLM --quantization bitsandbytes
  • Load format: vLLM --load-format bitsandbytes
  • Required launch mode: --language-model-only
  • Required OpenCode launch flag: --enable-auto-tool-choice
  • Required preinstall in this image: pandas
  • Tested contexts: 8192, 16384
  • OpenCode smoke: passed through the local fast proxy
  • Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke pending before public upload
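As a rough illustration of how the settings above fit together, the sketch below assembles a vLLM launch command as an argument list. It is a sketch under stated assumptions, not the command the guarded benchmark script actually runs: the function name `build_vllm_args`, the model path, and the use of `--served-model-name`, `--max-model-len`, and `--port` are illustrative choices, and vLLM normally expects a `--tool-call-parser` value next to `--enable-auto-tool-choice`, which this card does not state and which is therefore left out.

```python
import shlex


def build_vllm_args(model_path: str, context: int = 16384, port: int = 18084) -> list[str]:
    """Assemble a vLLM launch command from the Status settings.

    model_path is a hypothetical location of the merged Kaiju Coder 7 weights.
    """
    return [
        "vllm", "serve", model_path,
        "--served-model-name", "kaiju-coder-7",   # model id from the Status list
        "--quantization", "bitsandbytes",         # runtime quantization mode
        "--load-format", "bitsandbytes",          # matching load format
        "--language-model-only",                  # required launch mode per this card
        "--enable-auto-tool-choice",              # required for OpenCode per this card
        "--max-model-len", str(context),          # tested contexts: 8192, 16384
        "--port", str(port),                      # port used by the benchmark script
    ]


# Print a copy-pasteable command line; the path is a placeholder.
print(shlex.join(build_vllm_args("/path/to/kaiju-coder-7-merged")))
```

Building the command as a list rather than a single string keeps each flag and value separate, so paths containing spaces survive quoting.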

Run

Use the guarded benchmark script from the repo root:

KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
  ./scripts/run-gojira-b-vllm-serving-benchmark.sh

The script stops the merged SGLang service, starts vLLM on port 18084, runs the benchmark, then restores SGLang unless KAIJU_VLLM_KEEP_RUNNING=1 is set. For the current fast OpenCode setup, keep vLLM running and point the fast proxy at port 18084.

KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
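Before pointing the proxy at the vLLM port, it can help to confirm that the server is answering. The sketch below polls the OpenAI-compatible model listing route under the base URL until it responds or a deadline passes. The helper names are hypothetical, the 1200-second default simply mirrors KAIJU_VLLM_READY_TIMEOUT on the assumption that the value is in seconds, and whether the fast proxy itself serves the same route is not stated here.

```python
import time
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def probe(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the model listing route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url, deadline_s=1200.0, interval_s=5.0,
                     probe_fn=probe, sleep_fn=time.sleep) -> bool:
    """Poll until the endpoint is ready or roughly deadline_s seconds of waiting elapse."""
    waited = 0.0
    while True:
        if probe_fn(base_url):
            return True
        if waited >= deadline_s:
            return False
        sleep_fn(interval_s)
        waited += interval_s


# Example call (needs the server to be running):
# wait_until_ready("http://127.0.0.1:18084/v1")
```

The probe and sleep functions are injectable so the polling logic can be exercised without a live server.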

Evidence

Runs:

  • runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md
  • runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md
  • runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md
  • runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md
  • runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md

Runtime            Context  Prompt        OK    Seconds  Chars  Chars/s
vLLM bitsandbytes     8192  identity      True    21.19     26    1.227
vLLM bitsandbytes     8192  code_patch    True    11.31    424   37.489
vLLM bitsandbytes    16384  identity      True    19.51     26    1.333
vLLM bitsandbytes    16384  code_patch    True    11.3     416   36.814
vLLM bitsandbytes    16384  business_doc  True    53.44   1610   30.127
vLLM bitsandbytes    16384  identity      True    19.65     26    1.323
vLLM bitsandbytes    16384  code_patch    True    24.97    997   39.924
vLLM bitsandbytes    16384  business_doc  True    34.46   1615   46.874
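The Chars/s column is Chars divided by Seconds. The short sketch below recomputes it from the rows above; the function name `chars_per_second` is illustrative. A recomputed value can differ from the table in the last decimals, presumably because the Seconds column is itself rounded.

```python
# Rows copied from the benchmark table: (context, prompt, seconds, chars).
ROWS = [
    (8192, "identity", 21.19, 26),
    (8192, "code_patch", 11.31, 424),
    (16384, "identity", 19.51, 26),
    (16384, "code_patch", 11.3, 416),
    (16384, "business_doc", 53.44, 1610),
    (16384, "identity", 19.65, 26),
    (16384, "code_patch", 24.97, 997),
    (16384, "business_doc", 34.46, 1615),
]


def chars_per_second(chars: int, seconds: float) -> float:
    """Output throughput in characters per second, rounded to three decimals."""
    return round(chars / seconds, 3)


for context, prompt, seconds, chars in ROWS:
    print(f"{context:>6} {prompt:<13} {chars_per_second(chars, seconds):>7.3f}")
```

Characters per second is a coarse proxy for token throughput, and short outputs such as the 26-character identity reply are dominated by fixed per-request latency.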

Gojira-B log evidence recorded model load at about 17.8 GiB of memory for both the 8k and 16k bitsandbytes runs. That is roughly 35% of the about 50.22 GiB reported for the full bfloat16 vLLM model load, a meaningful improvement for local serving. The 16k business-document task passed, and the current speed pass keeps the runtime-quantized vLLM service active for OpenCode through the local proxy.

The dedicated website harness/router speed pass produced a complete, checked website in about 7.2 s through vLLM bitsandbytes:

  • Direct website harness: runs/harness/website-speed-pass/avery-stone-vllm.html
  • Router artifact: runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html
  • Local-proxy router artifact: runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html
  • Router checks: complete HTML, required sections, external images, responsive CSS, no lorem ipsum, manifest write
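To make the router checks concrete, here is a minimal sketch of how such checks could be expressed. It is not the router's actual implementation: the function name `check_website`, the convention that sections are identified by `id` attributes, and the specific string tests are all assumptions, and the manifest write is left out because it depends on the router's own file layout.

```python
import re


def check_website(html: str, required_sections: list[str]) -> dict[str, bool]:
    """Run simple string-level checks on a generated page (hypothetical criteria)."""
    lower = html.lower()
    return {
        # Document has opening and closing html and body tags.
        "complete_html": all(tag in lower for tag in ("<html", "</html>", "<body", "</body>")),
        # Every required section appears as an id attribute (assumed convention).
        "required_sections": all(f'id="{name.lower()}"' in lower for name in required_sections),
        # At least one image is loaded from an http(s) URL.
        "external_images": re.search(r'<img[^>]+src=["\']https?://', lower) is not None,
        # A viewport meta tag and at least one media query are present.
        "responsive_css": 'name="viewport"' in lower and "@media" in lower,
        # No placeholder filler text remains.
        "no_lorem_ipsum": "lorem ipsum" not in lower,
    }


sample = (
    '<!doctype html><html><head><meta name="viewport" content="width=device-width">'
    "<style>@media (max-width: 600px) { body { margin: 0; } }</style></head>"
    '<body><section id="hero"></section><section id="contact">'
    '<img src="https://example.com/a.jpg" alt=""></section></body></html>'
)
print(check_website(sample, ["hero", "contact"]))
```

String checks like these are cheap enough to run on every generation; a stricter harness could parse the document with an HTML parser instead.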

OpenCode one-file smoke also passed through the runtime-quantized endpoint:

bash scripts/run_kaiju_quantized_opencode_smoke.sh

Result:

  • Workdir: /tmp/kaiju-opencode-quantized-smoke
  • File: hello.txt
  • Exact content: Kaiju Coder 7 quantized runtime ok
  • OpenCode config: isolated temporary HOME, no global config edit
  • Permission mode: --dangerously-skip-permissions inside the temporary smoke harness only
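The pass condition listed above can be verified with a few lines. This is a sketch of the check, not the contents of the smoke script: the helper name `smoke_passed` is hypothetical, and tolerating a single trailing newline is an assumption about what "exact content" allows.

```python
from pathlib import Path

EXPECTED_CONTENT = "Kaiju Coder 7 quantized runtime ok"


def smoke_passed(workdir: str, filename: str = "hello.txt",
                 expected: str = EXPECTED_CONTENT) -> bool:
    """Return True if the smoke file exists and holds exactly the expected text."""
    path = Path(workdir) / filename
    if not path.is_file():
        return False
    content = path.read_text(encoding="utf-8")
    # Assumption: one trailing newline written by the editor tool is acceptable.
    if content.endswith("\n"):
        content = content[:-1]
    return content == expected


# Example call against the workdir named above:
# smoke_passed("/tmp/kaiju-opencode-quantized-smoke")
```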

Persisted GGUF Candidate

A Q8_0 GGUF candidate now exists on Gojira-B:

/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
  • Size: 27G
  • SHA256: 596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e
  • Conversion evidence: runs/gguf-conversion/20260603T231446Z/gguf-conversion.log
  • Local docs: release/gguf/README.md
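Anyone who receives a copy of the candidate file can compare it against the SHA256 listed above. The sketch below hashes the file in chunks so a 27G artifact does not need to fit in memory; the function names are illustrative.

```python
import hashlib

EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the published value, ignoring hex case."""
    return sha256_of(path) == expected.lower()


# Example call against a local copy of the candidate:
# matches_expected("kaiju-coder-7-Q8_0.gguf")
```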

This is not public quantized-weights release evidence yet. It still needs a runtime smoke that proves identity, business-owner output, and the intended OpenCode/router path under an actual GGUF runtime.

Release Interpretation

This is a working quantized local runtime candidate. It is useful for internal testing, serious GPU users, and the next paid API speed experiments. It is not yet a standalone public quantized weights repo because the only fully smoked path is still the full merged model loaded through bitsandbytes at runtime.

The next release step is to smoke-test the GGUF candidate or package this runtime path as an advanced serving recipe while clearly saying it still requires access to the full Kaiju Coder 7 merged weights.
