Kaiju Coder 7 Runtime-Quantized Local Candidate
This is the current working local quantized variant of Kaiju Coder 7. It is a vLLM serving path that applies bitsandbytes quantization at load time; it is not yet a separately persisted quantized-weight artifact.
Status
- Model id: kaiju-coder-7
- Runtime: gojira/vllm-openai-ray:nightly
- Quantization mode: vLLM --quantization bitsandbytes
- Load format: vLLM --load-format bitsandbytes
- Required launch mode: --language-model-only
- Required OpenCode launch flag: --enable-auto-tool-choice
- Required preinstall in this image: pandas
- Tested contexts: 8192, 16384
- OpenCode smoke: passed through the local fast proxy
- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke pending before public upload
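As a rough illustration of how the settings above combine, the sketch below assembles a `vllm serve` command line from them. It is not the command the guarded benchmark script runs, since that script's contents are not shown here. `MODEL_PATH` is a hypothetical placeholder, and `--served-model-name`, `--max-model-len`, and `--port` are standard vLLM options assumed to apply. vLLM generally expects a `--tool-call-parser` value alongside `--enable-auto-tool-choice`; the parser used for Kaiju Coder 7 is not stated, so it is left out.

```python
import shlex

# Hypothetical placeholder: the location of the merged weights is not stated.
MODEL_PATH = "/path/to/kaiju-coder-7"


def build_vllm_args(context_len: int, port: int = 18084) -> list[str]:
    """Assemble a vLLM launch command from the Status settings (a sketch)."""
    return [
        "vllm", "serve", MODEL_PATH,
        "--served-model-name", "kaiju-coder-7",
        # Runtime quantization: weights are quantized while loading.
        "--quantization", "bitsandbytes",
        "--load-format", "bitsandbytes",
        # Flags listed as required in the Status section.
        "--language-model-only",
        "--enable-auto-tool-choice",
        # Tested context lengths are 8192 and 16384.
        "--max-model-len", str(context_len),
        "--port", str(port),
    ]


# Print a copy-pasteable command line for the 16k context.
print(shlex.join(build_vllm_args(16384)))
```

Whether `--language-model-only` is accepted depends on the vLLM build in the image, so treat the printed command as a starting point to compare against the script.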
Run
Use the guarded benchmark script from the repo root:
KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
./scripts/run-gojira-b-vllm-serving-benchmark.sh
The script stops the merged SGLang service, starts vLLM on port 18084, runs
the benchmark, then restores SGLang unless KAIJU_VLLM_KEEP_RUNNING=1 is set.
For the current fast OpenCode setup, keep vLLM running and point the fast proxy
at port 18084.
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
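Before pointing the proxy at the service, it can help to confirm that vLLM is answering on port 18084. The sketch below queries the OpenAI-compatible model listing endpoint using only the standard library. The function names are illustrative, and the served model id is assumed to be `kaiju-coder-7`.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """Fetch the ids of the models the server currently exposes."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With the service running, `list_model_ids("http://127.0.0.1:18084/v1")` should return a list that includes the served model id. Whether the fast proxy on port 18181 forwards the same endpoint is not stated here.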
Evidence
Runs:
- runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md
- runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md
- runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md
- runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md
- runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md
| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
|---|---|---|---|---|---|---|
| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
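The Chars/s column appears to be the Chars column divided by the Seconds column. The sketch below recomputes it for three rows of the table; the helper name is illustrative.

```python
def chars_per_second(chars: int, seconds: float) -> float:
    """Throughput as output characters divided by wall-clock seconds."""
    return round(chars / seconds, 3)


# (context, prompt, chars, seconds) copied from the table above.
rows = [
    (8192, "identity", 26, 21.19),
    (8192, "code_patch", 424, 11.31),
    (16384, "business_doc", 1610, 53.44),
]

for context, prompt, chars, seconds in rows:
    print(context, prompt, chars_per_second(chars, seconds))
```

A few rows differ from this recomputation in the last digits (for example 997 chars over 24.97 s), presumably because the Seconds column is itself rounded.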
Gojira-B logs recorded a model load of about 17.8 GiB of memory for both the
8k and 16k bitsandbytes runs. That is roughly 65% less than the full bfloat16
vLLM model load, which reported about 50.22 GiB.
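The size of that saving can be worked out from the two reported figures. This is only arithmetic on the approximate numbers above, not a new measurement.

```python
QUANTIZED_GIB = 17.8   # reported bitsandbytes model load
BF16_GIB = 50.22       # reported full bfloat16 model load

saved_gib = BF16_GIB - QUANTIZED_GIB
reduction = saved_gib / BF16_GIB
ratio = BF16_GIB / QUANTIZED_GIB

# prints: saved 32.42 GiB, 64.6% smaller, 2.82x
print(f"saved {saved_gib:.2f} GiB, {reduction:.1%} smaller, {ratio:.2f}x")
```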
The 16k business-document task passed, and the current speed pass keeps the
runtime-quantized vLLM service active for OpenCode through the local proxy.
The dedicated website harness/router speed pass produced a complete checked
website in about 7.2s through vLLM bitsandbytes:
- Direct website harness: runs/harness/website-speed-pass/avery-stone-vllm.html
- Router artifact: runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html
- Local-proxy router artifact: runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html
- Router checks: complete HTML, required sections, external images, responsive CSS, no lorem ipsum, manifest write
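The router's actual check implementation is not shown here. As a guess at what some of the listed checks could look like, the sketch below applies simple string and regex tests to an HTML document. The function name and the heuristics are assumptions. The required-sections and manifest-write checks are omitted because their criteria are not stated.

```python
import re


def check_website(html: str) -> dict[str, bool]:
    """Run a few heuristic checks loosely modeled on the router check list."""
    lower = html.lower()
    return {
        # Complete HTML: both the opening and closing root tags are present.
        "complete_html": "<html" in lower and "</html>" in lower,
        # External images: at least one <img> with an http(s) source.
        "external_images": bool(re.search(r'<img[^>]+src=["\']https?://', lower)),
        # Responsive CSS: a media query or a viewport meta tag.
        "responsive_css": "@media" in lower or 'name="viewport"' in lower,
        # No placeholder filler text.
        "no_lorem_ipsum": "lorem ipsum" not in lower,
    }
```

Passing the contents of one of the index.html artifacts to `check_website` returns one boolean per check, which makes it easy to see which heuristic failed.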
OpenCode one-file smoke also passed through the runtime-quantized endpoint:
bash scripts/run_kaiju_quantized_opencode_smoke.sh
Result:
- Workdir: /tmp/kaiju-opencode-quantized-smoke
- File: hello.txt
- Exact content: Kaiju Coder 7 quantized runtime ok
- OpenCode config: isolated temporary HOME, no global config edit
- Permission mode: --dangerously-skip-permissions inside the temporary smoke harness only
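The pass condition for this smoke comes down to one file with one exact line of content. A minimal sketch of that check follows. It is not taken from the smoke script, and it assumes a single trailing newline is acceptable.

```python
from pathlib import Path

EXPECTED_CONTENT = "Kaiju Coder 7 quantized runtime ok"


def smoke_passed(workdir: str) -> bool:
    """Return True if hello.txt exists in workdir with the expected content."""
    target = Path(workdir) / "hello.txt"
    if not target.is_file():
        return False
    # Assumption: trailing newlines are tolerated; any other difference fails.
    return target.read_text(encoding="utf-8").rstrip("\n") == EXPECTED_CONTENT
```

For the run described above, the call would be `smoke_passed("/tmp/kaiju-opencode-quantized-smoke")`.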
Persisted GGUF Candidate
A Q8_0 GGUF candidate now exists on Gojira-B:
/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
- Size: 27G
- SHA256: 596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e
- Conversion evidence: runs/gguf-conversion/20260603T231446Z/gguf-conversion.log
- Local docs: release/gguf/README.md
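Anyone copying the GGUF file off Gojira-B can compare it against the recorded SHA256 before testing it. The sketch below hashes the file in chunks so that a 27G artifact is never loaded into memory at once. The function names are illustrative.

```python
import hashlib

# Digest recorded for kaiju-coder-7-Q8_0.gguf in the list above.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(path: str) -> bool:
    """Return True if the file's digest equals the recorded SHA256."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

A mismatch would suggest a truncated or altered copy, in which case the conversion log is the place to look before any runtime smoke.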
This is not public quantized-weights release evidence yet. It still needs a runtime smoke that proves identity, business-owner output, and the intended OpenCode/router path under an actual GGUF runtime.
Release Interpretation
This is a working quantized local runtime candidate. It is useful for internal testing, serious GPU users, and the next paid API speed experiments. It is not yet a standalone public quantized weights repo because the only fully smoked path is still the full merged model loaded through bitsandbytes at runtime.
The next release step is to smoke-test the GGUF candidate or package this runtime path as an advanced serving recipe while clearly saying it still requires access to the full Kaiju Coder 7 merged weights.
