Intern-S2-Preview-FP8 GGUF Q4_K_M + vision mmproj
GGUF conversion of internlm/Intern-S2-Preview-FP8 for llama.cpp/TurboQuant serving, with a verified local vision/mmproj path.
This repo packages the multimodal Intern-S2 deployment: the quantized Intern-S2 language GGUF plus a compatible vision projector (mmproj) for image input. Intern-S2's HF checkpoint is a custom multimodal wrapper, so the llama.cpp language GGUF is produced by flattening the Qwen3.5 MoE language config into a CausalLM-compatible GGUF, which is then paired with the compatible Qwen3.6-35B-A3B qwen3vl_merger projector. The verified local runtime uses the patched llama.cpp/TurboQuant build documented below.
Files
- Intern-S2-Preview-FP8-text-Q4_K_M.gguf
  - size: 21,721,180,960 bytes / 20.22 GiB
  - quant: Q4_K_M
  - BPW: 4.89
  - params reported by llama.cpp: 35.52 B, active MoE type 35B.A3B
  - vocab: 251392
  - tensors: 753
- mmproj-Intern-S2-Preview-FP8-F16.gguf
  - size: 899,283,584 bytes / 857.62 MiB
  - source local filename: mmproj-Qwen3.6-35B-A3B-F16.gguf
  - projector: qwen3vl_merger
  - projection dim: 2048, matching Intern-S2 text hidden size
- docs/CONVERSION_AND_SERVING.md: full conversion notes, required converter patches, llama.cpp version, serving config, benchmarks
- docs/VISION_STATUS.md: verified vision setup, runtime patch, API smoke result, caveats
- bench/SUMMARY.md: generation and perplexity smoke summary
- bench/vision_smoke.json: verified image-input API smoke result
- patches/llama-cpp-intern-s2-converter.patch: local converter patch used with the TurboQuant llama.cpp fork
- patches/llama-cpp-mtmd-embd-getrows-fix.patch: local runtime patch required for stable mtmd image embedding evaluation
SHA-256:
cc28d91bb2cad0d591621e6879c8821e3e4481448da9d62e4b22e5c9c780b927 Intern-S2-Preview-FP8-text-Q4_K_M.gguf
71f3cbc1f7cc0f30d09d41cfa924c0060827ebc33bf15ace7e86661e856f0160 mmproj-Intern-S2-Preview-FP8-F16.gguf
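To confirm a download against the digests listed above, a small helper along these lines may be useful. This is a minimal sketch, assuming GNU coreutils `sha256sum` is available; the function name `verify_sha256` is made up for illustration and is not part of this repo.

```shell
# Compare a file's SHA-256 digest with an expected value.
# verify_sha256 is a hypothetical helper name, not shipped with this repo.
verify_sha256() {
  # $1 = expected hex digest, $2 = file to check
  actual=$(sha256sum "$2" | awk '{print $1}')
  if [ "$actual" = "$1" ]; then
    echo "OK  $2"
  else
    echo "MISMATCH  $2" >&2
    return 1
  fi
}

# Example calls using the digests from this README:
# verify_sha256 cc28d91bb2cad0d591621e6879c8821e3e4481448da9d62e4b22e5c9c780b927 Intern-S2-Preview-FP8-text-Q4_K_M.gguf
# verify_sha256 71f3cbc1f7cc0f30d09d41cfa924c0060827ebc33bf15ace7e86661e856f0160 mmproj-Intern-S2-Preview-FP8-F16.gguf
```

The function returns a nonzero status on mismatch, so it can gate a launch script before llama-server starts.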
Upstream model
Base model: https://huggingface.co/internlm/Intern-S2-Preview-FP8
This quant is linked back to the upstream model through the model card base_model metadata and this README. Use the upstream repository for license, intended use, tokenizer provenance, and original model details.
Tested serving mode
Text/MTP alias
Local tested stack:
llama.cpp/TurboQuant fork: https://github.com/TheTom/llama-cpp-turboquant.git
commit: 03ddcd585f1c708dce946c8deb4b7c45133ae3c6
version description: b9020-171-g03ddcd585-dirty
runtime: llama-swap -> llama-server
front door: :9069
backend: llama-server
Example llama-server command:
llama-server \
--model Intern-S2-Preview-FP8-text-Q4_K_M.gguf \
--ctx-size 204800 \
--threads 16 \
--batch-size 256 \
--ubatch-size 64 \
--flash-attn on \
--cont-batching \
--parallel 1 \
--host 0.0.0.0 \
--split-mode layer \
--tensor-split 0.50,0.50 \
-ctk turbo3 \
-ctv turbo3 \
-ngl 999 \
--temp 0 \
--top-k 1 \
--top-p 1 \
--min-p 0 \
--repeat-penalty 1.0 \
--spec-type mtp \
--spec-draft-n-max 1 \
--jinja \
--chat-template-kwargs '{"enable_thinking":false}' \
--reasoning off
Quick smoke:
curl -s http://127.0.0.1:9069/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"intern-s2-mtp","messages":[{"role":"user","content":"Reply exactly: ok"}],"max_tokens":8,"temperature":0,"stream":false}'
A valid MTP smoke should include nonzero timings.draft_n and timings.draft_n_accepted.
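The check described above can be scripted. The sketch below is dependency-free and assumes the response carries `draft_n` and `draft_n_accepted` as plain integer fields, as the sentence above implies; the exact JSON layout is an assumption, and the helper names `timing_field` and `check_mtp` are hypothetical. A real JSON parser such as jq would be more robust where available.

```shell
# Pull one integer field out of a JSON response read from stdin.
# Assumption: the field appears as "name": <integer>; this is a text match,
# not a JSON parser.
timing_field() {
  # $1 = field name, e.g. draft_n
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only if both MTP counters are present and greater than zero.
check_mtp() {
  # $1 = full JSON response text
  drafted=$(printf '%s' "$1" | timing_field draft_n)
  accepted=$(printf '%s' "$1" | timing_field draft_n_accepted)
  [ "${drafted:-0}" -gt 0 ] && [ "${accepted:-0}" -gt 0 ]
}

# Example (same endpoint and alias as the quick smoke):
# resp=$(curl -s http://127.0.0.1:9069/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"intern-s2-mtp","messages":[{"role":"user","content":"Reply exactly: ok"}],"max_tokens":8,"temperature":0,"stream":false}')
# check_mtp "$resp" && echo "MTP active" || echo "MTP counters missing or zero"
```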
Vision alias
Vision is served as a separate conservative alias. It intentionally disables TurboKV/MTP/flash-attn during image bring-up.
llama-server \
--model Intern-S2-Preview-FP8-text-Q4_K_M.gguf \
--mmproj mmproj-Intern-S2-Preview-FP8-F16.gguf \
--no-mmproj-offload \
--ctx-size 4096 \
--threads 16 \
--batch-size 128 \
--ubatch-size 32 \
--flash-attn off \
--parallel 1 \
--split-mode layer \
--tensor-split 0.50,0.50 \
-ctk f16 \
-ctv f16 \
-ngl 999 \
--temp 0 \
--top-k 1 \
--top-p 1 \
--min-p 0 \
--repeat-penalty 1.0 \
--jinja \
--chat-template-kwargs '{"enable_thinking":false}' \
--reasoning off
Verified image smoke response:
A red square with the text "RED SQUARE" in black, centered on a white background.
See docs/VISION_STATUS.md for the exact API request and the local llama.cpp/mtmd runtime patch required to avoid the image-embedding crash.
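For orientation only, an image request against the vision alias might look like the sketch below. It assumes the server accepts OpenAI-style `image_url` content parts with a base64 data URI; the alias `intern-s2-vision`, the file name `red_square.png`, and both helper names are hypothetical. The request that was actually verified is the one in docs/VISION_STATUS.md.

```shell
# Build an OpenAI-style chat payload with one inline PNG.
# Assumption: alias and image path are placeholders chosen for illustration.
build_vision_payload() {
  # $1 = model alias, $2 = path to a PNG file
  b64=$(base64 < "$2" | tr -d '\n')   # strip the line wraps base64 adds
  printf '{"model":"%s","messages":[{"role":"user","content":[' "$1"
  printf '{"type":"text","text":"Describe this image."},'
  printf '{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}' "$b64"
  printf ']}],"max_tokens":64,"temperature":0,"stream":false}'
}

# Post the payload to the llama-swap front door used elsewhere in this README.
vision_smoke() {
  # $1 = model alias, $2 = image path
  build_vision_payload "$1" "$2" |
    curl -s http://127.0.0.1:9069/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d @-
}

# Example (hypothetical alias and file name):
# vision_smoke intern-s2-vision red_square.png
```

Passing the payload on stdin with `-d @-` keeps the base64 string out of the curl argument list, which matters once images grow past a few hundred kilobytes.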
Benchmarks from local smoke
Generation via llama-swap/OpenAI API, deterministic settings (temperature=0, top_k=1, top_p=1):
| prompt | completion tokens | tok/s | MTP accepted |
|---|---|---|---|
| smoke | 3 | 126.51 | 1/1 |
| short_reason | 93 | 111.41 | 37/46 |
| code | 189 | 147.03 | 90/94 |
| longer | 256 | 111.09 | 101/127 |
Perplexity smoke on a small fixed local technical corpus:
PPL = 28.8177 +/- 9.76237
prompt eval: 192 tokens, 1743.28 tokens/s
This PPL is a smoke test only, not a standardized benchmark. Raw artifacts are under bench/.
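The MTP accepted column in the generation table above can be turned into acceptance rates with a few lines of awk. This sketch assumes each entry reads as accepted/drafted tokens, which is an interpretation of the column and not something the table states; `accept_rate` is a hypothetical helper name.

```shell
# Print per-prompt and overall MTP acceptance rates.
# Input on stdin: one "<label> <accepted>/<drafted>" pair per line.
# Assumption: the table's "MTP accepted" column means accepted/drafted.
accept_rate() {
  awk '{
    split($2, f, "/")                 # f[1] = accepted, f[2] = drafted
    acc += f[1]; tot += f[2]
    printf "%s %.1f%%\n", $1, 100 * f[1] / f[2]
  }
  END { if (tot > 0) printf "overall %.1f%%\n", 100 * acc / tot }'
}

# Values copied from the generation table:
accept_rate <<'EOF'
smoke 1/1
short_reason 37/46
code 90/94
longer 101/127
EOF
```

With only four prompts the overall figure is a rough indicator and inherits the same smoke-test caveats as the rest of this section.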
Conversion warning
This model does not convert cleanly with stock llama.cpp as a top-level InternS2PreviewForConditionalGeneration model. Required handling:
- flatten config.json["text_config"] into a llama.cpp-compatible language GGUF shim
- force architectures = ["Qwen3_5MoeForCausalLM"]
- preserve quantization and tokenizer metadata
- ignore packed FP8 expert scale tensors during expert merge
- remap MTP layer tensor names and block IDs
- pad tokenizer metadata from raw config.json vocab size (251392)
See docs/CONVERSION_AND_SERVING.md for exact patch shapes and commands.
Limitations
- This package is the multimodal Intern-S2 GGUF setup: language GGUF plus mmproj-Intern-S2-Preview-FP8-F16.gguf. Vision requires the patched llama.cpp/TurboQuant runtime documented in docs/VISION_STATUS.md.
- Vision is verified with a conservative 4K-context alias. 65K/200K vision has not been revalidated.
- 200K text-context serving uses TurboKV and is beyond the original n_ctx_train = 32768; treat it as an operational smoke, not a full quality guarantee.
- Deterministic serving is recommended until stochastic MTP behavior is separately validated.