how to include the mmproj-F16.gguf

#3
by puzert - opened

I tried to use the mmproj-F16.gguf from https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF.

It doesn't seem to work.

Any advice?

Hey! Happy to help. I just got vision running on this one myself, so let me share exactly what worked for me.

I'm on the charlie12345/rocmfp4-llama fork, branch mtp-rocmfp4-strix (commit 79066b6), running the 27B from plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF. Your mmproj-F16.gguf from unsloth is the right file to pair with it — that part's correct.

Here's the launch command I used (vision + MTP both on):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
./llama-server \
  -m Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  --mmproj mmproj-F16.gguf \
  --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap

(Heads up: HSA_OVERRIDE_GFX_VERSION=11.5.1 is for gfx1151 / Strix Halo — set it to match your GPU.)

The part that trips most people up is how you send the image. I sent it as a base64 data URI in an image_url part, straight to the server:

B64=$(base64 -w0 yourimage.png)
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d "{
  \"messages\":[{\"role\":\"user\",\"content\":[
    {\"type\":\"text\",\"text\":\"What's in this image?\"},
    {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,$B64\"}}
  ]}],\"max_tokens\":250,\"stream\":false}"

That worked first try for me: the log showed "loaded multimodal model" and "image processed in 624 ms", and it described my test image correctly.
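
If you're on Windows, where base64 -w0 isn't available in cmd, roughly the same request can be sent from Python using only the standard library. This is a minimal sketch, not the only way to do it: the function names build_vision_request and ask_about_image are made up for illustration, the default URL assumes the server from the launch command above on port 8080, and the reply parsing assumes the usual OpenAI-style response shape.

```python
import base64
import json
import mimetypes
import urllib.request
from pathlib import Path


def build_vision_request(image_path, prompt, max_tokens=250):
    """Build an OpenAI-style chat request with one text part and one image part."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to PNG if unknown.
    mime = mimetypes.guess_type(path.name)[0] or "image/png"
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The image travels inline as a base64 data URI.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,
        "stream": False,
    }


def ask_about_image(image_path, prompt, base_url="http://127.0.0.1:8080"):
    """POST the request to /v1/chat/completions and return the reply text.

    Assumption: the response follows the OpenAI chat-completions layout.
    """
    body = json.dumps(build_vision_request(image_path, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# print(ask_about_image("yourimage.png", "What's in this image?"))
```

Keeping the payload builder separate from the HTTP call makes it easy to print the JSON and compare it with what your client app sends.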

A few things that might be biting you:

  • If you see find_slot: non-consecutive token position warnings during image processing, no worries — those are harmless, vision still works.
  • If descriptions come out vague, add --image-min-tokens 1024.
  • And if that curl works but it fails through your app (OpenCode or similar), the client's probably stripping the image — those need the model marked vision-capable: "modalities": {"input": ["text","image"], "output": ["text"]}.
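
For the last bullet, one way to check whether a client is stripping the image is to look at the JSON body it actually sends. The sketch below assumes you can capture that body somehow (for example from the client's debug log or a logging proxy, neither of which is covered here); summarize_image_parts is a hypothetical helper name, and it only understands the OpenAI-style message layout used in the curl example.

```python
import json


def summarize_image_parts(request_body):
    """Return (n_images, n_data_uris) found in an OpenAI-style chat request.

    request_body may be a JSON string, JSON bytes, or an already-parsed dict.
    """
    if isinstance(request_body, (str, bytes)):
        payload = json.loads(request_body)
    else:
        payload = request_body
    n_images = 0
    n_data_uris = 0
    for message in payload.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content cannot carry an image part
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                n_images += 1
                url = part.get("image_url", {}).get("url", "")
                if url.startswith("data:image/"):
                    n_data_uris += 1
    return n_images, n_data_uris


# Example: a body where the client dropped the image and kept only the text.
stripped = '{"messages":[{"role":"user","content":"What is in this image?"}]}'
print(summarize_image_parts(stripped))
```

If the count is zero for a request where you attached a screenshot, the problem is on the client side rather than in llama-server or the mmproj.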

Try the direct curl first — if it still doesn't work, send me your launch command and what you're seeing (error, crash, or it just ignoring the image) and we'll get it sorted. 🙂

I am using Windows 11 and ROCm 7.1.

I am on the same commit as you and built it with this:

cmake -S . -B rocm_build ^
    -G "Ninja" ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DGGML_HIP=ON ^
    -DGGML_CUDA=OFF ^
    -DGGML_VULKAN=ON ^
    -DGGML_HIP_FORCE_MMQ=ON ^
    -DHIP_PLATFORM=amd ^
    -DGPU_TARGETS=gfx1151 ^
    -DLLAMA_BUILD_WEBUI=OFF ^
    -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/7.1/bin/clang.exe" ^
    -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/7.1/bin/clang++.exe"

cmake --build rocm_build --config Release

I use the Roo Code plugin to test. Text questions work fine, and it accepts images as well. The only issue is that when I give it a screenshot of console output and ask it to convert it to text, the model returns something that is definitely not in the image.

However, when I build main from https://github.com/ggml-org/llama.cpp, the result is good.

Below are the command I used and the logs:

C:\Users\peter>C:\Users\peter\Desktop\apps\rocmfp4-llama\rocm_build\bin\llama-server -m D:\LLM_Models\Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf ^
More? --alias qwen3.6-27b-rocmfp4-mtp --host 0.0.0.0 --port 8082 ^
More? --mmproj D:\LLM_Models\Qwen3.5-27B-MTP-ROCmFP4-STRIX-imatrix-mmproj-F16.gguf ^
More? --image-min-tokens 1024 ^
More? -ngl 999 -fa on ^
More? -c 262144 -b 2048 -ub 256 -t 16 -tb 16 ^
More? -ctk f16 -ctv f16 ^
More? --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
More? --presence-penalty 0.0 --repeat-penalty 1.0 ^
More? --spec-type draft-mtp --spec-draft-ngl all ^
More? --spec-draft-type-k f16 --spec-draft-type-v f16 ^
More? --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 ^
More? --reasoning on --reasoning-format deepseek ^
More? --chat-template-kwargs "{\"preserve_thinking\":true}" ^
More? --checkpoint-every-n-tokens 1024 --ctx-checkpoints 32 ^
More? --jinja --parallel 1 --metrics --no-mmap
0.00.086.041 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.086.050 I device_info:
0.00.177.171 I   - ROCm0   : AMD Radeon(TM) 8060S Graphics (89976 MiB, 89816 MiB free)
0.00.179.201 I   - Vulkan0 : AMD Radeon(TM) 8060S Graphics (98123 MiB, 93217 MiB free)
0.00.179.209 I   - CPU     : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S           (65174 MiB, 55565 MiB free)
0.00.179.283 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : FORCE_MMQ = 1 | NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | REPACK = 1 |
0.00.179.348 I srv          init: using 31 threads for HTTP server
0.00.179.355 I srv          init: the WebUI is disabled
0.00.179.464 I srv         start: binding port with default address family
0.00.184.313 I srv          main: loading model
0.00.184.325 I srv    load_model: loading model 'D:\LLM_Models\Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf'
0.00.184.411 I common_init_result: fitting params to device memory ...
0.00.184.413 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.15.902.961 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.16.191.816 I srv    load_model: creating MTP draft context against the target model 'D:\LLM_Models\Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf'
0.20.403.020 I srv    load_model: loaded multimodal model, 'D:\LLM_Models\Qwen3.5-27B-MTP-ROCmFP4-STRIX-imatrix-mmproj-F16.gguf'
0.20.403.031 I srv    load_model: initializing slots, n_slots = 1
0.20.503.986 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.20.681.596 I common_speculative_init: adding speculative implementation 'draft-mtp'
0.20.681.979 I srv    load_model: speculative decoding context initialized
0.20.681.982 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
0.20.682.199 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.20.682.201 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.20.682.202 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.20.682.393 I srv          init: idle slots will be saved to prompt cache upon starting a new task
0.20.710.537 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
<think>

</think>

Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.20.725.075 I srv          init: init: chat template, thinking = 1
0.20.725.345 I srv          main: model loaded
0.20.725.350 I srv          main: server is listening on http://0.0.0.0:8082
0.20.725.363 I srv  update_slots: all slots are idle
0.29.066.181 I srv  params_from_: Chat format: peg-native
0.29.068.874 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.29.068.878 I srv  get_availabl: updating prompt cache
0.29.069.014 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.29.069.019 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
0.29.069.021 I srv  get_availabl: prompt cache update took 0.14 ms
0.29.070.424 I reasoning-budget: activated, budget=2147483647 tokens
0.29.070.443 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.35.675.604 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   2048, progress = 0.22, t =   6.61 s / 310.06 tokens per second
0.35.676.565 I slot update_slots: id  0 | task 0 | 1024 tokens since last checkpoint at 0, creating new checkpoint during processing at position 4096
0.35.777.080 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 2047, pos_max = 2047, n_tokens = 2048, size = 157.665 MiB)
0.42.585.593 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   4096, progress = 0.44, t =  13.52 s / 303.07 tokens per second
0.42.586.545 I slot update_slots: id  0 | task 0 | 1024 tokens since last checkpoint at 2048, creating new checkpoint during processing at position 6144
0.42.690.980 I slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 4095, pos_max = 4095, n_tokens = 4096, size = 165.704 MiB)
0.49.791.911 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   6144, progress = 0.67, t =  20.72 s / 296.50 tokens per second
0.49.792.887 I slot update_slots: id  0 | task 0 | 1024 tokens since last checkpoint at 4096, creating new checkpoint during processing at position 8192
0.49.903.122 I slot create_check: id  0 | task 0 | created context checkpoint 3 of 32 (pos_min = 6143, pos_max = 6143, n_tokens = 6144, size = 173.743 MiB)
0.57.258.622 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8192, progress = 0.89, t =  28.19 s / 290.62 tokens per second
0.57.259.497 I slot update_slots: id  0 | task 0 | 1024 tokens since last checkpoint at 6144, creating new checkpoint during processing at position 8975
0.57.369.915 I slot create_check: id  0 | task 0 | created context checkpoint 4 of 32 (pos_min = 8191, pos_max = 8191, n_tokens = 8192, size = 181.782 MiB)
1.00.365.559 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8975, progress = 0.97, t =  31.30 s / 286.79 tokens per second
1.00.429.116 I slot create_check: id  0 | task 0 | created context checkpoint 5 of 32 (pos_min = 8974, pos_max = 8974, n_tokens = 8975, size = 184.856 MiB)
1.01.331.182 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   9231, progress = 1.00, t =  32.26 s / 286.14 tokens per second
1.01.446.109 I slot create_check: id  0 | task 0 | created context checkpoint 6 of 32 (pos_min = 9230, pos_max = 9230, n_tokens = 9231, size = 185.861 MiB)
1.02.711.568 I reasoning-budget: deactivated (natural end)
1.05.405.384 I slot print_timing: id  0 | task 0 | n_decoded =    101, tg =  26.17 t/s
1.07.820.465 I slot print_timing: id  0 | task 0 |
prompt eval time =   32474.22 ms /  9235 tokens (    3.52 ms per token,   284.38 tokens per second)
       eval time =    6275.49 ms /   161 tokens (   38.98 ms per token,    25.66 tokens per second)
      total time =   38749.71 ms /  9396 tokens
draft acceptance rate = 0.79861 (  115 accepted /   144 generated)
1.07.820.495 I statistics draft-mtp: #calls(b,g,a) = 1 48 48, #gen drafts = 48, #acc drafts = 48, #gen tokens = 144, #acc tokens = 115, dur(b,g,a) = 0.002, 1609.184, 0.049 ms
1.07.820.994 I slot      release: id  0 | task 0 | stop processing: n_tokens = 9398, truncated = 0
1.07.821.008 I srv  update_slots: all slots are idle
1.21.324.444 I srv  params_from_: Chat format: peg-native
1.21.326.205 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.868 (> 0.100 thold), f_keep = 0.983
1.21.326.686 I reasoning-budget: activated, budget=2147483647 tokens
1.21.326.892 I slot launch_slot_: id  0 | task 56 | processing task, is_child = 0
1.21.326.914 W slot update_slots: id  0 | task 56 | n_past = 9234, slot.prompt.tokens.size() = 9398, seq_id = 0, pos_min = 9397, n_swa = 0
1.21.326.916 I slot update_slots: id  0 | task 56 | Checking checkpoint with [9230, 9230] against 9234...
1.21.355.161 W slot update_slots: id  0 | task 56 | restored context checkpoint (pos_min = 9230, pos_max = 9230, n_tokens = 9231, n_past = 9231, size = 185.861 MiB)
1.21.932.756 I srv  process_chun: processing image...
1.22.819.241 W init: embeddings required but some input tokens were not marked as outputs -> overriding
1.22.822.364 W find_slot: non-consecutive token position 9388 after 9387 for sequence 0 with 256 new tokens
1.22.822.368 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.22.822.369 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.22.822.370 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.22.822.370 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 56 new tokens
1.22.822.901 W find_slot: non-consecutive token position 9388 after 9387 for sequence 0 with 256 new tokens
1.22.830.164 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.23.674.855 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.24.517.104 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 256 new tokens
1.25.395.258 W find_slot: non-consecutive token position 9388 after 9388 for sequence 0 with 56 new tokens
1.26.253.210 I srv  process_chun: image processed in 4320 ms
1.26.253.225 I srv  process_chun: processing image...
1.27.622.424 I srv  process_chun: image processed in 1370 ms
1.27.625.417 W find_slot: non-consecutive token position 9631 after 9388 for sequence 0 with 172 new tokens
1.27.625.820 W find_slot: non-consecutive token position 9631 after 9388 for sequence 0 with 172 new tokens
1.28.341.195 I slot print_timing: id  0 | task 56 | prompt processing, n_tokens =   1409, progress = 1.00, t =   7.01 s / 200.88 tokens per second
1.28.444.867 I slot create_check: id  0 | task 56 | created context checkpoint 7 of 32 (pos_min = 9631, pos_max = 9631, n_tokens = 10640, size = 191.392 MiB)
1.33.790.569 I slot print_timing: id  0 | task 56 | n_decoded =    101, tg =  19.24 t/s
1.36.900.757 I slot print_timing: id  0 | task 56 | n_decoded =    164, tg =  19.62 t/s
1.38.947.732 I reasoning-budget: deactivated (natural end)
1.39.980.929 I slot print_timing: id  0 | task 56 | n_decoded =    233, tg =  20.37 t/s
1.43.099.326 I slot print_timing: id  0 | task 56 | n_decoded =    310, tg =  21.29 t/s
1.46.103.735 I slot print_timing: id  0 | task 56 | n_decoded =    379, tg =  21.58 t/s
1.49.222.169 I slot print_timing: id  0 | task 56 | n_decoded =    454, tg =  21.95 t/s
1.50.274.785 I slot print_timing: id  0 | task 56 |
prompt eval time =    7212.69 ms /  1413 tokens (    5.10 ms per token,   195.90 tokens per second)
       eval time =   21734.39 ms /   485 tokens (   44.81 ms per token,    22.31 tokens per second)
      total time =   28947.07 ms /  1898 tokens
draft acceptance rate = 0.62897 (  317 accepted /   504 generated)
1.50.274.829 I statistics draft-mtp: #calls(b,g,a) = 2 216 216, #gen drafts = 216, #acc drafts = 192, #gen tokens = 648, #acc tokens = 432, dur(b,g,a) = 0.003, 7265.166, 0.335 ms
1.50.275.566 I slot      release: id  0 | task 56 | stop processing: n_tokens = 11129, truncated = 0
1.50.275.696 I srv  update_slots: all slots are idle

Hey! I'm pretty sure your build is totally fine, and this looks like it's just the mmproj file. Let me walk through what I found.

I went and reproduced your exact setup on my end — same fork commit, MTP, prompt cache, the --checkpoint-every-n-tokens / --ctx-checkpoints flags, --image-min-tokens 1024, image sent as a follow-up turn, on both the ROCm and Vulkan backends. I handed it a console screenshot and asked it to transcribe the text, and it got it right every time. So the fork's vision path is working well — and as I mentioned before, those non-consecutive token position warnings really are harmless, you can safely ignore them. 👍

On the mainline comparison — that one's a bit of a red herring, and it's worth clearing up because it had me scratching my head too. Mainline llama.cpp can't actually load this rocmfp4 model at all; the format doesn't exist there, so it errors out the moment it tries to read the file. So when vision works for you "on mainline," that's a different (standard) model running with its own bundled mmproj — not really the same test, and not a sign anything's wrong with your fork build.

Here's the one thing that jumps out. Look at your --mmproj:

Qwen3.5-27B-MTP-ROCmFP4-STRIX-imatrix-mmproj-F16.gguf

That's not the file I pointed you to — the unsloth one is just mmproj-F16.gguf from the Qwen3.6-27B-MTP repo, and yours says 3.5 with a different naming scheme. Do you remember where that one came from? If the projector is from a different model or version than the 3.6 weights you're running, you'd get exactly this behavior: text is fine, images get "accepted," but the fine details (like the text in a screenshot) come out wrong — because the vision features don't line up with the language model.
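
If you want to see what a given .gguf says about itself before swapping files, the metadata block at the start of the file can be read with a few lines of Python. This is a rough sketch that assumes a little-endian GGUF file of version 2 or newer; read_gguf_metadata is a name I made up, and which keys are present (general.name, general.architecture, clip.* and so on) depends on how the file was converted, so treat the output as a hint rather than proof.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Fixed-size scalar value types: GGUF type id -> little-endian struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path):
    """Return the key/value metadata of a GGUF file as a dict."""
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses a different header layout")
        _read(f, "<Q")  # tensor count, not needed here
        kv_count = _read(f, "<Q")
        meta = {"_gguf_version": version}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "<I")
            meta[key] = _read_value(f, vtype)
        return meta


# Example (paths are placeholders, so the call is left commented out):
# for key, value in read_gguf_metadata("mmproj-F16.gguf").items():
#     if key.startswith(("general.", "clip.")):
#         print(key, "=", value)
```

Running it on both projector files and comparing the general.* entries should at least show whether the two were built from the same source model.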

Could you try grabbing the mmproj straight from the unsloth/Qwen3.6-27B-MTP-GGUF repo, the mmproj-F16.gguf (or mmproj-F32.gguf if it's there), point --mmproj at that exact file, and retest with a fresh image? That's the setup I verified working cleanly.
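
To be sure the file on disk is byte-for-byte the one from the repo (and not a partial download or a renamed copy of something else), you can compare its SHA-256 with the hash listed on the file's page on Hugging Face. A small sketch, with sha256_of_file as an illustrative name and the path as a placeholder:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large .gguf files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (placeholder path, so the call is left commented out):
# print(sha256_of_file(r"D:\LLM_Models\mmproj-F16.gguf"))
```

If the hex string matches the one shown for mmproj-F16.gguf in the repo, the download itself is ruled out as the cause.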

I can't promise that's 100% it until you try it, but it's the one variable that differs from my working setup and it's a two-minute thing to rule out. If it's still off afterward, then we're probably looking at something Windows / ROCm-7.1-specific, and I'm happy to keep digging with you. Fingers crossed it's just the file! 🙂

puzert changed discussion status to closed
