Qwen2.5-VL-3B-Instruct: GGUF Quantizations (VLM)

Quantized GGUF versions of Qwen/Qwen2.5-VL-3B-Instruct

This is a Vision-Language Model (VLM): it can understand both text and images.

Works with llama.cpp · LM Studio · Jan · Ollama

Quantized by Dhptl on June 11, 2026 using quant-kit


This VLM requires TWO files: a text backbone GGUF and the mmproj vision encoder GGUF. Download one text backbone (e.g. Q4_K_M) and the mmproj file. Both must be in the same folder.
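The two-file requirement can be checked before launching a runtime. The sketch below is a hypothetical helper (`find_vlm_files` is not part of quant-kit or llama.cpp). It assumes the vision encoder has `mmproj` in its filename, as in this repository, and treats every other `.gguf` file in the folder as a text backbone.

```python
from pathlib import Path


def find_vlm_files(folder):
    """Return (backbones, mmproj) found in `folder`.

    Hypothetical helper: assumes the vision encoder has "mmproj" in its
    filename and that every other .gguf file is a text backbone.
    """
    ggufs = sorted(Path(folder).glob("*.gguf"))
    mmproj = [p for p in ggufs if "mmproj" in p.name.lower()]
    backbones = [p for p in ggufs if "mmproj" not in p.name.lower()]
    if not backbones:
        raise FileNotFoundError("no text backbone GGUF found")
    if not mmproj:
        raise FileNotFoundError("no mmproj GGUF found")
    return backbones, mmproj[0]
```

Calling `find_vlm_files(".")` in the download folder raises `FileNotFoundError` if either half of the pair is missing.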


📦 Available Files

🔤 Text Backbone (quantized, pick ONE)

| Filename | Size | RAM Required | Quant | Quality | Best For |
|---|---|---|---|---|---|
| Qwen2.5-VL-3B-Instruct-Q2_K.gguf | 1.19 GB | ~2.7 GB | Q2_K | ⭐ | Extreme compression, significant quality loss. |
| Qwen2.5-VL-3B-Instruct-Q3_K_L.gguf | 1.59 GB | ~3.1 GB | Q3_K_L | ⭐⭐⭐ | Slightly better than Q3_K_M, still a compromise. |
| Qwen2.5-VL-3B-Instruct-Q3_K_M.gguf | 1.48 GB | ~3.0 GB | Q3_K_M | ⭐⭐⭐ | Very small file. Quality drop noticeable. |
| Qwen2.5-VL-3B-Instruct-Q3_K_S.gguf | 1.35 GB | ~2.9 GB | Q3_K_S | ⭐⭐ | Very high compression, high quality loss. |
| Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf | 1.80 GB | ~3.3 GB | Q4_K_M ✅ Recommended | ⭐⭐⭐⭐ | Best balance of size and quality. Recommended for most users. |
| Qwen2.5-VL-3B-Instruct-Q4_K_S.gguf | 1.71 GB | ~3.2 GB | Q4_K_S | ⭐⭐⭐½ | Good speed/size balance, slight quality loss. |
| Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf | 2.07 GB | ~3.6 GB | Q5_K_M | ⭐⭐⭐⭐½ | Better quality than Q4, slightly larger. Great if you have the RAM. |
| Qwen2.5-VL-3B-Instruct-Q5_K_S.gguf | 2.02 GB | ~3.5 GB | Q5_K_S | ⭐⭐⭐⭐ | Large but accurate. |
| Qwen2.5-VL-3B-Instruct-Q6_K.gguf | 2.36 GB | ~3.9 GB | Q6_K | ⭐⭐⭐⭐⭐ | Near-perfect quality, very large. |
| Qwen2.5-VL-3B-Instruct-Q8_0.gguf | 3.06 GB | ~4.6 GB | Q8_0 | ⭐⭐⭐⭐⭐ | Closest to original quality. Use when RAM is not a concern. |
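As a rough way to turn the table into a decision, the sketch below picks the largest quantization whose listed RAM estimate fits a given budget. The RAM figures are copied from the table. The function name `pick_quant` and the default 0.5 GB of headroom are illustrative assumptions, not recommendations from quant-kit.

```python
# (quant name, approximate RAM required in GB), copied from the table above
QUANT_RAM_GB = [
    ("Q2_K", 2.7), ("Q3_K_S", 2.9), ("Q3_K_M", 3.0), ("Q3_K_L", 3.1),
    ("Q4_K_S", 3.2), ("Q4_K_M", 3.3), ("Q5_K_S", 3.5), ("Q5_K_M", 3.6),
    ("Q6_K", 3.9), ("Q8_0", 4.6),
]


def pick_quant(available_ram_gb, headroom_gb=0.5):
    """Return the quant with the largest RAM estimate that still fits.

    `headroom_gb` is an assumed safety margin for the OS and other
    processes; returns None when nothing fits.
    """
    budget = available_ram_gb - headroom_gb
    fitting = [(ram, name) for name, ram in QUANT_RAM_GB if ram <= budget]
    return max(fitting)[1] if fitting else None
```

For example, `pick_quant(4.0)` leaves a 3.5 GB budget and returns `"Q5_K_S"`, while `pick_quant(2.0)` returns `None`.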

๐Ÿ–ผ๏ธ Vision Encoder โ€” mmproj (always required, always F16)

Filename Size Notes
Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf 1.25 GB Always F16 โ€” vision encoder is not quantized

โš ๏ธ You need BOTH files โ€” one text backbone + the mmproj โ€” to run this VLM.


⚡ Speed Benchmarks

Run python benchmark.py --model Qwen2.5-VL-3B-Instruct to generate results.


🚀 How to Use

LM Studio (Easiest, GUI)

  1. Search for Dhptl/Qwen2.5-VL-3B-Instruct in LM Studio
  2. Download the Q4_K_M text file and the mmproj file
  3. Load the model; LM Studio automatically uses both files

Ollama

ollama run dhptl/qwen2.5-vl-3b-instruct

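llama.cpp CLI: Text + Image

# Download both files to the same directory, then run the multimodal CLI.
# Recent llama.cpp builds provide it as llama-mtmd-cli, which replaced llama-llava-cli.
./llama-mtmd-cli \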
  -m Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf \
  --mmproj Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf \
  --image /path/to/your/image.jpg \
  -p "Describe this image in detail." \
  -n 512

llama.cpp CLI: Text only (no image)

./llama-cli \
  -m Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf \
  -p "You are a helpful assistant." \
  --conversation

Python: llama-cpp-python

from llama_cpp import Llama
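from llama_cpp.llama_chat_format import Qwen25VLChatHandler

# Load the VLM with its mmproj. Qwen25VLChatHandler applies the Qwen2.5-VL prompt
# format; this assumes a llama-cpp-python release recent enough to include it
# (Llava16ChatHandler would apply the LLaVA 1.6 prompt format instead).
chat_handler = Qwen25VLChatHandler(clip_model_path="./Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf")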
llm = Llama(
    model_path="./Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_gpu_layers=-1,
    n_ctx=4096,
    logits_all=True,
)

# Text + image inference
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text",      "text":      "What do you see in this image?"}
            ]
        }
    ]
)
print(response["choices"][0]["message"]["content"])
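The example above passes a remote URL. For a local image, one common approach is to embed the file as a base64 data URI and pass that string in the same `url` field. The helper below is a minimal sketch; the name `image_to_data_uri` is illustrative, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes


def image_to_data_uri(path):
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result can replace the URL in the message, for example `{"type": "image_url", "image_url": {"url": image_to_data_uri("/path/to/your/image.jpg")}}`.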

๐Ÿ” VLM Architecture

This model uses a two-component architecture:

Component File Purpose
Text Backbone Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf Language understanding & generation
Vision Encoder (mmproj) Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf Image feature extraction (always F16)

Why is mmproj always F16? The vision encoder maps image pixels to token embeddings, and quantizing it tends to degrade image understanding. It therefore stays at F16 (half precision), which is still compact: 1.25 GB for this model, and roughly 1-2 GB for most models.


๐Ÿ” About GGUF Quantization

Format Bits/weight Quality
Q3_K_M ~3.3 โญโญโญ
Q4_K_M ~4.5 โญโญโญโญ โ† recommended
Q5_K_M ~5.6 โญโญโญโญยฝ
Q8_0 ~8.5 โญโญโญโญโญ
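Bits per weight translate into file size by simple arithmetic: parameters × bits ÷ 8 gives bytes. The sketch below applies that to the formats in the table, assuming a nominal 3 billion parameters for the text backbone. The function name `estimate_size_gb` is illustrative, and the results are only ballpark figures: actual GGUF files differ because the parameter count is not exactly 3B and some tensors are stored at a different precision from the nominal one.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Rough file size in GB (10^9 bytes) at a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits/weight values from the table above; 3e9 parameters is a nominal figure.
for name, bpw in [("Q3_K_M", 3.3), ("Q4_K_M", 4.5), ("Q5_K_M", 5.6), ("Q8_0", 8.5)]:
    print(f"{name}: ~{estimate_size_gb(3e9, bpw):.2f} GB")
```

The same arithmetic run in reverse suggests that a 1.25 GB F16 file (16 bits per weight) holds on the order of 0.6 billion parameters.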

💬 Community & Feedback

Found an issue? Open a Discussion in the Community tab.

If useful, please:

  • โญ Star quant-kit on GitHub
  • ๐Ÿ‘ Like this model on HuggingFace