Qwen3-VL-Embedding-2B GGUF — Quantized by BatiAI
GGUF quantizations of Qwen/Qwen3-VL-Embedding-2B — the most-downloaded vision-language embedding model of 2026 (1.64 M downloads on HF). Part of BatiAI's on-device RAG stack for BatiFlow.
What does it do?
A VL (vision-language) embedding model turns text or images into dense vectors. Because both modalities land in the same embedding space, you can:
- Search photos by text — "beach sunset" retrieves matching photos without manual tagging
- Search text by image — drop a screenshot, find similar notes
- Cross-modal RAG — index PDFs, notes, and images together in one vector DB
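As a rough illustration of what a shared embedding space buys you, the sketch below ranks a mixed set of items against a text query by cosine similarity. The file names and the 3-dimensional vectors are made up for readability; real vectors from this model would have 2048 dimensions, and a vector database would normally do the ranking.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, index):
    """Return (item_id, score) pairs sorted from most to least similar."""
    scored = [(item_id, cosine_similarity(query_vec, vec))
              for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values).
index = {
    "photo_beach.jpg": [0.9, 0.1, 0.0],
    "note_meeting.md": [0.0, 0.2, 0.9],
    "scan_invoice.pdf": [0.1, 0.9, 0.2],
}
# Pretend this is the embedding of the text query "beach sunset".
query = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query, index)
for item_id, score in ranking:
    print(f"{item_id}: {score:.3f}")
```

Images, notes, and PDFs all sit in one index here, which is the point: the query does not need to know which modality produced each vector.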
Quick Start
Text embedding (llama.cpp, via Ollama)
ollama pull batiai/qwen3-vl-embed-2b:q8
curl http://localhost:11434/api/embeddings -d '{
"model": "batiai/qwen3-vl-embed-2b:q8",
"prompt": "What is the capital of France?"
}'
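The same request can be made from Python. This is a minimal sketch using only the standard library; it assumes a local Ollama server on the default port and that the response JSON carries the vector under an `embedding` key, as Ollama's `/api/embeddings` endpoint does at the time of writing. The helper names `build_request` and `embed` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # same endpoint as the curl call
MODEL = "batiai/qwen3-vl-embed-2b:q8"                 # same tag as the pull command

def build_request(text, model=MODEL, url=OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text):
    """Send the request and return the embedding as a list of floats.

    Assumes the server is running and the response contains an "embedding" key.
    """
    with urllib.request.urlopen(build_request(text), timeout=60) as resp:
        return json.loads(resp.read())["embedding"]

# Usage (needs a running Ollama server with the model pulled):
# vector = embed("What is the capital of France?")
# print(len(vector))
```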
Image embedding
Image embedding requires llama.cpp's mtmd (multimodal) build. See Qwen3-VL docs for batch image encoding.
Available Quantizations
| File | Quant | Size | Recommended use |
|---|---|---|---|
| Qwen3-VL-Embedding-2B-Q6_K.gguf | Q6_K | ~1.5 GB | balanced (recommended default) |
| Qwen3-VL-Embedding-2B-Q8_0.gguf | Q8_0 | ~1.8 GB | near-lossless embeddings |
Embedding models are sensitive to low-bit quantization: vector quality drops as the bit width falls, so Q6_K is the lowest quantization published here.
Quality note
Direct embedding-quality eval (e.g. MTEB retrieval) is more involved than rerank pairwise testing and takes longer to run locally. Our sibling reranker card shows that Q6_K ↔ Q8_0 drift is negligible (Pearson r = 0.998 on 40 pairs) for the same model family — we expect the embedding model to behave similarly. MTEB/BEIR numbers will be added as measured.
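For readers who want to run a similar drift check themselves, here is a minimal sketch of a Pearson correlation between two sets of scores for the same inputs, one per quantization. The score lists are hypothetical placeholders, not measurements from this repo; in practice they would be, for example, cosine similarities of the same text pairs computed with the Q8_0 and Q6_K files.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical similarity scores for the same five pairs (illustration only).
scores_q8 = [0.91, 0.15, 0.62, 0.78, 0.33]
scores_q6 = [0.90, 0.17, 0.60, 0.79, 0.31]

r = pearson_r(scores_q8, scores_q6)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the two quantizations order and space the pairs almost identically; it says nothing about absolute retrieval quality, which is what MTEB/BEIR would measure.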
Why Qwen3-VL-Embedding?
- SOTA on MTEB — top multilingual embedding model across text + image
- Multilingual — en / ko / ja / zh
- Multimodal — text and image in the same embedding space
- 2048-dim vectors — balance between expressiveness and storage
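To make the storage side of the 2048-dimension trade-off concrete, the sketch below estimates raw index size. It assumes uncompressed vectors stored as float32 (4 bytes per value) or float16 (2 bytes) and ignores any overhead a vector database adds; `index_size_bytes` is an illustrative helper.

```python
EMBEDDING_DIM = 2048  # from the model card

def index_size_bytes(num_vectors, dim=EMBEDDING_DIM, bytes_per_value=4):
    """Raw storage for num_vectors embeddings, excluding index overhead."""
    return num_vectors * dim * bytes_per_value

per_vector_f32 = index_size_bytes(1)                       # 2048 * 4 bytes
per_vector_f16 = index_size_bytes(1, bytes_per_value=2)    # 2048 * 2 bytes
one_million_f32 = index_size_bytes(1_000_000)

print(f"float32 vector: {per_vector_f32} bytes")
print(f"float16 vector: {per_vector_f16} bytes")
print(f"1M float32 vectors: {one_million_f32 / 1e9:.3f} GB")
```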
Why BatiAI?
- Quantized directly from Alibaba's BF16 safetensors
- BatiAI-signed metadata
- Part of a full on-device RAG stack
Technical Details
- Original Model: Qwen/Qwen3-VL-Embedding-2B
- Architecture: Qwen3-VL with pooling head
- Parameters: 2 B (text tower) + vision tower
- Embedding dim: 2048
- Max context: 32 K (text)
- License: Apache 2.0
- Quantized with: llama.cpp
BatiAI's RAG Stack
| Role | Model | HF |
|---|---|---|
| VL Embedding (2 B) | Qwen3-VL-Embedding-2B | this repo |
| Reranker (0.6 B) | Qwen3-Reranker-0.6B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Reranker (4 B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| Chat LLM (35 B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |
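One way the embedding and reranker roles in the table might fit together is a two-stage retrieval: the embedding model narrows the corpus to a few candidates, and the reranker reorders them before anything is passed to the chat model. The sketch below is an assumption about that flow, not BatiFlow's actual pipeline. `toy_embed` and `toy_rerank` are stand-ins for real model calls so the example runs on its own.

```python
def retrieve_then_rerank(query, documents, embed, rerank_score, top_k=2, final_k=1):
    """Stage 1: rank by embedding dot product. Stage 2: reorder with the reranker."""
    query_vec = embed(query)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = sorted(
        documents, key=lambda doc: dot(query_vec, embed(doc)), reverse=True
    )[:top_k]
    reranked = sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:final_k]

# Toy stand-ins (hypothetical): word counts instead of a real embedding model,
# Jaccard word overlap instead of a real reranker.
VOCAB = ["paris", "france", "capital", "berlin", "weather"]

def toy_embed(text):
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def toy_rerank(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the weather is sunny",
]
context = retrieve_then_rerank("capital of france", docs, toy_embed, toy_rerank)
print(context)
```

The returned passages would then be placed in the chat model's prompt as context; that final step is omitted here.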
License
This repository mirrors the upstream Qwen license, Apache 2.0, which permits commercial use.