Qwen3.5-4B Instruct — MTP NVFP4 GGUF
4-bit NVFP4 quantization (E4M3 block scales) of Qwen/Qwen3.5-4B, Qwen's 4B-parameter instruction-tuned multimodal model with a hybrid DeltaNet-Mamba2-Attention architecture and a 262K-token context window. Includes MTP (Multi-Token Prediction) support.
This is an Instruct variant: the embedded chat template has been modified so that thinking (<think> reasoning traces) is disabled by default. Pass enable_thinking=true during inference to enable reasoning.
About NVFP4
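NVFP4 is NVIDIA's native 4-bit floating-point format for Blackwell GPUs. Each weight is stored as an E2M1 element (1 sign, 2 exponent, 1 mantissa bit), and each block of weights shares an FP8 scale in E4M3 format (4 exponent, 3 mantissa bits), which is where the E4M3 label comes from:
| Feature | NVFP4 |
|---|---|
| Format | E2M1 elements (1:2:1) with E4M3 block scales |
| Block size | 128 elements |
| Dynamic range | 15 orders of magnitude (6 exponent bits: 2 in each element, 4 in the block scale) |
| Zero-point | Implicit (true 0) |
| Hardware | Blackwell (RTX 50-series, B200) |
| Dequant cost | None (native support) |
Unlike INT4 formats, which require zero-point restoration and have limited dynamic range, NVFP4 pairs each element's exponent with the exponent of its block scale, which helps preserve outlier-sensitive values. The nominal compression is 4× versus FP16 (4 bits against 16); at the 4.92 bits per weight reported for this file, the effective ratio is about 3.25×.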
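To make the element encoding concrete, the sketch below decodes 4-bit codes under the assumption that elements follow the E2M1 layout NVIDIA documents for NVFP4 (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1) and that a block is dequantized by multiplying each decoded element by one shared scale. The function names, the four-element block, and the example scale are illustrative only and are not taken from llama.cpp.

```python
def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (assumed layout: sign | exp(2) | mantissa(1))."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range with an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def dequantize_block(codes, scale):
    """Multiply every decoded element by the scale shared across the block."""
    return [decode_e2m1(c) * scale for c in codes]


# Every magnitude a single element can represent before scaling.
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))
# → [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# A hypothetical four-element block with a shared scale of 0.25.
print(dequantize_block([0b0111, 0b1001, 0b0010, 0b0000], scale=0.25))
# → [1.5, -0.125, 0.25, 0.0]
```

Only eight magnitudes exist per element, so the per-block scale is what lets a block track the local range of its weights.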
Files
| Filename | Type | Size | Description |
|---|---|---|---|
| qwen35-4b-mtp-nvfp4.gguf | NVFP4 quantized model | ~2.5 GB | Main model weights with MTP head |
| mmproj-qwen35-4b-f16.gguf | F16 multimodal projector | ~644 MB | Vision encoder for image inputs |
Quantization Details
| Aspect | Detail |
|---|---|
| Format | NVFP4 (E4M3) |
| Block size | 128 |
| Bits per weight | 4.92 |
| Hardware target | NVIDIA Blackwell (RTX 5090, RTX 5060 Ti, B200, etc.) |
| VRAM requirement | ~4 GB (model + KV cache) |
| Source format | BF16 (original HF weights) |
| Quantization tool | llama-quantize (commit dd7cad7, CUDA 13.2) |
| MTP layers | 1 (nextn) |
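As a rough consistency check on the figures above, the following sketch estimates the file size from the average bits per weight. It assumes the 3.97B parameter count from the model description applies to the quantized file and uses decimal gigabytes; metadata and any tensors not counted in that figure are ignored.

```python
PARAMS = 3.97e9          # parameter count quoted in the model description
BITS_PER_WEIGHT = 4.92   # average reported for this quantization
FP16_BITS = 16


def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


nvfp4_gb = size_gb(PARAMS, BITS_PER_WEIGHT)
fp16_gb = size_gb(PARAMS, FP16_BITS)
print(f"NVFP4: {nvfp4_gb:.2f} GB, FP16: {fp16_gb:.2f} GB, ratio: {fp16_gb / nvfp4_gb:.2f}x")
# → NVFP4: 2.44 GB, FP16: 7.94 GB, ratio: 3.25x
```

The estimate of about 2.44 GB lines up with the ~2.5 GB listed for the main model file.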
Model Description
Qwen3.5-4B is Qwen's instruction-tuned model featuring:
- 3.97B parameters (dense)
- Hybrid architecture: Gated DeltaNet + Gated Attention + FFN layers
- Mamba2-style SSM via DeltaNet with gating mechanism
- 4 full attention layers at regular intervals (full_attention_interval=4)
- 262K context length (extensible to 1M)
- 248,320 vocabulary (GPT-2 tokenizer with Qwen3.5 pre-tokenizer)
- Vision multimodal: image understanding via cross-attention projector
- MTP (Multi-Token Prediction): trained with multi-step prediction for improved generation
The GGUF uses the QWEN35 architecture handler from llama.cpp with full support for all hybrid layer types.
Instruct Variant: Thinking Disabled by Default
The original Qwen3.5 chat template enables thinking by default — it outputs <think>\n at the start of every assistant response. This repository's GGUF ships with a modified chat template where the default behavior is inverted:
| Scenario | Behavior |
|---|---|
| enable_thinking not set | ❌ Thinking off: outputs <think>\n\n</think>\n\n (empty think block) |
| enable_thinking=true | ✅ Thinking on: outputs <think>\n (reasoning trace expected) |
| enable_thinking=false | ❌ Thinking off |
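The table can be read as a small decision rule. The sketch below mirrors it in Python; it is not the Jinja template shipped in the GGUF, and the function name is hypothetical. For enable_thinking=false the table only states that thinking is off, so the sketch assumes the same empty think block is emitted in that case.

```python
EMPTY_THINK_BLOCK = "<think>\n\n</think>\n\n"  # pre-closed block: no reasoning trace
OPEN_THINK_TAG = "<think>\n"                   # open tag: the model writes a trace


def assistant_prefix(enable_thinking=None):
    """Return the text the template places at the start of the assistant turn."""
    if enable_thinking is True:
        return OPEN_THINK_TAG
    # Unset and False both leave thinking off (assumed to emit the same prefix).
    return EMPTY_THINK_BLOCK


for setting in (None, True, False):
    print(repr(setting), "->", repr(assistant_prefix(setting)))
```

In the original Qwen3.5 template the unset case would fall on the other branch, which is the inversion described above.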
Usage
llama.cpp CLI
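# Text-only inference (thinking off by default)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Explain quantum computing in simple terms" -n 512
# With thinking enabled: enable_thinking=true must also reach the chat template.
# llama-server accepts --chat-template-kwargs '{"enable_thinking":true}' for this;
# run llama-cli --help to check whether the CLI in your build accepts it too.
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Solve this math problem step by step" -n 512
# Multimodal (image input)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  --mmproj mmproj-qwen35-4b-f16.gguf \
  --image path/to/image.jpg -p "Describe this image" -n 256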
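When the model is served through llama-server instead of the CLI, thinking can be toggled per request. The sketch below builds such a request with the standard library only. It assumes a server listening on http://localhost:8080 and a llama.cpp build whose chat completions endpoint forwards a chat_template_kwargs field to the template; both are assumptions to verify against your build, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking, base_url="http://localhost:8080"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        # Assumption: the server passes these kwargs through to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once a server is running:
# print(send(build_chat_request("Solve this math problem step by step", True)))
```

Keeping request construction separate from sending makes the payload easy to inspect before any server is involved.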
Download via huggingface-hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF",
    filename="qwen35-4b-mtp-nvfp4.gguf",
)
Conversion Pipeline
- Downloaded original BF16 weights from Qwen/Qwen3.5-4B
- Converted to F16 GGUF with MTP tensors included
- Extracted vision projector as separate mmproj F16 GGUF
- Quantized to NVFP4 via llama-quantize.exe NVFP4
- Patched chat template for thinking-disabled-by-default
- Uploaded to HuggingFace Hub
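The conversion and quantization steps above might look roughly like the commands printed by this sketch. The exact flags used for this repository are not given, so everything here is an assumption: the local directory name and the intermediate file name are hypothetical, the converter flags are those of llama.cpp's convert_hf_to_gguf.py as commonly documented, and the NVFP4 type name is taken from the step list. The chat template patch and the upload are left out.

```python
import shlex

HF_DIR = "Qwen3.5-4B"                  # hypothetical local copy of the BF16 weights
F16_GGUF = "qwen35-4b-mtp-f16.gguf"    # hypothetical intermediate file name


def pipeline_commands(hf_dir=HF_DIR):
    """Return the assumed command lines for conversion, mmproj export, and quantization."""
    return [
        # Convert the Hugging Face checkpoint to an F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir,
         "--outtype", "f16", "--outfile", F16_GGUF],
        # Export the vision projector as a separate F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir,
         "--outtype", "f16", "--mmproj", "--outfile", "mmproj-qwen35-4b-f16.gguf"],
        # Quantize the F16 GGUF to NVFP4 (requires a build that knows this type).
        ["llama-quantize", F16_GGUF, "qwen35-4b-mtp-nvfp4.gguf", "NVFP4"],
    ]


for cmd in pipeline_commands():
    print(shlex.join(cmd))
```

The script only prints the commands, so they can be reviewed and adjusted before anything is run.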
Verification
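from gguf import GGUFReader

# Open the quantized file; the reader memory-maps it instead of loading all weights.
r = GGUFReader("qwen35-4b-mtp-nvfp4.gguf")
print(f"Tensors: {len(r.tensors)}")
# parts[-1] is a NumPy array holding the value, so take its first element as a plain int.
mtp_layers = int(r.fields["qwen35.nextn_predict_layers"].parts[-1][0])
print(f"MTP layers: {mtp_layers}")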
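A further check, sketched below, tallies tensors by storage type and derives the average bits per weight, which can be compared with the 4.92 figure in the quantization details. The summarize helper is hypothetical; it relies only on the tensor_type, n_bytes, and n_elements attributes that GGUFReader exposes on each tensor, and it is run here on stand-in objects so that it works without the model file.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(tensors):
    """Count tensors per storage type and compute the average bits per weight."""
    counts = Counter()
    total_bytes = 0
    total_elements = 0
    for t in tensors:
        counts[t.tensor_type.name] += 1
        total_bytes += int(t.n_bytes)
        total_elements += int(t.n_elements)
    bpw = total_bytes * 8 / total_elements if total_elements else 0.0
    return counts, bpw


# Stand-in tensors so the sketch runs without the model file.
fake = [
    SimpleNamespace(tensor_type=SimpleNamespace(name="F32"), n_bytes=4 * 8, n_elements=8),
    SimpleNamespace(tensor_type=SimpleNamespace(name="F16"), n_bytes=2 * 24, n_elements=24),
]
counts, bpw = summarize(fake)
print(dict(counts), f"{bpw:.2f} bits per weight")
# → {'F32': 1, 'F16': 1} 20.00 bits per weight
```

With the real file, the call would be summarize(GGUFReader("qwen35-4b-mtp-nvfp4.gguf").tensors), provided the installed gguf package recognizes the NVFP4 tensor type.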
Hardware
| Component | Detail |
|---|---|
| GPU | NVIDIA Blackwell (RTX 5060 Ti) |
| CUDA Toolkit | 13.2 |
| System RAM | 64 GB |
License
Apache-2.0 (same as the original Qwen3.5-4B model).