Instructions for using FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with libraries and local apps.

Libraries

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF",
	filename="qwen35-4b-mtp-nvfp4.gguf",  # main model weights, not the mmproj vision projector
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Use Docker

docker model run hf.co/FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

vLLM

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Ollama:
```
ollama run hf.co/FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
```

Unsloth Studio

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF to start chatting

Pi

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16
```

Lemonade

How to use FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF:F16

Run and chat with the model

lemonade run user.Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF-F16

List all available models

lemonade list

Qwen3.5-4B Instruct — MTP NVFP4 GGUF

NVFP4 4-bit quantization of Qwen/Qwen3.5-4B, Qwen's 4B-parameter instruction-tuned multimodal model with a hybrid DeltaNet-Mamba2-Attention architecture and a 262K-token context window. Includes MTP (Multi-Token Prediction) support.

This is an Instruct variant: the embedded chat template has been modified so that thinking (<think> reasoning traces) is disabled by default. Pass enable_thinking=true during inference to enable reasoning.

About NVFP4

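NVFP4 is NVIDIA's native 4-bit floating-point format for Blackwell GPUs. Each element is an E2M1 value (1 sign, 2 exponent, 1 mantissa bit), and each block of elements shares an FP8 scale factor in E4M3 (1 sign, 4 exponent, 3 mantissa bits):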

Feature	NVFP4
Format	E2M1 elements (1:2:1) with FP8 E4M3 (1:4:3) block scales
Block size	128 elements
Dynamic range	Element magnitudes 0 to 6, extended by the per-block scale
Zero-point	Implicit (true 0)
Hardware	Blackwell (RTX 50-series, B200)
Dequant cost	None (native support)

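Unlike INT4 formats that require zero-point restoration and have limited dynamic range, NVFP4's floating-point elements and per-block scales preserve outlier-sensitive values while achieving a nominal 4× compression vs FP16 (about 3.25× at the 4.92 bits per weight reported for this file).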
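To make the block-scaled idea concrete, here is a minimal pure-Python sketch of NVFP4-style quantization. It assumes the element grid from NVIDIA's published format (E2M1 magnitudes 0 to 6) and a default block size of 16 as in that specification; this card lists 128 for this file, which can be passed as `block_size` instead. The scale is kept as a Python float rather than rounded to FP8 (E4M3), and the function names are hypothetical, so treat it as an illustration and not as the llama.cpp implementation.

```python
# Magnitudes representable by a 4-bit E2M1 element (the sign bit is separate).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block to (scale, codes); each code is (is_negative, index)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # Assumption: real NVFP4 stores this scale in FP8 (E4M3); here it stays a float.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - target))
        codes.append((x < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a quantized block."""
    return [(-1.0 if neg else 1.0) * E2M1_VALUES[idx] * scale for neg, idx in codes]


def quantize(values, block_size=16):
    """Split a flat list into blocks and quantize each one independently."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]
```

Because every block carries its own scale, a single outlier only coarsens the values that share its block, which is the property the comparison with INT4 relies on.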

Files

Filename	Type	Size	Description
`qwen35-4b-mtp-nvfp4.gguf`	NVFP4 quantized model	~2.5 GB	Main model weights with MTP head
`mmproj-qwen35-4b-f16.gguf`	F16 multimodal projector	~644 MB	Vision encoder for image inputs

Quantization Details

Aspect	Detail
Format	NVFP4 (E2M1 elements, E4M3 block scales)
Block size	128
Bits per weight	4.92
Hardware target	NVIDIA Blackwell (RTX 5090, RTX 5060 Ti, B200, etc.)
VRAM requirement	~4 GB (model + KV cache)
Source format	BF16 (original HF weights)
Quantization tool	llama-quantize (commit dd7cad7, CUDA 13.2)
MTP layers	1 (nextn)
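As a rough cross-check of the figures in this table, the arithmetic below estimates weight storage from the reported 4.92 bits per weight and the 3.97B parameter count given in the model description. It is a back-of-the-envelope sketch: it assumes both figures cover the same set of tensors, which the card does not state, and the helper names are hypothetical.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def compression_ratio(bits_per_weight, baseline_bits=16):
    """Compression relative to a baseline precision (FP16 by default)."""
    return baseline_bits / bits_per_weight


size_gb = estimated_size_gb(3.97e9, 4.92)
ratio = compression_ratio(4.92)
print(f"Estimated weights: {size_gb:.2f} GB")   # Estimated weights: 2.44 GB
print(f"Compression vs FP16: {ratio:.2f}x")     # Compression vs FP16: 3.25x
```

The estimate lands close to the ~2.5 GB listed for the main GGUF file; the remaining VRAM in the ~4 GB figure would go to the KV cache and runtime buffers.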

Model Description

Qwen3.5-4B is Qwen's instruction-tuned model featuring:

3.97B parameters (dense)
Hybrid architecture: Gated DeltaNet + Gated Attention + FFN layers
Mamba2-style SSM via DeltaNet with gating mechanism
Full attention on every 4th layer (full_attention_interval=4)
262K context length (extensible to 1M)
248,320-token vocabulary (GPT-2-style BPE tokenizer with Qwen3.5 pre-tokenizer)
Vision multimodal: image understanding via cross-attention projector
MTP (Multi-Token Prediction): trained with multi-step prediction for improved generation

The GGUF uses the QWEN35 architecture handler from llama.cpp with full support for all hybrid layer types.
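The interleaving implied by `full_attention_interval=4` can be sketched as follows. This assumes the same convention Qwen's earlier hybrid configs use in Hugging Face Transformers, where every interval-th layer (counting from 1) is full attention and the rest are linear-attention (DeltaNet) layers. The layer count in the usage line is a hypothetical value for illustration, not the model's actual depth.

```python
def layer_types(num_layers, full_attention_interval=4):
    """Return the assumed attention type for each layer index (0-based)."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


# Hypothetical 8-layer stack: layers 3 and 7 (0-based) use full attention.
types = layer_types(8)
full_layers = [i for i, t in enumerate(types) if t == "full_attention"]
print(full_layers)
```

Under this convention, a quarter of the layers hold a growing KV cache, while the remaining layers keep fixed-size recurrent state.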

Instruct Variant: Thinking Disabled by Default

The original Qwen3.5 chat template enables thinking by default — it outputs <think>\n at the start of every assistant response. This repository's GGUF ships with a modified chat template where the default behavior is inverted:

Scenario	Behavior
`enable_thinking` not set	❌ Thinking off — outputs `<think>\n\n</think>\n\n` (empty think block)
`enable_thinking=true`	✅ Thinking on — outputs `<think>\n` (reasoning trace expected)
`enable_thinking=false`	❌ Thinking off
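When talking to a running `llama-server`, the flag can be set per request. The sketch below builds an OpenAI-style chat payload and separates a `<think>` trace from the final answer. It assumes the server accepts a `chat_template_kwargs` field in the request body and listens on port 8080 (the port used in the Pi and Hermes configurations on this page); the helper names are hypothetical, and depending on server settings the trace may arrive in a separate response field instead of inline tags.

```python
import json
import re
import urllib.request


def build_chat_request(prompt, enable_thinking=None):
    """Build a chat-completions payload; leave enable_thinking unset to use the template default."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    if enable_thinking is not None:
        # Assumption: the server forwards these kwargs to the chat template.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from raw assistant text containing an optional think block."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def send(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to a local OpenAI-compatible server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `send(build_chat_request("Solve 17 * 23", enable_thinking=True))` would request a reasoning trace, while omitting the argument keeps this repository's thinking-off default.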

Usage

llama.cpp CLI

# Text-only inference (thinking off by default)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Explain quantum computing in simple terms" -n 512

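# With thinking enabled: enable_thinking=true must be passed to the chat template.
# --jinja and --chat-template-kwargs are assumed from recent llama.cpp builds;
# check `llama-cli --help` for your version.
llama-cli -m qwen35-4b-mtp-nvfp4.gguf --jinja \
  --chat-template-kwargs '{"enable_thinking": true}' \
  -p "Solve this math problem step by step" -n 512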

# Multimodal (image input)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  --mmproj mmproj-qwen35-4b-f16.gguf \
  --image path/to/image.jpg -p "Describe this image" -n 256

Download via huggingface-hub

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF",
    filename="qwen35-4b-mtp-nvfp4.gguf",
)
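For image inputs, both files from the Files table are needed. A small sketch that fetches the main weights and the vision projector together is shown here; `fetch_model_files` and its `download` parameter are hypothetical names, with the parameter added only so the function can be exercised without network access.

```python
REPO_ID = "FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF"
FILES = {
    "model": "qwen35-4b-mtp-nvfp4.gguf",     # NVFP4 weights with MTP head
    "mmproj": "mmproj-qwen35-4b-f16.gguf",   # F16 vision projector
}


def fetch_model_files(download=None):
    """Download both GGUF files and return a dict mapping role to local path."""
    if download is None:
        # Imported lazily so the module loads even without huggingface_hub installed.
        from huggingface_hub import hf_hub_download as download
    return {
        role: download(repo_id=REPO_ID, filename=name)
        for role, name in FILES.items()
    }
```

Calling `fetch_model_files()` returns paths in the local Hugging Face cache, which can then be passed to `-m` and `--mmproj`.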

Conversion Pipeline

Downloaded original BF16 weights from Qwen/Qwen3.5-4B
Converted to F16 GGUF with MTP tensors included
Extracted vision projector as separate mmproj F16 GGUF
Quantized to NVFP4 via llama-quantize.exe NVFP4
Patched chat template for thinking-disabled-by-default
Uploaded to HuggingFace Hub
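The conversion and quantization steps could be scripted roughly as follows. This is one plausible reconstruction, not the author's actual commands: it assumes llama.cpp's `convert_hf_to_gguf.py` with its `--outtype`, `--outfile`, and `--mmproj` options and the `llama-quantize <input> <output> <type>` calling convention. The intermediate F16 filename is hypothetical, and the sketch does not cover how MTP tensors are retained, the chat-template patch, or the upload.

```python
from pathlib import Path


def pipeline_commands(hf_dir, work_dir):
    """Return the command lines for convert, mmproj export, and quantize (not executed here)."""
    work = Path(work_dir)
    f16 = work / "qwen35-4b-f16.gguf"            # hypothetical intermediate name
    mmproj = work / "mmproj-qwen35-4b-f16.gguf"
    nvfp4 = work / "qwen35-4b-mtp-nvfp4.gguf"
    return [
        # Convert the BF16 Hugging Face checkpoint to an F16 GGUF.
        ["python", "convert_hf_to_gguf.py", str(hf_dir),
         "--outtype", "f16", "--outfile", str(f16)],
        # Export the vision projector as a separate F16 GGUF.
        ["python", "convert_hf_to_gguf.py", str(hf_dir), "--mmproj",
         "--outtype", "f16", "--outfile", str(mmproj)],
        # Quantize the F16 GGUF to NVFP4 (type name as given in this card).
        ["llama-quantize", str(f16), str(nvfp4), "NVFP4"],
    ]


for cmd in pipeline_commands("Qwen3.5-4B", "out"):
    print(" ".join(cmd))
```

Each entry can be passed to `subprocess.run(cmd, check=True)` once the paths and tool locations match the local setup.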

Verification

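# Requires the gguf package (pip install gguf)
from gguf import GGUFReader

r = GGUFReader("qwen35-4b-mtp-nvfp4.gguf")
print(f"Tensors: {len(r.tensors)}")
# parts[-1] holds the value as a one-element array; take its scalar
mtp_layers = r.fields["qwen35.nextn_predict_layers"].parts[-1][0]
print(f"MTP layers: {int(mtp_layers)}")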
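A lighter check that needs only the standard library is to read the fixed GGUF header and confirm the download is intact before loading it. The sketch assumes the little-endian GGUF layout (version 2 or later): a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts. `read_gguf_header` is a hypothetical helper name.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return version and counts."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: little-endian file; uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}
```

The tensor count returned here should match `len(r.tensors)` from the `GGUFReader` check; a truncated or HTML-error download fails at the magic check instead.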

Hardware

Component	Detail
GPU	NVIDIA Blackwell (RTX 5060 Ti)
CUDA Toolkit	13.2
System RAM	64 GB

License

Apache-2.0 (same as the original Qwen3.5-4B model).

GGUF
Model size: 4B params
Architecture: qwen35
Hardware compatibility: 4-bit

Model tree for FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF
Base model: Qwen/Qwen3.5-4B-Base
Finetuned: Qwen/Qwen3.5-4B
Quantized: this model