Nvidia-Qwen3.6-27B-NVFP4 - GGUF
Quantized GGUF versions of nvidia/Qwen3.6-27B-NVFP4. These were generated using llama.cpp's convert_hf_to_gguf.py (b9859).
- Nvidia-Qwen3.6-27B-NVFP4-A.gguf: All layers are NVFP4 quantized. This required modifying convert_hf_to_gguf.py, and the change needs cleaning up before it could be upstreamed.
- Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf: NVFP4 FFN layers are preserved, while FP8 attention layers are upcast to BF16. This is the default conversion, because GGUF files do not support FP8.
Quantizations provided
| File | Quantization | Size |
|---|---|---|
| Nvidia-Qwen3.6-27B-NVFP4-A.gguf | NVFP4 | 17.9 GB |
| Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf | NVFP4 FFN, BF16 attention | 28.2 GB |
Perplexity test
I tested perplexity using llama-perplexity and Salesforce's wikitext-2-raw-v1.
| File | Ctx | PPL |
|---|---|---|
| Nvidia-Qwen3.6-27B-NVFP4-A.gguf | 512 | 7.7540 ± 0.05396 |
| Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf | 512 | 7.4814 ± 0.05157 |
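To put the two tables side by side, here is a minimal sketch that expresses the perplexity and file-size gap between the two files as relative changes. The numbers are copied from the tables above; the dictionary layout and the helper name `relative_change` are only illustrative.

```python
# Figures copied from the tables above (sizes in GB, perplexity at ctx 512).
quants = {
    "NVFP4-A": {"size_gb": 17.9, "ppl": 7.7540},
    "NVFP4-BF16-Attn": {"size_gb": 28.2, "ppl": 7.4814},
}


def relative_change(new: float, baseline: float) -> float:
    """Return (new - baseline) / baseline as a fraction."""
    return (new - baseline) / baseline


small = quants["NVFP4-A"]
large = quants["NVFP4-BF16-Attn"]

# Positive value: the all-NVFP4 file has higher (worse) perplexity.
ppl_increase = relative_change(small["ppl"], large["ppl"])
# Negative value: the all-NVFP4 file is smaller on disk.
size_change = relative_change(small["size_gb"], large["size_gb"])

print(f"PPL change:  {ppl_increase:+.2%}")
print(f"Size change: {size_change:+.2%}")
```

On these figures the all-NVFP4 file trades roughly 3.6% higher perplexity for a file that is roughly 36.5% smaller.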
Evaluation
The following models were evaluated for a fair comparison of capability, size and speed.
| Model | Quantization | Size | Reason |
|---|---|---|---|
| unsloth/Qwen3.6-27B-MTP-GGUF | UD-Q4_K_XL | 17.9 GB | Closest non-NVFP4 in size to NVFP4. |
| unsloth/Qwen3.6-27B-MTP-GGUF | UD-Q6_K_XL | 26 GB | Closest non-NVFP4 in size to BF16-Attn. |
| unsloth/Qwen3.6-27B-NVFP4 | NVFP4 [1] | 25.4 GB | Alternative NVFP4 quant. |

[1] unsloth/Qwen3.6-27B-NVFP4 does not provide a GGUF. I used llama.cpp's conversion, which passes through Unsloth's NVFP4 tensors.
| Benchmark | CodeFault NVFP4 | CodeFault BF16-Attn | Unsloth NVFP4 | Unsloth UD-Q4_K_XL | Unsloth UD-Q6_K_XL |
|---|---|---|---|---|---|
| Coding | | | | | |
| HumanEval | 0.8415 ± 0.0286 | 0.8354 ± 0.029 | 0.811 ± 0.0307 | 0.8354 ± 0.029 | 0.8537 ± 0.0277 |
| HumanEval+ | 0.7866 ± 0.0321 | 0.7927 ± 0.0318 | 0.7744 ± 0.0327 | 0.7805 ± 0.0324 | 0.7805 ± 0.0324 |
| MBPP | 0.006 ± 0.0035 !! | 0.754 ± 0.0193 | 0.742 ± 0.0196 | 0.756 ± 0.0192 | 0.754 ± 0.0193 |
| MBPP+ | 0.0106 ± 0.0053 !! | 0.8836 ± 0.0165 | 0.8995 ± 0.0155 | 0.8968 ± 0.0157 | 0.8836 ± 0.0165 |
| Instruction | | | | | |
| IFEval | 0.8447 ± 0.0156 | 0.841 ± 0.0157 | 0.8447 ± 0.0156 | | |
| Knowledge | | | | | |
| ARC-Challenge | 0.9659 ± 0.0053 | 0.971 ± 0.0049 | 0.971 ± 0.0049 | 0.971 ± 0.0049 | 0.971 ± 0.0049 |
| MMLU-Pro | 0.835 ± 0.0033 | | | | |
| STEM & Reasoning | | | | | |
| BIG-Bench Hard | 0.926 ± 0.003 | | | | |
| GPQA Diamond | | | | | |
| GSM8K | 0.9098 ± 0.0079 | 0.9083 ± 0.008 | 0.9158 ± 0.0076 | | |
| Hendrycks Math | | | | | |
NOTICE: These tests are still running; empty cells have no result yet.
!!: Such a drastic failure suggests something is wrong with the harness, not the model. I still need to investigate.
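As a rough way to read the ± values, the sketch below divides the gap between two scores by their combined standard error. It assumes the ± figures are standard errors and treats the two runs as independent, which is only an approximation because every model is scored on the same problems. The helper name `z_score` is illustrative and is not part of lm_eval.

```python
import math


def z_score(score_a: float, se_a: float, score_b: float, se_b: float) -> float:
    """Gap between two scores in units of their combined standard error."""
    return abs(score_a - score_b) / math.sqrt(se_a**2 + se_b**2)


# HumanEval: CodeFault NVFP4 vs Unsloth UD-Q6_K_XL (values from the table).
humaneval_gap = z_score(0.8415, 0.0286, 0.8537, 0.0277)

# MBPP: CodeFault NVFP4 vs CodeFault BF16-Attn (values from the table).
mbpp_gap = z_score(0.006, 0.0035, 0.754, 0.0193)

print(f"HumanEval gap: {humaneval_gap:.2f} combined standard errors")
print(f"MBPP gap:      {mbpp_gap:.2f} combined standard errors")
```

Under these assumptions the HumanEval gap is well under one combined standard error, while the MBPP gap is dozens of them, which is in line with the note that the MBPP result points at the harness and not at normal run-to-run noise.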
These evaluations were run using lm_eval. The models were run in instruct (non-thinking) mode with the following parameters in llama-server (b9775):
ctx-size = 32768
cache-type-k = q8_0
cache-type-v = q8_0
top-k = 20
top-p = 0.8
min-p = 0
presence-penalty = 1.5
spec_type = draft-mtp
spec_draft_n_max = 2
chat-template-kwargs = {"enable_thinking":false}
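For trying the same sampling settings by hand, here is a hedged sketch of a single chat request that sends them per request. It assumes llama-server is listening at its default address (`http://localhost:8080`) and that its OpenAI-compatible endpoint accepts `top_k`, `min_p`, and `chat_template_kwargs` in the request body. The evaluation itself set these values on the server. The context size, KV cache types, and speculative decoding options are server-side settings and cannot be sent per request.

```python
import json
import urllib.request

# Assumption: llama-server running locally on its default port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_payload(prompt: str) -> dict:
    """Build a request body that mirrors the sampling settings listed above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "top_k": 20,
        "top_p": 0.8,
        "min_p": 0,
        "presence_penalty": 1.5,
        # Instruct (non-thinking) mode, as used for the evaluations.
        "chat_template_kwargs": {"enable_thinking": False},
    }


def send_chat(prompt: str, url: str = SERVER_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Only build and show the payload here; call send_chat(...) once a server is up.
payload = build_chat_payload("What is the capital of France?")
print(json.dumps(payload, indent=2))
```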
Benchmarks
Benchmarks: Coming after evaluations.
Serving with llama.cpp
The model has a maximum context size of 262,144 tokens. It can be served with:
llama-server \
-hf CodeFault/Nvidia-Qwen3.6-27B-NVFP4-GGUF:NVFP4 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.1 \
--spec-type draft-mtp \
--spec-draft-n-max 2
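If the server is started from a script, the command above can be assembled as an argument list. This is a minimal sketch: the function name `build_server_command` and the optional `ctx_size` parameter are illustrative, and the flags are the ones already shown in this card plus llama-server's `--ctx-size`.

```python
import shlex

DEFAULT_MODEL = "CodeFault/Nvidia-Qwen3.6-27B-NVFP4-GGUF:NVFP4"


def build_server_command(model: str = DEFAULT_MODEL, ctx_size=None) -> list:
    """Return the llama-server argument list used in the command above."""
    args = [
        "llama-server",
        "-hf", model,
        "--temp", "0.6",
        "--top-p", "0.95",
        "--top-k", "20",
        "--repeat-penalty", "1.1",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", "2",
    ]
    if ctx_size is not None:
        # Optional: cap the context window, e.g. to fit available memory.
        args += ["--ctx-size", str(ctx_size)]
    return args


command = build_server_command(ctx_size=32768)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.Popen(command)` to launch the server in the background.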