Instructions to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF",
    filename="glm-4.7-flash-nvfp4.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
./llama-cli -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Use Docker
docker model run hf.co/FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
- LM Studio
- Jan
- vLLM
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
- Ollama
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Ollama:
ollama run hf.co/FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
- Unsloth Studio
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF to start chatting
- Pi
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Run Hermes
hermes
- Atomic Chat
- Docker Model Runner
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Docker Model Runner:
docker model run hf.co/FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
- Lemonade
How to use FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF:NVFP4
Run and chat with the model
lemonade run user.GLM-4.7-Flash-NVFP4-GGUF-NVFP4
List all available models
lemonade list
GLM-4.7-Flash-NVFP4-GGUF
GGUF quantization of zai-org/GLM-4.7-Flash, a 30B-parameter Mixture-of-Experts language model with ~3.2B active parameters per token, built on the DeepSeek2 architecture with Multi-head Latent Attention (MLA) and 64 routed experts.
Quantized to NVFP4 format for efficient inference with minimal quality loss.
About NVFP4
NVFP4 is NVIDIA's native 4-bit floating-point format for Blackwell GPUs. Each weight is stored as an FP4 (E2M1) value, and every small block of weights shares a higher-precision FP8 (E4M3) scale. Blackwell tensor cores can consume this layout directly, which avoids a separate dequantization pass during inference. Compared to INT4 formats, the floating-point element encoding combined with fine-grained block scales gives NVFP4 better dynamic range, which helps it maintain higher quality at similar bit widths.
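As a rough illustration of block-scaled 4-bit floating point, the sketch below quantizes one block of weights to the eight E2M1 magnitudes with a shared scale. It assumes a block size of 16 and an 8-bit scale per block, as in NVIDIA's public description of the format; the exact GGUF tensor layout may differ. The scale is kept as an ordinary Python float instead of being rounded to FP8, and `quantize_block`, `dequantize_block`, and `bits_per_weight` are hypothetical helpers, not llama.cpp functions.

```python
# Non-negative magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: weights per shared scale


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and E2M1 codes."""
    return [c * scale for c in codes]


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Average storage cost: element bits plus the amortized block scale."""
    return element_bits + scale_bits / block_size


scale, codes = quantize_block([0.0, 3.0, -6.0, 1.2] + [0.0] * 12)
print(scale, codes[:4], bits_per_weight())
```

Under these assumptions the baseline cost is 4.5 bits per weight, which is close to the 4.53 BPW reported for this file; the remainder presumably comes from tensors kept at higher precision, though the card does not say so.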
Files
| Filename | Type | Size | Description |
|---|---|---|---|
| glm-4.7-flash-nvfp4.gguf | GGUF (NVFP4) | 15.79 GB | Quantized model weights |
| README.md | Markdown | - | Model card |
Quantization Details
| Property | Value |
|---|---|
| Format | NVFP4 |
| Bits Per Weight | 4.53 BPW |
| File Size | 15.79 GB |
| Tensor Count | 844 |
| Architecture | DeepSeek2 (custom for GLM-4.7-Flash) |
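The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes the reported sizes are binary gigabytes (GiB), which the card does not state, and estimates the parameter count implied by a file size and its average bits per weight. It also uses the 55.79 GB F16 size quoted under Pipeline Commands. `estimate_params` is a hypothetical helper written for this illustration.

```python
GIB = 2**30  # assumption: "GB" in this card means binary gigabytes


def estimate_params(file_size_gib: float, bits_per_weight: float) -> float:
    """Estimate the parameter count implied by a file size and average BPW."""
    total_bits = file_size_gib * GIB * 8
    return total_bits / bits_per_weight


nvfp4_params = estimate_params(15.79, 4.53)  # quantized file
f16_params = estimate_params(55.79, 16.0)    # F16 intermediate file
compression = 55.79 / 15.79                  # size ratio, F16 to NVFP4

print(f"NVFP4 estimate: {nvfp4_params / 1e9:.1f}B parameters")
print(f"F16 estimate:   {f16_params / 1e9:.1f}B parameters")
print(f"Compression:    {compression:.2f}x")
```

Both estimates land close to the stated ~30B total, and the size ratio is close to 16 / 4.53, so the table is internally consistent if metadata overhead is ignored.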
Model Description
- Developer: Zhipu AI
- Architecture: Mixture-of-Experts (MoE) with DeepSeek2-style MLA
- Parameters: ~30B total, ~3.2B active per token
- Context Length: 200,000 tokens
- Layers: 47 transformer layers
- Attention: Multi-head Latent Attention (q_lora_rank=768, kv_lora_rank=512)
- Experts: 64 routed experts (4 per token) + 1 shared expert
- Vocab Size: 151,936
- Languages: English, Chinese
- Thinking: Enabled by default (native <think>/</think> tokens, hidden in history for clean multi-turn reasoning)
- Pipeline: text-generation only (no vision encoder)
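The model's chat template normally handles the thinking tokens. If a client keeps its own conversation history from raw completions, a minimal sketch of dropping the reasoning before storing a turn could look like the following. `strip_thinking` and `append_turn` are hypothetical helpers, and the code assumes the reasoning ends at the last `</think>` tag with the visible answer after it.

```python
def strip_thinking(text: str) -> str:
    """Keep only the text after the last </think> tag, if one is present."""
    _head, sep, tail = text.rpartition("</think>")
    return tail.strip() if sep else text.strip()


def append_turn(history: list, user_msg: str, raw_reply: str) -> list:
    """Store a user/assistant exchange without the reasoning trace."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": strip_thinking(raw_reply)})
    return history


history = append_turn(
    [],
    "What is 23 * 47?",
    "<think>23 * 47 = 23 * 40 + 23 * 7 = 920 + 161</think>\nThe answer is 1081.",
)
print(history[-1]["content"])
```

Splitting on the closing tag alone also covers replies where the opening `<think>` was emitted by the prompt template and not by the model.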
Usage
llama.cpp
# Basic generation
./llama-cli -m glm-4.7-flash-nvfp4.gguf \
-p "Hello, how are you?" \
-n 256
# Single-turn reasoning prompt with conversation mode disabled (-no-cnv)
./llama-cli -m glm-4.7-flash-nvfp4.gguf \
-p "Solve this step by step: 23 * 47" \
-n 512 \
-no-cnv
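The same file can also be served with llama-server, which exposes an OpenAI-compatible HTTP API. The sketch below is a minimal standard-library client. It assumes a server started with something like `./llama-server -m glm-4.7-flash-nvfp4.gguf` and listening on the default `http://localhost:8080`. The `model` string is a placeholder label chosen for this example, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default llama-server address


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": "glm-4.7-flash-nvfp4",  # placeholder label for this example
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 256) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello, how are you?"))` would print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.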
HuggingFace Hub
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
repo_id="FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF",
filename="glm-4.7-flash-nvfp4.gguf",
repo_type="model"
)
Pipeline Commands
Source: zai-org/GLM-4.7-Flash (58 GB, 48 safetensor shards)
F16 GGUF Conversion:
python convert_hf_to_gguf.py D:\AI_MODELS\glm-4.7-src --outfile glm-4.7-f16.gguf --outtype f16
Output: 55.79 GB, 844 tensors (DeepSeek2 arch, Glm4MoeLiteModel)
NVFP4 Quantization:
llama-quantize.exe glm-4.7-f16.gguf glm-4.7-flash-nvfp4.gguf NVFP4
Duration: ~310s on RTX 5060 Ti
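For repeat runs, the two steps above could be scripted. The sketch below only assembles the same two command lines and runs them in order. It assumes `convert_hf_to_gguf.py` and `llama-quantize` are reachable from the working directory and that the llama.cpp build in use accepts the `NVFP4` quantization type. `build_pipeline` and `run_pipeline` are hypothetical helper names.

```python
import subprocess


def build_pipeline(src_dir, f16_path, out_path, quant_type="NVFP4",
                   convert_script="convert_hf_to_gguf.py",
                   quantize_bin="llama-quantize"):
    """Return the conversion and quantization commands as argument lists."""
    convert_cmd = ["python", convert_script, str(src_dir),
                   "--outfile", str(f16_path), "--outtype", "f16"]
    quantize_cmd = [quantize_bin, str(f16_path), str(out_path), quant_type]
    return [convert_cmd, quantize_cmd]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline("glm-4.7-src", "glm-4.7-f16.gguf",
                          "glm-4.7-flash-nvfp4.gguf")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_pipeline(commands)` would execute both steps. On Windows, pass `quantize_bin="llama-quantize.exe"` to match the command shown above.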
Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 5060 Ti 16 GB (Blackwell) |
| System RAM | 64 GB |
| Storage | D: (NVMe) |
License
MIT, same as the original zai-org/GLM-4.7-Flash.