============================================================================
ARCHITECTURE COMPATIBILITY SUMMARY
============================================================================

| Tool      | Format Needed | GLM-5.2 Supported?   | Mount Works? |
|-----------|---------------|----------------------|--------------|
| vLLM      | safetensors   | YES (remote_code)    | YES          |
| Ollama    | GGUF          | YES (via GGUF)       | YES          |
| Ollama    | safetensors   | NO (arch not listed) | N/A          |
| TGI       | safetensors   | YES                  | YES          |
| llama.cpp | GGUF          | YES (via GGUF)       | YES          |
| HF API    | N/A (remote)  | YES (live providers) | N/A          |

Mount GLM-5.2 Locally Without Downloading: Complete Guide
Model: zai-org/GLM-5.2 (753B params, MoE, safetensors)
Using: hf-mount + HF_TOKEN + Modelfile/Manifest + vLLM/Ollama
============================================================================
PREREQUISITES
============================================================================
1. Hugging Face Access Token (Read scope)
   → https://huggingface.co/settings/tokens
2. A copy of the model in your own bucket, or access to the original repo
   → Your bucket path: adminglory/
   → OR mount the original repo: zai-org/GLM-5.2
3. Operating System:
   → Linux x86_64 / aarch64, macOS Apple Silicon, or Windows (NFS only)
============================================================================
STEP 1: SET YOUR HF TOKEN
============================================================================
export HF_TOKEN="hf_your_token_here"
Optional: persist in shell profile
echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.bashrc   # or ~/.zshrc
source ~/.bashrc
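Before moving on, it can help to confirm that the token is actually visible to the current shell. The sketch below is a hypothetical helper (`check_hf_token` is not part of any tool); the `hf_` prefix test is only a heuristic based on the placeholder used in this guide.

```shell
# Hypothetical helper: succeed if HF_TOKEN is set and starts with "hf_".
# The prefix check is a heuristic, not an official validation rule.
check_hf_token() {
  case "${HF_TOKEN:-}" in
    hf_?*) return 0 ;;
    "")    echo "HF_TOKEN is not set" >&2; return 1 ;;
    *)     echo "HF_TOKEN does not start with hf_" >&2; return 1 ;;
  esac
}
```

Usage: `check_hf_token && echo "token looks OK"`.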
============================================================================
STEP 2: INSTALL hf-mount
============================================================================
--- macOS or Linux (Homebrew) ---
brew install hf-mount
--- Linux without Homebrew (install script) ---
curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh
This installs to ~/.local/bin/; add it to PATH if needed:
export PATH="$HOME/.local/bin:$PATH"
--- Windows (NFS backend only) ---
Download from: https://github.com/huggingface/hf-mount/releases/latest
Enable NFS Client:
Enable-WindowsOptionalFeature -Online -FeatureName ServicesForNFS-ClientOnly,ClientForNFS-Infrastructure -All
Run as Administrator (port 111 is privileged)
For FUSE backend on Linux (optional, tighter kernel integration):
sudo apt-get install -y fuse3
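After installing, a quick PATH check catches the common case where ~/.local/bin was not picked up. This is a minimal sketch; `missing_cmds` is a hypothetical helper name, and the commands passed to it are just the tools this guide mentions.

```shell
# Hypothetical helper: print every command from the argument list that is
# not found on PATH. Returns 0 only when all of them were found.
missing_cmds() {
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}
```

Usage: `missing_cmds hf-mount vllm ollama` prints whichever of the three still needs installing.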
============================================================================
STEP 3: MOUNT THE MODEL (NO FULL DOWNLOAD, LAZY LOADING)
============================================================================
--- Option A: Mount from YOUR bucket (read-write) ---
hf-mount start --hf-token "$HF_TOKEN" bucket adminglory/your-bucket-name /mnt/glm52
--- Option B: Mount from the original model repo (read-only) ---
hf-mount start --hf-token "$HF_TOKEN" repo zai-org/GLM-5.2 /mnt/glm52
--- Option C: Mount a GGUF version for Ollama (read-only) ---
hf-mount start --hf-token "$HF_TOKEN" repo unsloth/GLM-5.2-GGUF /mnt/glm52-gguf
--- Option D: FUSE backend (tighter integration, requires fuse3/macFUSE) ---
hf-mount start --fuse --hf-token "$HF_TOKEN" repo zai-org/GLM-5.2 /mnt/glm52
Verify the mount:
ls -la /mnt/glm52/
You should see: config.json, model-*.safetensors, tokenizer.json, etc.
Files are fetched lazily: only the bytes you read hit the network.
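If the mount is started from a script, the files may not be visible the instant the start command returns (this is an assumption, not documented behavior). A small polling loop avoids racing the mount; `wait_for_path` is a hypothetical helper.

```shell
# Hypothetical helper: poll until PATH exists or TIMEOUT seconds have passed.
# Returns 0 once the path is visible, 1 on timeout. Default timeout: 30 s.
wait_for_path() {
  path=$1
  timeout=${2:-30}
  elapsed=0
  while [ ! -e "$path" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

Usage: `wait_for_path /mnt/glm52/config.json 60 || echo "mount not ready" >&2`.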
============================================================================
STEP 4: VERIFY MOUNTED FILES
============================================================================
Check config exists
head -20 /mnt/glm52/config.json
List weight files
ls -lh /mnt/glm52/*.safetensors
List tokenizer files
ls /mnt/glm52/tokenizer*
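The three checks above can be rolled into one function that fails loudly when something is missing. A minimal sketch; `verify_model_dir` is a hypothetical name, and the file patterns are the ones listed in this step.

```shell
# Hypothetical helper: check that DIR holds config.json, at least one
# *.safetensors file, and at least one tokenizer* file.
# Reports each missing item on stderr; returns 0 only if all are present.
verify_model_dir() {
  dir=$1
  ok=0
  [ -f "$dir/config.json" ] || { echo "missing: config.json" >&2; ok=1; }
  # With default shell settings an unmatched glob stays literal, so ls fails.
  ls "$dir"/*.safetensors >/dev/null 2>&1 || { echo "missing: *.safetensors" >&2; ok=1; }
  ls "$dir"/tokenizer* >/dev/null 2>&1 || { echo "missing: tokenizer*" >&2; ok=1; }
  return "$ok"
}
```

Usage: `verify_model_dir /mnt/glm52 && echo "mount looks complete"`. Only directory metadata is touched, so this should not pull weight data over the network.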
============================================================================
STEP 5A: SERVE WITH vLLM (RECOMMENDED, supports safetensors directly)
============================================================================
vLLM supports more architectures than Ollama and can load safetensors
from any local path (including hf-mount paths).
IMPORTANT: For NFS/network mounts, use --safetensors-load-strategy eager
to avoid inefficient random reads. vLLM auto-detects NFS and may
auto-enable prefetching if the checkpoint fits in 90% of RAM.
pip install vllm
Serve from the mounted path:
vllm serve /mnt/glm52 \
  --served-model-name glm-5.2 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --safetensors-load-strategy eager \
  --max-model-len 32768 \
  --trust-remote-code
Test the endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ]
  }'
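To get a feel for the 90%-of-RAM threshold mentioned in this step, the arithmetic can be sketched as below. This only illustrates the rule as stated here and is not vLLM's actual check. Both helper names are hypothetical, and 2 bytes per parameter (BF16) is an assumption about the checkpoint precision.

```shell
# Hypothetical helper: rough weight size in GB (decimal) from the parameter
# count in billions and the bytes stored per parameter.
weights_gb() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: succeed if a checkpoint of CKPT_GB fits within 90%
# of RAM_GB. Integer arithmetic: ckpt * 100 <= ram * 90.
fits_in_ram() {
  ckpt_gb=$1
  ram_gb=$2
  [ $((ckpt_gb * 100)) -le $((ram_gb * 90)) ]
}

weights_gb 753 2   # 753B params at an assumed 2 bytes each: prints 1506
```

Under that assumption, `fits_in_ram 1506 2048` succeeds while `fits_in_ram 1506 1536` fails, so a host would need well over 1.5 TB of RAM before the whole checkpoint could be held in memory.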
============================================================================
STEP 5B: SERVE WITH OLLAMA (requires GGUF format)
============================================================================
NOTE: GLM-5.2 uses "glm_moe_dsa" architecture.
Ollama's safetensors import ONLY supports: Llama, Mistral, Gemma, Phi3.
Therefore, you MUST use a GGUF version for Ollama.
Available GGUF quants:
- unsloth/GLM-5.2-GGUF (most popular, 125K downloads)
- cfontes/GLM-5.2-Q4_K_M-GGUF
- Abiray/GLM-5.2-Q4_K_M-GGUF
1. Mount the GGUF repo:
hf-mount start --hf-token "$HF_TOKEN" repo unsloth/GLM-5.2-GGUF /mnt/glm52-gguf
2. List available GGUF files (pick a quant that fits your RAM/VRAM):
ls -lh /mnt/glm52-gguf/*.gguf
3. Create a Modelfile (see STEP 6 below)
4. Build the Ollama model:
ollama create glm-5.2 -f Modelfile
5. Run it:
ollama run glm-5.2
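For step 2 of the list above, picking a quant by eye gets tedious when a repo holds many files. The sketch below filters GGUF files by a byte budget using file metadata only; `ggufs_under` is a hypothetical helper. It compares each file individually, so a quant split across several shards would need its sizes summed, which this sketch does not do.

```shell
# Hypothetical helper: list *.gguf files under DIR (recursively) whose size
# is below MAX_BYTES. find's "-size -Nc" test means "fewer than N bytes".
ggufs_under() {
  dir=$1
  max_bytes=$2
  find "$dir" -name '*.gguf' -size "-${max_bytes}c" | sort
}
```

Usage: `ggufs_under /mnt/glm52-gguf $((400 * 1024 * 1024 * 1024))` lists files smaller than 400 GiB.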
============================================================================
STEP 6: MODELFILE (for Ollama, GGUF required)
============================================================================
File: Modelfile
Place this file in your working directory.
Adjust the GGUF filename and quant to match what you mounted.
>>> CUT HERE: Modelfile content <<<
FROM /mnt/glm52-gguf/GLM-5.2-Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 50
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
# System prompt
SYSTEM """You are GLM-5.2, a helpful AI assistant. Respond clearly and concisely."""
# Chat template (GLM format)
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|endoftext|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|endoftext|>
{{ end }}<|assistant|>
{{ .Response }}<|endoftext|>
"""
>>> END Modelfile <<<
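A frequent mistake is a FROM line that points at a filename that does not exist on the mount. The sketch below extracts the FROM path and checks it before running ollama create. Both helper names are hypothetical, and the check only makes sense when FROM is a file path rather than a model name.

```shell
# Hypothetical helper: print the value of the first FROM line in a Modelfile.
modelfile_from() {
  sed -n 's/^FROM[[:space:]][[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if that value names an existing file.
check_modelfile() {
  src=$(modelfile_from "$1")
  [ -n "$src" ] && [ -f "$src" ]
}
```

Usage: `check_modelfile Modelfile || echo "FROM path not found: $(modelfile_from Modelfile)" >&2`.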
============================================================================
STEP 7: OLLAMA MANIFEST (auto-generated)
============================================================================
When you run ollama create glm-5.2 -f Modelfile, Ollama automatically
creates a manifest at:
~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest
The manifest is a JSON file that maps layers to SHA256 blob digests
(digest and size values are shown as placeholders):
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:<digest>",
    "size": <bytes>
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:<digest>",
      "size": <bytes>
    }
  ]
}
To inspect the Modelfile that Ollama stored for the model:
ollama show glm-5.2 --modelfile
To view the raw manifest:
cat ~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest
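To see which blobs a manifest refers to, the digests can be pulled out with grep. This is a rough sketch: both helper names are hypothetical, and the blob layout (`blobs/sha256-<hex>` under the models directory, overridable through `OLLAMA_MODELS`) is an assumption about how Ollama stores files locally, so verify it against your installation.

```shell
# Hypothetical helper: print every sha256 digest in a manifest, one per line.
manifest_digests() {
  grep -o 'sha256:[0-9a-f]\{64\}' "$1"
}

# Hypothetical helper: map "sha256:<hex>" to the assumed on-disk blob path.
blob_path() {
  models_dir=${OLLAMA_MODELS:-$HOME/.ollama/models}
  echo "$models_dir/blobs/$(echo "$1" | sed 's/^sha256:/sha256-/')"
}
```

Usage: `manifest_digests ~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest | while read -r d; do ls -lh "$(blob_path "$d")"; done`.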
============================================================================
STEP 8: UNMOUNT WHEN DONE
============================================================================
NFS or FUSE (macOS):
umount /mnt/glm52
FUSE (Linux):
fusermount -u /mnt/glm52
Or use the daemon stop command:
hf-mount stop /mnt/glm52
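The choice between the first two commands depends on OS and backend, which a script can decide for itself. A minimal sketch, following the mapping listed in this step; `unmount_cmd` is a hypothetical helper that only prints the command rather than running it.

```shell
# Hypothetical helper: print the unmount command for an OS name (as reported
# by `uname -s`), a backend ("nfs" or "fuse"), and a mount point.
# FUSE on Linux uses fusermount; every other case uses plain umount.
unmount_cmd() {
  os=$1
  backend=$2
  mnt=$3
  if [ "$os" = "Linux" ] && [ "$backend" = "fuse" ]; then
    echo "fusermount -u $mnt"
  else
    echo "umount $mnt"
  fi
}
```

Usage: `unmount_cmd "$(uname -s)" fuse /mnt/glm52` prints the command to run on the current machine.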
============================================================================
ALTERNATIVE: USE HF INFERENCE ENDPOINT (no local GPU needed)
============================================================================
GLM-5.2 has live inference providers on HF:
novita, together, fireworks-ai, featherless-ai, zai-org, deepinfra
You can call it via the HF Inference API with your token:
curl https://api-inference.huggingface.co/models/zai-org/GLM-5.2 \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Hello, what can you do?",
    "parameters": {
      "max_new_tokens": 256,
      "temperature": 0.7
    }
  }'
Or use the OpenAI-compatible endpoint via HF Router:
curl https://router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256
  }'
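When the prompt comes from a variable, hand-writing the JSON body breaks as soon as the text contains a double quote. The sketch below builds the chat-completions body with basic escaping. `chat_payload` is a hypothetical helper that handles only backslashes and double quotes (not newlines or control characters), so a real JSON tool is the safer choice for arbitrary input.

```shell
# Hypothetical helper: build a chat-completions request body for MODEL and
# PROMPT, escaping backslashes first and then double quotes in the prompt.
chat_payload() {
  escaped=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":256}\n' \
    "$1" "$escaped"
}
```

Usage: `curl https://router.huggingface.co/v1/chat/completions -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" -d "$(chat_payload zai-org/GLM-5.2 'Hello!')"`.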