Mount GLM-5.2 Locally Without Downloading: Complete Guide

Model: zai-org/GLM-5.2 (753B params, MoE, safetensors)

Using: hf-mount + HF_TOKEN + Modelfile/Manifest + vLLM/Ollama

============================================================================

PREREQUISITES

============================================================================

1. Hugging Face Access Token (Read scope)

→ https://huggingface.co/settings/tokens

2. The model, either in your own bucket or in the original repo

→ Your bucket path: adminglory/

→ OR mount the original repo: zai-org/GLM-5.2

3. Operating System:

→ Linux x86_64 / aarch64, macOS Apple Silicon, or Windows (NFS only)

============================================================================

STEP 1: SET YOUR HF TOKEN

============================================================================

export HF_TOKEN="hf_your_token_here"

Optional: persist in shell profile

echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.bashrc # or ~/.zshrc

source ~/.bashrc
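
Before mounting, it can help to confirm that the token is actually set in the current shell. The following is a minimal sketch, assuming user access tokens begin with `hf_` as in the placeholder above. `check_hf_token` is a hypothetical helper name, not part of any Hugging Face tool, and it checks only the shape of the token, not whether the Hub accepts it.

```shell
# Hypothetical helper: succeed only if a token is set and looks like an
# HF user access token. Assumption: such tokens start with "hf_".
check_hf_token() {
    token="${1:-${HF_TOKEN:-}}"   # explicit argument wins, else $HF_TOKEN
    case "$token" in
        hf_?*) return 0 ;;        # "hf_" followed by at least one character
        "")    echo "HF_TOKEN is not set" >&2; return 1 ;;
        *)     echo "HF_TOKEN does not start with hf_" >&2; return 1 ;;
    esac
}
```

Usage: `check_hf_token && echo "token looks OK"`.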

============================================================================

STEP 2: INSTALL hf-mount

============================================================================

--- macOS or Linux (Homebrew) ---

brew install hf-mount

--- Linux without Homebrew (install script) ---

curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh

This installs to ~/.local/bin/. Add it to PATH if needed:

export PATH="$HOME/.local/bin:$PATH"
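
If that export line goes into a shell profile, PATH gains a duplicate entry every time the profile is re-sourced. One way to avoid this is an idempotent prepend; the sketch below uses a hypothetical helper name, `prepend_path`.

```shell
# Hypothetical helper: put a directory at the front of PATH unless it is
# already listed, so re-sourcing a profile does not grow PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}
```

Usage: `prepend_path "$HOME/.local/bin"`, then `command -v hf-mount` to confirm the binary is found.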

--- Windows (NFS backend only) ---

Download from: https://github.com/huggingface/hf-mount/releases/latest

Enable NFS Client:

Enable-WindowsOptionalFeature -Online -FeatureName ServicesForNFS-ClientOnly,ClientForNFS-Infrastructure -All

Run as Administrator (port 111 is privileged)

For FUSE backend on Linux (optional, tighter kernel integration):

sudo apt-get install -y fuse3

============================================================================

STEP 3: MOUNT THE MODEL (NO FULL DOWNLOAD, LAZY LOADING)

============================================================================

--- Option A: Mount from YOUR bucket (read-write) ---

hf-mount start --hf-token $HF_TOKEN bucket adminglory/your-bucket-name /mnt/glm52

--- Option B: Mount from the original model repo (read-only) ---

hf-mount start --hf-token $HF_TOKEN repo zai-org/GLM-5.2 /mnt/glm52

--- Option C: Mount a GGUF version for Ollama (read-only) ---

hf-mount start --hf-token $HF_TOKEN repo unsloth/GLM-5.2-GGUF /mnt/glm52-gguf

--- Option D: FUSE backend (tighter integration, requires fuse3/macFUSE) ---

hf-mount start --fuse --hf-token $HF_TOKEN repo zai-org/GLM-5.2 /mnt/glm52

Verify the mount:

ls -la /mnt/glm52/

You should see: config.json, model-*.safetensors, tokenizer.json, etc.

Files are fetched LAZILY: only the bytes you read hit the network.

============================================================================

STEP 4: VERIFY MOUNTED FILES

============================================================================

Check config exists

head -n 20 /mnt/glm52/config.json

List weight files

ls -lh /mnt/glm52/*.safetensors

List tokenizer files

ls /mnt/glm52/tokenizer*
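
The three checks above can be folded into one function that fails loudly when something is missing. This is a sketch only: `check_model_dir` is a hypothetical name, and the file patterns are the ones this guide expects to see in the mount.

```shell
# Hypothetical helper: confirm a directory has the files listed above.
# Only file names are checked; no file contents are read.
check_model_dir() {
    dir="$1"
    [ -f "$dir/config.json" ] || { echo "missing config.json" >&2; return 1; }
    # ls exits non-zero when the glob matches nothing
    ls "$dir"/*.safetensors >/dev/null 2>&1 || { echo "no .safetensors files" >&2; return 1; }
    ls "$dir"/tokenizer* >/dev/null 2>&1 || { echo "no tokenizer files" >&2; return 1; }
    echo "ok: $dir"
}
```

Usage: `check_model_dir /mnt/glm52`.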

============================================================================

STEP 5A: SERVE WITH vLLM (RECOMMENDED: supports safetensors directly)

============================================================================

vLLM supports more architectures than Ollama and can load safetensors

from any local path (including hf-mount paths).

IMPORTANT: For NFS/network mounts, use --safetensors-load-strategy eager

to avoid inefficient random reads. vLLM auto-detects NFS and may

auto-enable prefetching if the checkpoint fits in 90% of RAM.
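
The 90% rule mentioned above is simple integer arithmetic. The sketch below illustrates the threshold only; how vLLM measures checkpoint size and RAM internally is not shown in this guide, and `fits_in_ram` is a hypothetical helper name.

```shell
# Sketch of the 90% rule described above. Both arguments are sizes in bytes.
fits_in_ram() {
    ckpt_bytes="$1"
    ram_bytes="$2"
    # ckpt <= 0.9 * ram, rewritten as 10 * ckpt <= 9 * ram to stay in integers
    [ $((ckpt_bytes * 10)) -le $((ram_bytes * 9)) ]
}
```

Usage: `fits_in_ram "$checkpoint_bytes" "$ram_bytes" && echo "checkpoint fits in 90% of RAM"`, where both variables are values you supply.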

pip install vllm

Serve from the mounted path:

vllm serve /mnt/glm52 \
  --served-model-name glm-5.2 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --safetensors-load-strategy eager \
  --max-model-len 32768 \
  --trust-remote-code

Test the endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "glm-5.2", "messages": [ {"role": "user", "content": "Hello! What can you do?"} ] }'
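
The server will refuse connections until the weights have finished loading, so a test request sent too early fails. A small retry loop avoids guessing when it is ready. This is a sketch with a hypothetical helper name, `wait_until`; it retries any command once per second until it succeeds or the timeout expires.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or timeout_s seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_s="$1"; shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Usage: `wait_until 600 curl -fsS http://localhost:8000/v1/models && echo "server is up"`. The `/v1/models` route belongs to the OpenAI-compatible API that `vllm serve` exposes; the 600-second timeout is an arbitrary example value.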

============================================================================

STEP 5B: SERVE WITH OLLAMA (requires GGUF format)

============================================================================

NOTE: GLM-5.2 uses "glm_moe_dsa" architecture.

Ollama's safetensors import ONLY supports: Llama, Mistral, Gemma, Phi3.

Therefore, you MUST use a GGUF version for Ollama.

Available GGUF quants:

- unsloth/GLM-5.2-GGUF (most popular, 125K downloads)

- cfontes/GLM-5.2-Q4_K_M-GGUF

- Abiray/GLM-5.2-Q4_K_M-GGUF

1. Mount the GGUF repo:

hf-mount start --hf-token $HF_TOKEN repo unsloth/GLM-5.2-GGUF /mnt/glm52-gguf

2. List available GGUF files (pick a quant that fits your RAM/VRAM):

ls -lh /mnt/glm52-gguf/*.gguf

3. Create a Modelfile (see STEP 6 below)

4. Build the Ollama model:

ollama create glm-5.2 -f Modelfile

5. Run it:

ollama run glm-5.2
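
For step 2 above, the file listing can be filtered by a size budget instead of read by eye. The sketch below assumes each quant is a single top-level `.gguf` file; a quant split into shards would need its shard sizes summed instead. `gguf_within` is a hypothetical helper name.

```shell
# Hypothetical helper: print .gguf files in a directory whose size is at
# most max_bytes. Assumption: one file per quant, directly in the directory.
gguf_within() {
    dir="$1"
    max_bytes="$2"
    # "-size -Nc" matches files smaller than N bytes, using metadata only
    find "$dir" -maxdepth 1 -type f -name '*.gguf' -size "-$((max_bytes + 1))c" | sort
}
```

Usage: `gguf_within /mnt/glm52-gguf "$budget_bytes"`, where `budget_bytes` is the RAM or VRAM you are willing to give the weights.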

============================================================================

STEP 6: MODELFILE (for Ollama, GGUF required)

============================================================================

File: Modelfile

Place this file in your working directory.

Adjust the GGUF filename and quant to match what you mounted.

>>> CUT HERE: Modelfile content <<<

FROM /mnt/glm52-gguf/GLM-5.2-Q4_K_M.gguf

# Model parameters

PARAMETER temperature 0.7

PARAMETER top_p 0.9

PARAMETER top_k 50

PARAMETER num_ctx 32768

PARAMETER repeat_penalty 1.1

PARAMETER stop "<|endoftext|>"

PARAMETER stop "<|user|>"

PARAMETER stop "<|assistant|>"

# System prompt

SYSTEM """You are GLM-5.2, a helpful AI assistant. Respond clearly and concisely."""

# Chat template (GLM format)

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|endoftext|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|endoftext|>
{{ end }}<|assistant|>
{{ .Response }}<|endoftext|>
"""

>>> END Modelfile <<<
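
Because the GGUF filename depends on the quant you mounted, it can be convenient to generate the first lines of the Modelfile rather than edit them by hand. The sketch below is a minimal version: `write_modelfile` is a hypothetical helper, it copies a few of the parameter values from above, and it leaves out the SYSTEM and TEMPLATE blocks, which can be appended afterwards.

```shell
# Hypothetical helper: write a minimal Modelfile pointing at a GGUF file.
# Refuses to write anything if the GGUF path does not exist.
write_modelfile() {
    gguf="$1"
    out="${2:-Modelfile}"
    [ -f "$gguf" ] || { echo "GGUF not found: $gguf" >&2; return 1; }
    cat > "$out" <<EOF
FROM $gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 32768
EOF
}
```

Usage: `write_modelfile /mnt/glm52-gguf/GLM-5.2-Q4_K_M.gguf Modelfile`.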

============================================================================

STEP 7: OLLAMA MANIFEST (auto-generated)

============================================================================

When you run ollama create glm-5.2 -f Modelfile, Ollama automatically

creates a manifest at:

~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest

The manifest is a JSON file that maps layers to SHA256 blob digests:

{

"schemaVersion": 2,

"mediaType": "application/vnd.docker.distribution.manifest.v2+json",

"config": {

"mediaType": "application/vnd.docker.container.image.v1+json",

"digest": "sha256:",

"size":

},

"layers": [

{

"mediaType": "application/vnd.ollama.image.model",

"digest": "sha256:",

"size":

}

]

}

To print the Modelfile that Ollama stored for the model:

ollama show glm-5.2 --modelfile

To view the raw manifest:

cat ~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest
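
To see which blobs a manifest refers to without reading the JSON by eye, the digests can be pulled out with standard text tools. This is a sketch only: `manifest_digests` is a hypothetical helper, and it uses pattern matching rather than a real JSON parser.

```shell
# Hypothetical helper: print every sha256 digest found in a manifest file,
# one per line, in the order they appear (config first, then layers).
manifest_digests() {
    grep -Eo '"digest"[[:space:]]*:[[:space:]]*"sha256:[0-9a-f]+"' "$1" |
        sed -E 's/.*"(sha256:[0-9a-f]+)"/\1/'
}
```

Usage: `manifest_digests ~/.ollama/models/manifests/registry.ollama.ai/library/glm-5.2/latest`.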

============================================================================

STEP 8: UNMOUNT WHEN DONE

============================================================================

NFS or FUSE (macOS):

umount /mnt/glm52

FUSE (Linux):

fusermount -u /mnt/glm52

Or use the daemon stop command:

hf-mount stop /mnt/glm52
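
The choice between `umount` and `fusermount -u` depends on the backend and the operating system, as listed above. The sketch below encodes that choice in a hypothetical helper, `unmount_cmd`. It reads the first case above as "NFS, or FUSE on macOS", covers only Linux and macOS, and prints the command instead of running it.

```shell
# Hypothetical helper: print the unmount command for a backend and OS,
# following the cases listed above. os is the output of `uname -s`.
unmount_cmd() {
    backend="$1"; os="$2"; mnt="$3"
    if [ "$backend" = "fuse" ] && [ "$os" = "Linux" ]; then
        echo "fusermount -u $mnt"
    else
        echo "umount $mnt"     # NFS, or FUSE on macOS
    fi
}
```

Usage: `unmount_cmd fuse "$(uname -s)" /mnt/glm52` prints the command to run.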

============================================================================

ALTERNATIVE: USE HF INFERENCE ENDPOINT (no local GPU needed)

============================================================================

GLM-5.2 has live inference providers on HF:

novita, together, fireworks-ai, featherless-ai, zai-org, deepinfra

You can call it via the HF Inference API with your token:

curl https://api-inference.huggingface.co/models/zai-org/GLM-5.2 \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "inputs": "Hello, what can you do?", "parameters": { "max_new_tokens": 256, "temperature": 0.7 } }'

Or use the OpenAI-compatible endpoint via HF Router:

curl https://router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "model": "zai-org/GLM-5.2", "messages": [ {"role": "user", "content": "Hello!"} ], "max_tokens": 256 }'

============================================================================

ARCHITECTURE COMPATIBILITY SUMMARY

============================================================================

| Tool      | Format Needed | GLM-5.2 Supported?   | Mount Works? |
|-----------|---------------|----------------------|--------------|
| vLLM      | safetensors   | YES (remote_code)    | YES          |
| Ollama    | GGUF          | YES (via GGUF)       | YES          |
| Ollama    | safetensors   | NO (arch not listed) | N/A          |
| TGI       | safetensors   | YES                  | YES          |
| llama.cpp | GGUF          | YES (via GGUF)       | YES          |
| HF API    | N/A (remote)  | YES (live providers) | N/A          |
