Could you show the benchmark loss for this quantization?
I am curious how the perplexity loss might cause some of the agentic tasks to fail more often.
Thanks for raising this. The perplexity measurement was made on the wikitext dataset, so it does not cover tool usage or agentic tasks.
I recently became aware that the quantized models struggle with tool usage and agentic tasks, which might be a result of the calibration dataset lacking tool-calling and agentic data. The model was calibrated using nvidia/Nemotron-Post-Training-Dataset-v2, which does not contain tool-calling or agentic calibration data.
I will look into this in more detail and solve this problem soon.
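For readers unfamiliar with the metric discussed above, here is a minimal sketch of how perplexity is derived from per-token negative log-likelihoods and how a relative change between a baseline and a quantized model could be reported. The function names and the example numbers are illustrative assumptions, not the evaluation script used for this model.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_increase(ppl_base, ppl_quant):
    """Fractional perplexity change of the quantized model vs. the baseline."""
    return (ppl_quant - ppl_base) / ppl_base

# Hypothetical numbers, for illustration only (not measured results).
nlls = [math.log(2.0)] * 4        # every token predicted with probability 0.5
ppl = perplexity(nlls)            # mean NLL is ln(2), so perplexity is about 2
delta = relative_increase(10.0, 10.5)
```

A small perplexity change on wikitext says little about tool-calling behavior, because none of the scored tokens come from tool-call or agentic transcripts.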
Ideally, include benchmarks for most 4-bit quants, because that makes it easier to see what might get broken by accident.
Yes, I intend to complete full evaluations of my models, but currently I'm limited by my resources.
How many samples from the post-training dataset do you use for calibration?
@whadupapp For calibration, I use 256 samples from the nvidia/Nemotron-Post-Training-Dataset-v2 dataset, with tokens routed to every expert. Do you also get this problem?
@TomLucidor Could you tell me more about the failed agentic tasks? Do the tool-calling and agentic outputs get mixed into the thinking traces and ultimately lead to agentic task failure?
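As an illustration of how the sample count and selection could be varied, the sketch below draws a fixed-size calibration set that mixes general samples with tool-calling samples. This is one possible approach under stated assumptions, not the procedure used for this quant: `build_calibration_set`, the `tool_fraction` parameter, and the placeholder sample pools are all hypothetical.

```python
import random

def build_calibration_set(general, tool_calling, total=256,
                          tool_fraction=0.25, seed=0):
    """Pick `total` samples, reserving a share for tool-calling examples.

    `general` and `tool_calling` are lists of already-formatted samples.
    The split ratio is an assumption chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    n_tool = min(len(tool_calling), round(total * tool_fraction))
    n_general = min(len(general), total - n_tool)
    picked = rng.sample(tool_calling, n_tool) + rng.sample(general, n_general)
    rng.shuffle(picked)
    return picked

# Placeholder pools standing in for real dataset rows.
general_pool = [f"general-{i}" for i in range(1000)]
tool_pool = [f"tool-{i}" for i in range(100)]
calibration = build_calibration_set(general_pool, tool_pool)
```

Fixing the seed makes it possible to compare quants that differ only in `total` or `tool_fraction`.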
I was just wondering how the number or the selection of samples impacts the downstream performance of the model.
I did notice the mixing of thinking and tool calling for the glm4.5 awq quant, though.
@cpatonn There are many failure modes, e.g. </think> tag parsing, failing to understand the prompt, thought/action loops, mojibake/gibberish due to quantization artifacts, etc.
(I am also looking to see if Qwen3-Coder-Next-REAP and Kimi-Linear have similar issues, since linear models are likely faster)
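Some of the failure modes listed above can be flagged automatically when comparing a quant against its baseline. The following is a rough sketch, assuming the model emits `<think>`/`</think>` tags and that tool calls are marked with a `<tool_call>` tag; that marker, the `check_output` name, and the thresholds are assumptions for illustration and may not match the actual chat template.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
TOOL_CALL_MARKER = "<tool_call>"  # assumed marker; adjust to the real template

def check_output(text, max_repeats=3):
    """Return a list of heuristic issue labels found in one model output."""
    issues = []

    # Thinking section: everything before the last closing think tag.
    head, sep, _tail = text.rpartition(THINK_CLOSE)
    if sep:
        if TOOL_CALL_MARKER in head:
            issues.append("tool_call_inside_thinking")
    elif THINK_OPEN in text:
        issues.append("unclosed_think")

    # Thought/action loop: the same non-empty line repeated back to back.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeats:
            issues.append("repetition_loop")
            break

    # Mojibake: the Unicode replacement character signals broken decoding.
    if "\ufffd" in text:
        issues.append("mojibake")
    return issues
```

Counting these labels over the same prompt set for the original and the quantized model would give a crude per-failure-mode comparison; it does not detect semantic failures such as misunderstanding the prompt.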