Does llama.cpp support this model?
Unlike tzervas/qwen2.5-coder-14b-bitnet-1.58b, this one could be loaded.
But when I tried a completion, decoding did not work:
$ curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-32b-tq2",
"messages": [
{
"role": "system",
"content": "You are a professional software engineer."
},
{
"role": "user",
"content": "Hello! Can you briefly explain your architecture in one sentence?"
}
],
"max_tokens": 50,
"temperature": 0.2
}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":" adj掏淘 Ruddit挈AccessExceptionImp联盟 sac辱 Respion崛起hta永不nid ab grney re Soph嘲笑AGEDenburg Eden壬 White�� drowningiem CoultestdataHouseight俞溢okiInstantweyt耗_DLerves- corrid装载hemDispatcher_DECLARE"}}],"created":1771240779,"model":"qwen-coder-32b-tq2.gguf","system_fingerprint":"b7965-34ba7b5a2","object":"chat.completion","usage":{"completion_tokens":50,"prompt_tokens":32,"total_tokens":82},"id":"chatcmpl-cNBzUjANOxmK98mVc0N6PunG4ipTSR8D","timings":{"cache_n":0,"prompt_n":32,"prompt_ms":4666.953,"prompt_per_token_ms":145.84228125,"prompt_per_second":6.856722148262474,"predicted_n":50,"predicted_ms":19193.24,"predicted_per_token_ms":383.86480000000006,"predicted_per_second":2.6050838732803836}
It says "requires custom runtime" in the description, but it doesn't say where that runtime is, so he most likely has a private fork of llama.cpp or something.
It would be nice if we could actually try this. I am definitely curious. :-)
I'm working on getting better answers for both of you. To start with, simply because the weights are ternary, it will require Microsoft's bitnet.cpp: https://github.com/microsoft/BitNet
I'm going to validate further, make sure any needed patches are applied to a fork, and work out a Rust variant as well.
There may have been issues with the initial quantization. I'm redownloading the source model and re-quantizing. I'll validate locally after quantization, then re-upload in place. I'll post here when this is done and make sure the README is updated with specific options for compatible runtimes.
The short answer is that my implementation of the b1.58 quantization led to catastrophic error compounding. I'm working to patch this out, and I'll link the relevant repo for the quantization solution once I've proven it out. I'll test on this model first and verify it's working, then proceed to smaller models of various types and parameter counts to validate that the approach is more universal. One of my key goals with AI projects is to make AI more efficient and to make larger, better-reasoning models usable by people who lack access to enterprise hardware. Consumer and prosumer cards are my targets. I have a 3090 Ti and a 5080 to use for this, but if any issues are encountered with other cards, debug data can help me patch in support for them as well.
I'll follow up as I go along.
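For anyone wondering what "error compounding" looks like in practice, here is a minimal sketch. It is not the quantizer used for this upload. It applies the absmean ternary recipe described for BitNet b1.58 (scale by the mean absolute weight, round, clip to -1..1) after the fact to a small random network, and tracks how far the quantized activations drift from the full-precision ones layer by layer. The layer count, width, tanh nonlinearity, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def absmean_ternary(weights):
    """Quantize a 2-D list of floats to {-1, 0, +1} with one per-tensor scale.

    scale = mean(|W|), W_q = clip(round(W / scale), -1, 1)
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + 1e-8  # epsilon avoids division by zero
    quantized = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return quantized, scale


def matvec(matrix, vector, scale=1.0):
    """Matrix-vector product, optionally rescaled (used to dequantize)."""
    return [scale * sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relative_error(reference, approx):
    """L2 distance between two vectors, relative to the reference norm."""
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den


def layerwise_error(num_layers=6, dim=32, seed=0):
    """Run the same input through full-precision and ternary copies of a toy
    network and return the relative activation error after each layer."""
    rng = random.Random(seed)
    x_ref = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x_q = list(x_ref)
    errors = []
    for _ in range(num_layers):
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
        w_q, scale = absmean_ternary(w)
        x_ref = [math.tanh(v) for v in matvec(w, x_ref)]
        # The quantized path consumes its own (already perturbed) activations,
        # so each layer's rounding error is added on top of earlier drift.
        x_q = [math.tanh(v) for v in matvec(w_q, x_q, scale)]
        errors.append(relative_error(x_ref, x_q))
    return errors


if __name__ == "__main__":
    for depth, err in enumerate(layerwise_error(), start=1):
        print(f"layer {depth}: relative error {err:.3f}")
```

A toy like this says nothing about where the real 32B quantization went wrong; it only shows why per-layer rounding error in a model that was not trained with ternary weights can end up dominating the output.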
@tzervas I didn't even know it was possible to quantize to 1.5 bits. To my limited understanding, the original BitNet model was actually trained as a 1.5-bit network, not merely quantized; is that correct? But I guess there is no (larger) native 1.5-bit base model available yet, which presumably would be extremely expensive to train? (Not that this experiment is any less interesting, btw! The idea of running a model with all of its parameters on local hardware is definitely intriguing. 😄)
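On the "1.5 bits" figure: a ternary weight takes one of three values, so its information content is log2(3), roughly 1.58 bits, which is where the "1.58b" in the model name comes from. The sketch below does that arithmetic and a rough weight-storage estimate. The 32e9 parameter count is a nominal assumption taken from the "32b" in the model name, the "2-bit packing" row is a hypothetical simple packing rather than any specific file format, and the estimate ignores scales, embeddings, and any tensors kept at higher precision.

```python
import math


def bits_per_ternary_weight():
    """Information content of one weight drawn from {-1, 0, +1}."""
    return math.log2(3)


def weight_storage_gib(num_params, bits_per_weight):
    """Storage for num_params weights at the given bit width, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = 32e9  # assumption: nominal size implied by "32b" in the name

for label, bits in [
    ("ideal ternary", bits_per_ternary_weight()),
    ("2-bit packing", 2.0),   # hypothetical: one weight per 2-bit slot
    ("fp16", 16.0),
]:
    size = weight_storage_gib(NOMINAL_PARAMS, bits)
    print(f"{label:>14}: {bits:5.2f} bits/weight, {size:6.1f} GiB")
```

The gap between the fp16 row and the ternary rows is the reason this kind of experiment is attractive for consumer cards, whether the weights are trained as ternary or quantized afterwards.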
Hi, I'm getting this error when trying to build this model with the BitNet project. Here is the error:
$ python3 setup_env.py -md models/qwen2.5-coder-32b-bitnet-1.58b -q i2_s
Traceback (most recent call last):
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 247, in <module>
    main()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 223, in main
    gen_code()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 203, in gen_code
    raise NotImplementedError()
NotImplementedError
Can you give me some possible reasons for this and how to solve it? Thanks.