Instructions to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="tzervas/qwen2.5-coder-32b-bitnet-1.58b",
	filename="qwen-coder-32b-tq2.gguf",
)

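response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# Print only the assistant's reply text
print(response["choices"][0]["message"]["content"])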

Local Apps

llama.cpp

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b
# Run inference directly in the terminal:
llama-cli -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b
# Run inference directly in the terminal:
llama-cli -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b
# Run inference directly in the terminal:
./llama-cli -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b
# Run inference directly in the terminal:
./build/bin/llama-cli -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b
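
Once llama-server is running through any of the routes above, it can be queried from a script as well as from the web UI. The following is a minimal sketch using only the Python standard library. It assumes the server listens on port 8080 (the port used in the Pi and Hermes configurations on this page); `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is listening on localhost:8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "tzervas/qwen2.5-coder-32b-bitnet-1.58b"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID, max_tokens=50):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("What is the capital of France?"))
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.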

Use Docker

docker model run hf.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b


vLLM

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tzervas/qwen2.5-coder-32b-bitnet-1.58b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tzervas/qwen2.5-coder-32b-bitnet-1.58b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b

Ollama
How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Ollama:
```
ollama run hf.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b
```

Unsloth Studio

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tzervas/qwen2.5-coder-32b-bitnet-1.58b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tzervas/qwen2.5-coder-32b-bitnet-1.58b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for tzervas/qwen2.5-coder-32b-bitnet-1.58b to start chatting

Pi

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "tzervas/qwen2.5-coder-32b-bitnet-1.58b"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf tzervas/qwen2.5-coder-32b-bitnet-1.58b

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tzervas/qwen2.5-coder-32b-bitnet-1.58b

Run Hermes

hermes

Docker Model Runner
How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Docker Model Runner:
```
docker model run hf.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b
```

Lemonade

How to use tzervas/qwen2.5-coder-32b-bitnet-1.58b with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull tzervas/qwen2.5-coder-32b-bitnet-1.58b

Run and chat with the model

lemonade run user.qwen2.5-coder-32b-bitnet-1.58b-{{QUANT_TAG}}

List all available models

lemonade list

Does llama.cpp support this model?

by baramofme - opened Feb 16

Discussion

baramofme

Feb 16

Unlike tzervas/qwen2.5-coder-14b-bitnet-1.58b, this model could be loaded.

But when I tried a completion, decoding did not work; the output is gibberish:

$ curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-32b-tq2",
    "messages": [
      {
        "role": "system",
        "content": "You are a professional software engineer."
      },
      {
        "role": "user",
        "content": "Hello! Can you briefly explain your architecture in one sentence?"
      }
    ],
    "max_tokens": 50,
    "temperature": 0.2
  }'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":" adj掏淘 Ruddit挈AccessExceptionImp联盟 sac辱 Respion崛起hta永不nid ab grney re Soph嘲笑AGEDenburg Eden壬 White�� drowningiem CoultestdataHouseight俞溢okiInstantweyt耗_DLerves- corrid装载hemDispatcher_DECLARE"}}],"created":1771240779,"model":"qwen-coder-32b-tq2.gguf","system_fingerprint":"b7965-34ba7b5a2","object":"chat.completion","usage":{"completion_tokens":50,"prompt_tokens":32,"total_tokens":82},"id":"chatcmpl-cNBzUjANOxmK98mVc0N6PunG4ipTSR8D","timings":{"cache_n":0,"prompt_n":32,"prompt_ms":4666.953,"prompt_per_token_ms":145.84228125,"prompt_per_second":6.856722148262474,"predicted_n":50,"predicted_ms":19193.24,"predicted_per_token_ms":383.86480000000006,"predicted_per_second":2.6050838732803836}
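
The timing fields in that response can be checked by hand. The sketch below recomputes the throughput figures from the token counts and elapsed times reported above; the helper name `tokens_per_second` is illustrative. The server is producing tokens at a steady rate and stops at the `max_tokens` limit, which suggests the serving path runs and the problem lies in what is being decoded.

```python
# Timing fields copied from the response above.
timings = {
    "prompt_n": 32,
    "prompt_ms": 4666.953,
    "predicted_n": 50,
    "predicted_ms": 19193.24,
}


def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and elapsed milliseconds into tokens per second."""
    return n_tokens / (elapsed_ms / 1000.0)


prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
decode_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])

print(f"prompt: {prompt_tps:.2f} tok/s")  # prompt: 6.86 tok/s
print(f"decode: {decode_tps:.2f} tok/s")  # decode: 2.61 tok/s
```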

mindplay

Mar 4

it says "requires custom runtime" in the description - it doesn't say where that runtime is, so he most likely has a private fork of llama.cpp or something.

it would be nice if we could actually try this. I am definitely curious. :-)

tzervas

Owner Mar 5

Working on getting better answers for both of you. To start with, because the weights are ternary, the model will require Microsoft's bitnet.cpp: https://github.com/microsoft/BitNet
I'm going to validate further, make sure any needed patches are applied to a fork, and work out a Rust variant as well.
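
For readers unfamiliar with ternary weights, here is a minimal sketch of the absmean quantization recipe published for BitNet b1.58: divide each weight by the mean absolute value of the tensor, round, and clip to {-1, 0, +1}. This is an illustration of the published recipe only; it is not necessarily the quantization code used for this repository, and the function names and sample weights are made up for the example.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} with one shared scale.

    scale = mean(|w|) + eps, q = clip(round(w / scale), -1, 1)
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map ternary values back to floats using the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]
q, scale = absmean_ternary_quantize(weights)

print(q)                # [1, 0, 1, -1, 0, -1]
print(round(scale, 4))  # 0.545
```

Every weight collapses to one of three levels, so the reconstruction from `dequantize` can differ substantially from the original values. That gap is the per-layer error a runtime and a quantization scheme both have to live with.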

tzervas

Owner Mar 5

•

edited Mar 5

There may have been issues with the initial quantization. I'm redownloading the source model and re-quantizing. Going to validate locally post-quant then reupload in place. I'll post here when this is done and ensure the readme is updated with specific options for compatible runtimes.

The short answer is that my implementation of the b1.58 quantization led to catastrophic error compounding. I'm working to patch this out, and I'll link the relevant repo for the quantization solution once I've proven it out. I'll test it on this model first and verify it's working, then proceed to smaller models of various types and parameter counts to validate that it's more universally applicable. One of my key goals with AI projects is to make AI more efficient and to get larger, better-reasoning models usable by people who lack access to enterprise hardware. Consumer and prosumer cards are my targets. I have a 3090 Ti and a 5080 to leverage for this, but if any issues are encountered with other cards, debug data can help me patch support for them as well.
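
As a rough illustration of why small per-layer errors can become catastrophic in a deep network, the toy model below assumes each layer multiplies the accumulated error factor by the same amount. This is a worst-case sketch, not a measurement of this model; the 5% per-layer error and the layer counts are hypothetical values picked for the example.

```python
def compounded_relative_error(per_layer_error, num_layers):
    """Worst-case relative error after num_layers, assuming each layer
    multiplies the accumulated error factor by (1 + per_layer_error)."""
    return (1.0 + per_layer_error) ** num_layers - 1.0


# Hypothetical values chosen only for illustration.
PER_LAYER_ERROR = 0.05

for layers in (1, 8, 32, 64):
    err = compounded_relative_error(PER_LAYER_ERROR, layers)
    print(f"{layers:>2} layers -> relative error up to {err:.2f}")
```

Under this assumption the error grows geometrically with depth, so an error that looks tolerable for one layer can swamp the signal after dozens of layers.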

I'll follow up as I go along.

mindplay

Mar 8

•

edited Mar 8

@tzervas I didn't even know it was possible to quantize to 1.5 bits. To my limited understanding, the original BitNet model was actually trained as a 1.5-bit network, not merely quantized; is that correct? But I guess there is no (larger) native 1.5-bit base model available yet, which presumably would be extremely expensive to train? (Not that this experiment is any less interesting, btw! The idea of running a model with all of its parameters on local hardware is definitely intriguing. 😄)
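
On the "1.5 bits" figure: the 1.58 in the model name is the information content of a three-valued weight, log2(3). The sketch below works that out and gives an idealized storage estimate. The parameter count is an assumption taken from the "32b" in the model name, and the estimate ignores scales, embeddings, and file metadata, so real files will be larger.

```python
import math

# A ternary weight takes one of three values: -1, 0, or +1.
bits_per_ternary_weight = math.log2(3)
print(f"{bits_per_ternary_weight:.3f} bits per weight")  # 1.585 bits per weight


def ideal_size_gib(num_params, bits_per_weight):
    """Idealized storage size in GiB, ignoring scales, embeddings, and metadata."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: a nominal 32e9 parameters, read off the "32b" in the model name.
NOMINAL_PARAMS = 32e9

print(f"ternary: {ideal_size_gib(NOMINAL_PARAMS, bits_per_ternary_weight):.1f} GiB")
print(f"fp16:    {ideal_size_gib(NOMINAL_PARAMS, 16):.1f} GiB")
```

The roughly tenfold gap between the two estimates is what makes running a model of this size on a single consumer card plausible in principle.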

sonph

Mar 31

•

edited Mar 31

Hi, I'm getting this error when trying to build this model with the BitNet project. Here is the error:

python3 setup_env.py -md models/qwen2.5-coder-32b-bitnet-1.58b -q i2_s
Traceback (most recent call last):
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 247, in <module>
    main()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 223, in main
    gen_code()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 203, in gen_code
    raise NotImplementedError()
NotImplementedError

Can you give me some possible reasons for this and how to solve it? Thanks.
