Qwen2.5-7B-trit-uniform-d2
Balanced ternary quantization of Qwen/Qwen2.5-7B at depth d=2 (9 levels per weight, 3.47 bits per weight).
Produced with the codec from "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026). See Entrit/tritllm-codec for the codec source.
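As a point of orientation, the depth appears to count balanced ternary digits (trits) per weight: two trits, each in {-1, 0, +1}, give 3^2 = 9 integer levels from -4 to 4. The sketch below illustrates only the standard balanced ternary numeral system. The function names are hypothetical, and the way the tritllm codec orders or packs its trits is an assumption not confirmed by this card.

```python
from itertools import product


def trits_to_level(trits):
    """Interpret balanced ternary digits (most significant first) as an integer."""
    value = 0
    for t in trits:
        if t not in (-1, 0, 1):
            raise ValueError("each trit must be -1, 0, or 1")
        value = 3 * value + t
    return value


def level_to_trits(level, depth):
    """Inverse mapping: integer level -> list of `depth` trits, most significant first."""
    half = (3 ** depth - 1) // 2
    if not -half <= level <= half:
        raise ValueError(f"level must lie in [{-half}, {half}] for depth {depth}")
    digits = []
    for _ in range(depth):
        r = level % 3          # Python's % always returns 0, 1, or 2 here
        if r == 2:
            r = -1             # balanced ternary uses -1 instead of 2
        digits.append(r)
        level = (level - r) // 3
    return digits[::-1]


# Enumerate every level reachable with depth d=2 (two trits per weight).
levels = sorted(trits_to_level(p) for p in product((-1, 0, 1), repeat=2))
print(levels)                # [-4, -3, -2, -1, 0, 1, 2, 3, 4]
print(level_to_trits(2, 2))  # [1, -1], since 3*1 + (-1) = 2
```

The symmetric range around zero is what distinguishes balanced ternary from an unsigned base-3 code: no separate sign bit or zero-point offset is needed to represent negative weights.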
Quick load
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Entrit/Qwen2.5-7B-trit-uniform-d2")
tokenizer = AutoTokenizer.from_pretrained("Entrit/Qwen2.5-7B-trit-uniform-d2")
```
The weights are stored dequantized to FP16 so that the checkpoint loads with stock Transformers; the on-disk size is therefore the same as the FP16 source. The 3.47 bits-per-weight (bpw) figure refers to the information content of the quantized matrices, and it is what matters for inference on hardware that consumes the packed trit format directly (see Entrit/tritllm-kernel).
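To make the difference between the two figures concrete, the following sketch compares the storage needed for the same number of weights at 16 bits and at 3.47 bits each. The weight count is a hypothetical round number chosen for illustration, not a measurement of this model, and `storage_gib` is a helper name introduced here.

```python
def storage_gib(num_weights, bits_per_weight):
    """Storage in GiB for `num_weights` values at `bits_per_weight` bits each."""
    return num_weights * bits_per_weight / 8 / 2**30


# Hypothetical count of quantized weights (illustrative only).
n = 1_000_000_000

fp16_gib = storage_gib(n, 16)      # what the dequantized checkpoint stores
packed_gib = storage_gib(n, 3.47)  # what a packed trit format would need
ratio = fp16_gib / packed_gib

print(f"FP16:   {fp16_gib:.2f} GiB")
print(f"Packed: {packed_gib:.2f} GiB")
print(f"Ratio:  {ratio:.2f}x")
```

The ratio applies only to the quantized matrices. Because lm_head, the token embeddings, and the norm layers stay in FP16, the whole-model saving in a packed format would be smaller than this per-matrix ratio.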
Quantization details
| Field | Value |
|---|---|
| Source model | Qwen/Qwen2.5-7B |
| Depth | d=2 (9 levels) |
| Bits per weight | 3.47 |
| Group size | 16 |
| Scale codebook | 27-entry log-spaced (scale_depth=3) |
| Method | Uniform PTQ |
| Quantized layers | all 2D linear matrices |
| Kept FP16 | lm_head, token embeddings, all *_norm layers |
| Codec | tritllm v2 |
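The 3.47 bpw figure is consistent with a simple accounting of the table values: log2(9) bits for each weight plus one 27-entry scale index shared by every group of 16 weights. The sketch below assumes exactly that (one scale index per group, no further metadata); the card does not spell out the accounting, so treat it as one plausible reading rather than the codec's definition.

```python
import math


def bits_per_weight(depth, group_size, scale_depth):
    """Information content per weight under the stated assumptions.

    depth       -- trits per weight (3**depth levels)
    group_size  -- weights sharing one scale
    scale_depth -- trits per scale index (3**scale_depth codebook entries)
    """
    weight_bits = depth * math.log2(3)       # log2(3**depth)
    scale_bits = scale_depth * math.log2(3)  # log2(3**scale_depth)
    return weight_bits + scale_bits / group_size


bpw = bits_per_weight(depth=2, group_size=16, scale_depth=3)
print(round(bpw, 2))  # 3.47
```

Under this reading, about 3.17 bits come from the weight levels themselves and about 0.30 bits from the amortized scale index.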
Citation
```bibtex
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = 2026,
  note   = {Entrit Systems}
}
```
Reproducibility
```shell
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
python quantize_model_v2.py --model Qwen/Qwen2.5-7B --configs uniform-d2 --out ./out
```