---
license: mit
---

# Quantized BitNet-B1-58-3B

This repository contains a quantized version of the [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) model.

While the original repository showcases impressive validation results, it emulates BitNet's Linear layers, resulting in memory usage similar to fp16 models. By leveraging the QuantLinear module from [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ), this repository enables exporting and running a 2-bit quantized model.

The quantized model offers significant advantages in model size and memory consumption. With a model size of just 1 GB, the quantized 3B model can perform inference with a context size of 2048 while consuming only 4.5 GB of VRAM. Furthermore, since the weights used during execution are the same as in the original repository, the perplexity (PPL) results remain unchanged.

## Install

```
pip install -r requirements.txt
```

## Quantization

The quantized model is already provided in this repository. However, if you wish to quantize the model yourself, you can load it from 1bitLLM/bitnet_b1_58-3B and save the quantized version (2-bit) to ./bitnet_b1_58-3B_quantized by running the following command:

```
python quantization.py
```

## Evaluation

```
python eval_ppl.py --hf_path ./ --seqlen 2048 --max_dataset_size 1000
```

```
python eval_task.py --hf_path ./ \
    --batch_size 1 \
    --tasks \
    --output_path result.json \
    --num_fewshot 0 \
    --ctx_size 2048
```
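
## Inference

The snippet below is a minimal inference sketch, not an official example from this repository: it assumes the quantized checkpoint in `./` ships custom modeling code for the 2-bit QuantLinear layers that can be loaded through the standard `transformers` API with `trust_remote_code=True`.

```python
# Minimal inference sketch (assumption: the quantized checkpoint in "./" provides
# custom modeling code loadable via trust_remote_code=True).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./"  # path to the quantized model in this repository

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).cuda()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```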