---
license: mit
datasets:
- wikipedia
---
# BitLinear-phi-1.5

BitLinear-phi-1.5 is a model trained partially using the method described in [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764).

### Notice: Our BitLinear layer applies 1-bit quantization to the weights only
### Other components from the paper (RMSNorm, activation quantization) are discarded.
The idea behind this: the paper's major contribution is a valid binary weight quantization scheme. We do not want to mix it with other components, because that would make the main contribution harder to evaluate.
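
As a rough illustration of weight-only quantization, the sketch below assumes the absmean scheme from the cited paper (weights scaled by their mean absolute value, then rounded and clipped to -1, 0, 1) and a straight-through estimator for the gradients. `quantize_weight` and `SketchBitLinear` are hypothetical names chosen for this example; this is not the BitLinear implementation used for this model, which may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def quantize_weight(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Assumed absmean quantization: scale, round, clip to {-1, 0, 1}, rescale."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: the forward pass sees w_q,
    # the backward pass treats the quantization as identity.
    return w + (w_q - w).detach()


class SketchBitLinear(nn.Linear):
    """Hypothetical layer: quantizes weights only, activations stay untouched."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, quantize_weight(self.weight), self.bias)


layer = SketchBitLinear(8, 4)
y = layer(torch.randn(2, 8))  # output shape: (2, 4)
```

Because the full-precision weights are kept as the trainable parameters, the optimizer still updates them normally; only the forward pass uses the quantized values.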

The model structure is from [phi-1.5](https://huggingface.co/microsoft/phi-1_5), with all linear layers except lm_head replaced with our custom BitLinear layer.
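
The replacement step can be sketched as a recursive walk over the module tree. The helper below is a hypothetical illustration, not the `replace_linear_in_hf` function from the repository: `replace_linear`, `MarkerLinear`, and `ToyLM` are names invented for this example, and `MarkerLinear` is only a stand-in for a quantized layer.

```python
import torch.nn as nn


class MarkerLinear(nn.Linear):
    """Stand-in for a quantized layer; behaves exactly like nn.Linear."""


def replace_linear(module: nn.Module, new_cls=MarkerLinear, skip=("lm_head",)) -> nn.Module:
    """Swap every plain nn.Linear for new_cls, leaving modules named in skip alone."""
    for name, child in list(module.named_children()):
        if name in skip:
            continue
        if type(child) is nn.Linear:
            new = new_cls(child.in_features, child.out_features,
                          bias=child.bias is not None)
            new.load_state_dict(child.state_dict())  # keep the existing parameters
            setattr(module, name, new)
        else:
            replace_linear(child, new_cls, skip)
    return module


class ToyLM(nn.Module):
    """Tiny toy model with an lm_head, used only to exercise the helper."""

    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))
        self.lm_head = nn.Linear(8, 32)


toy = replace_linear(ToyLM())
print(toy)  # block.0 and block.2 become MarkerLinear, lm_head stays nn.Linear
```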

It was trained on a small subset of the [wikipedia dataset](https://huggingface.co/datasets/wikipedia), for research validation purposes only.


```python
from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.en")
dataset = dataset['train'].select(range(int(1e5)))
```
Please note that the kernel is not yet optimized for 1-bit matrices.

The model was trained on a 3090 (24GB) for 16 hours.

## For faster (3x) inference, check https://github.com/Mrw33554432/Bitlinear4HF and install the custom kernel
## For training code, check https://github.com/Mrw33554432/Bitlinear4HF

The training code should be compatible with most of the LLMs on Hugging Face.

Starting training from the pretrained weights of a normal (non-BitLinear) model will not work, because the gradients explode.

## Sample inference code (slow)


```python
import torch
# replace_hf is assumed to be the module shipped in the Bitlinear4HF repository
from replace_hf import replace_linear_in_hf
from transformers import AutoModelForCausalLM, AutoTokenizer


def quick_test(model, tokenizer, prompt: str):
    # Encode the inputs
    inputs = tokenizer.encode(prompt, return_tensors="pt")

    # Generate outputs
    outputs = model.generate(inputs, max_length=64)

    # Decode and print the outputs
    print(tokenizer.decode(outputs[0]))


torch.set_default_device("cuda")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Mrw33554432/bitLinear-phi-1.5", trust_remote_code=True, torch_dtype=torch.float16)

print(model)
# Replace Linear layers with BitLinear
replace_linear_in_hf(model, keep_param=True)
print(model)

quick_test(model, tokenizer, prompt="Tom is the")
```