abhinavkulkarni committed on
Commit a3fd730
1 Parent(s): 3c9ea33

Update README.md

Files changed (1)
  1. README.md +133 -0
README.md CHANGED

---
license: cc-by-sa-3.0
tags:
- MosaicML
- AWQ
inference: false
---

# MPT-7B-Chat (4-bit 128g AWQ Quantized)
[MPT-7B-Chat](https://huggingface.co/mosaicml/mpt-7b-chat) is a chatbot-like model for dialogue generation.

This is a 4-bit, 128 group size AWQ quantized version of MPT-7B-Chat. For more information about AWQ quantization, see the [llm-awq repository](https://github.com/mit-han-lab/llm-awq).
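
To give a rough sense of what "4-bit, 128 group size" means, the sketch below quantizes a weight matrix group-wise with an asymmetric zero point. This is a simplified illustration only, not the actual AWQ implementation, which additionally protects salient weight channels using activation statistics before quantizing.

```python
import torch

def quantize_groupwise(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Toy asymmetric (zero-point) group-wise quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` values; every group gets its own
    scale and zero point, and values are rounded to n_bit-wide integers. This only
    illustrates the storage format, not the AWQ algorithm itself.
    """
    out_features, in_features = w.shape
    w = w.reshape(out_features, in_features // group_size, group_size)
    w_max = w.amax(dim=-1, keepdim=True)
    w_min = w.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bit - 1)
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, 2 ** n_bit - 1)  # 4-bit integer codes
    w_hat = (q - zero) * scale                                # what the kernel reconstructs
    return w_hat.reshape(out_features, in_features)

w = torch.randn(4096, 4096)
print("mean abs error:", (w - quantize_groupwise(w)).abs().mean().item())
```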

## Model Date

July 5, 2023

## Model License

Please refer to the original MPT-7B-Chat model license ([link](https://huggingface.co/mosaicml/mpt-7b-chat)).

Please refer to the AWQ quantization license ([link](https://github.com/mit-han-lab/llm-awq/blob/main/LICENSE)).

## CUDA Version

This model was successfully tested on CUDA driver v12.1 and toolkit v11.7 with Python v3.10.11.

## How to Use

```bash
git clone https://github.com/mit-han-lab/llm-awq \
&& cd llm-awq \
&& git checkout 71d8e68df78de6c0c817b029a568c064bf22132d \
&& pip install -e . \
&& cd awq/kernels \
&& python setup.py install
```
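
After the build finishes, it is worth confirming that the package and its CUDA kernels import cleanly. The extension name `awq_inference_engine` below is an assumption based on the kernel setup in the llm-awq repository at the commit above and may differ in other revisions:

```python
# Sanity check: both imports should succeed if the editable install and the
# CUDA kernel build (awq/kernels) completed without errors.
import awq_inference_engine  # CUDA kernels (module name assumed from awq/kernels setup)
from awq.quantize.quantizer import real_quantize_model_weight

print("llm-awq and its CUDA kernels imported successfully")
```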

```python
import torch
from awq.quantize.quantizer import real_quantize_model_weight
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import hf_hub_download

model_name = "mosaicml/mpt-7b-chat"

# Config
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name)

# Model: 4-bit weights with a quantization group size of 128
w_bit = 4
q_config = {
    "zero_point": True,
    "q_group_size": 128,
}

# Download the pre-quantized checkpoint from the Hub
load_quant = hf_hub_download('abhinavkulkarni/mpt-7b-chat-w4-g128-awq', 'pytorch_model.bin')

# Instantiate the model without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config,
                                                 torch_dtype=torch.float16, trust_remote_code=True)

# Swap in AWQ quantized linear layers (structure only; weights are loaded next)
real_quantize_model_weight(model, w_bit=w_bit, q_config=q_config, init_only=True)

# Load the quantized checkpoint and dispatch layers across available devices
model = load_checkpoint_and_dispatch(model, load_quant, device_map="balanced")

# Inference
prompt = f'''What is the difference between nuclear fusion and fission?
###Response:'''

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output = model.generate(
    inputs=input_ids,
    temperature=0.7,
    max_new_tokens=512,
    top_p=0.15,
    top_k=0,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
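
As a quick way to see the memory savings from 4-bit weights, you can inspect PyTorch's allocator statistics right after the checkpoint is loaded (before generation); a minimal sketch:

```python
import torch

# Memory currently allocated by PyTorch on the default GPU; the 4-bit model
# should occupy roughly a quarter of MPT-7B's fp16 weight footprint.
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
```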

## Evaluation

This evaluation was done using [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness).

[MPT-7B-Chat](https://huggingface.co/mosaicml/mpt-7b-chat)

| Task   |Version| Metric        | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|13.5936|   |      |
|        |       |byte_perplexity| 1.6291|   |      |
|        |       |bits_per_byte  | 0.7040|   |      |

[MPT-7B-Chat (4-bit 128-group AWQ)](https://huggingface.co/abhinavkulkarni/mpt-7b-chat-w4-g128-awq)

| Task   |Version| Metric        | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|14.0922|   |      |
|        |       |byte_perplexity| 1.6401|   |      |
|        |       |bits_per_byte  | 0.7138|   |      |
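
For reference, the wikitext metrics above are related: `bits_per_byte` is the base-2 logarithm of `byte_perplexity`, so the quantized model's small perplexity increase corresponds directly to the small bits-per-byte increase:

```python
import math

# bits_per_byte = log2(byte_perplexity) for the wikitext task
for name, byte_ppl in [("MPT-7B-Chat (fp16)", 1.6291), ("MPT-7B-Chat (4-bit AWQ)", 1.6401)]:
    print(f"{name}: {math.log2(byte_ppl):.4f} bits per byte")
# Prints ~0.7041 and ~0.7138, matching the tables above up to rounding.
```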

## Acknowledgements

The MPT model was originally finetuned by Sam Havens and the MosaicML NLP team. Please cite this model using the following format:

```
@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year      = {2023},
    url       = {www.mosaicml.com/blog/mpt-7b},
    note      = {Accessed: 2023-03-28}, % change this date
    urldate   = {2023-03-28} % change this date
}
```

The model was quantized with the AWQ technique. If you find AWQ useful or relevant to your research, please kindly cite the paper:

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```