mobicham committed
Commit 453e34d
1 Parent(s): 991f032

Update README.md

Files changed (1): README.md +28 -0
---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---

## Llama-2-70b-hf-2bit_g16_s128-HQQ
This is a version of the Llama2-70B model quantized to 2-bit via Half-Quadratic Quantization (HQQ): https://mobiusml.github.io/hqq/

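HQQ itself solves for the quantization parameters with a half-quadratic optimizer (see the link above). Purely to illustrate what group-wise 2-bit quantization with group size 16 (the `g16` in the model name) means for the weights, here is a simplified round-to-nearest sketch in NumPy — an illustrative stand-in, not the actual HQQ algorithm:

``` Python
import numpy as np

def quantize_2bit_g16(w, group_size=16):
    """Round-to-nearest asymmetric 2-bit quantization per group of 16 weights.
    (Illustrative only -- HQQ optimizes the scale/zero-point instead of
    deriving them from the per-group min/max as done here.)"""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 3.0          # 2 bits -> 4 levels: 0..3
    scale[scale == 0] = 1.0                # avoid division by zero in flat groups
    q = np.clip(np.round((groups - w_min) / scale), 0, 3).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, zero):
    return (q.astype(np.float32) * scale + zero).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, scale, zero = quantize_2bit_g16(w)
w_hat = dequantize(q, scale, zero)
print("max |error|:", np.abs(w - w_hat).max())  # bounded by scale/2 per group
```

Each group of 16 weights is stored as 2-bit codes plus a small amount of per-group metadata; the real HQQ format additionally compresses that metadata (the `s128` in the name refers to its grouping).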
This model outperforms an fp16 Llama2-13B (perplexity 4.13 vs. 4.63) at a comparable size of ~26GB.

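The ~26GB figure is roughly what 2-bit storage plus per-group metadata would predict. A back-of-the-envelope estimate, assuming about two bytes of (quantized) scale and zero-point metadata per group of 16 weights — an assumption for illustration, not the exact HQQ storage format:

``` Python
params = 70e9                      # Llama2-70B parameter count (approximate)
weight_gb = params * 2 / 8 / 1e9   # 2 bits per weight            -> 17.5 GB
meta_gb = params / 16 * 2 / 1e9    # ~2 bytes per 16-weight group ->  8.75 GB
print(weight_gb + meta_gb)         # ~26.25 GB, in line with the ~26GB above
```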
To run the model, install the HQQ library from https://github.com/mobiusml/hqq/tree/main/code and use it as follows:
``` Python
from hqq.models.llama import LlamaHQQ
import transformers

model_id = 'mobiuslabsgmbh/Llama-2-70b-hf-2bit_g16_s128-HQQ'
# Load the tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Load the model
model = LlamaHQQ.from_quantized(model_id)
```

*Limitations*: <br>
- Only supports a single-GPU runtime.<br>
- Not compatible with HuggingFace's PEFT.<br>