Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
@@ -37,16 +37,43 @@ outputs = model.generate(input_ids, max_new_tokens=128)
|
|
37 |
print(tokenizer.decode(outputs[0]))
|
38 |
```
|
39 |
|
40 |
-
How to use
|
41 |
```python
|
42 |
-
from
|
|
|
43 |
model_path = 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'
|
|
|
44 |
model = AutoGPTQForCausalLM.from_quantized(
|
45 |
model_path,
|
46 |
# inject_fused_attention=False, # or
|
47 |
disable_exllama=True,
|
48 |
device_map='auto',
|
|
|
49 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
50 |
```
|
51 |
|
52 |
|
|
|
37 |
print(tokenizer.decode(outputs[0]))
|
38 |
```
|
39 |
|
40 |
+
How to use wanda+gptq models
|
41 |
```python
|
42 |
+
from transformers import AutoTokenizer
|
43 |
+
from auto_gptq import AutoGPTQForCausalLM
|
44 |
model_path = 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'
|
45 |
+
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
|
46 |
model = AutoGPTQForCausalLM.from_quantized(
|
47 |
model_path,
|
48 |
# inject_fused_attention=False, # or
|
49 |
disable_exllama=True,
|
50 |
device_map='auto',
|
51 |
+
revision='4bit_128g',
|
52 |
)
|
53 |
+
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
|
54 |
+
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
|
55 |
+
outputs = model.generate(input_ids=input_ids, max_length=128)
|
56 |
+
tokenizer.decode(outputs[0])
|
57 |
+
```
|
58 |
+
|
59 |
+
How to use gptq models
|
60 |
+
```python
|
61 |
+
from transformers import AutoTokenizer
|
62 |
+
from auto_gptq import AutoGPTQForCausalLM
|
63 |
+
model_path = 'vita-group/vicuna-7b-v1.3_gptq'
|
64 |
+
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
|
65 |
+
revision = '2bit_128g'
|
66 |
+
model = AutoGPTQForCausalLM.from_quantized(
|
67 |
+
model_path,
|
68 |
+
# inject_fused_attention=False, # or
|
69 |
+
disable_exllama=True,
|
70 |
+
device_map='auto',
|
71 |
+
revision=revision,
|
72 |
+
)
|
73 |
+
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
|
74 |
+
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
|
75 |
+
outputs = model.generate(input_ids=input_ids, max_length=128)
|
76 |
+
tokenizer.decode(outputs[0])
|
77 |
```
|
78 |
|
79 |
|