---
title: README
emoji: π
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
license: mit
short_description: Compressed Large Language Models
---
# Compressed Large Language Models

This repo contains compressed LLMs used in the [Decoding Compressed Trust](https://decoding-comp-trust.github.io/) project.
The models are prepared by the [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and the
[Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).

License: [MIT License](https://opensource.org/license/mit/)
At a glance:

* Models: Llama-2 13B, Llama-2 Chat 13B, Vicuna 13B v1.3
* Compression methods:
  - Pruning: magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3, 4, 8 bits)
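
The usage snippets below each load one specific checkpoint. As a rough guide to how the pieces above fit together, here is a minimal sketch of the assumed repo/revision naming for pruned models (inferred from the pruning example below: repos follow `compressed-llm/{base_model}_{method}` with one revision per sparsity level); the exact repo and revision names on the hub may differ, so verify them before use.

```python
# Sketch only: assumed naming scheme inferred from the examples in this README.
base_model = 'llama-2-13b'                 # also e.g. 'llama-2-13b-chat', 'vicuna-13b-v1.3' (assumed IDs)
pruning_method = 'magnitude_unstructured'  # Wanda / SparseGPT variants use analogous names (assumed)
sparsity = 0.2                             # revisions follow the 's{sparsity}' pattern used below

repo_id = f'compressed-llm/{base_model}_{pruning_method}'
revision = f's{sparsity}'
print(repo_id, revision)  # e.g. compressed-llm/llama-2-13b_magnitude_unstructured s0.2
```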
## Setup environment
```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # for gptq
```
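
A quick check that the pinned versions installed correctly and that a GPU is visible (a minimal sanity-check sketch, not part of the original instructions):

```python
# Environment sanity check (optional).
import torch
import transformers

print(torch.__version__)          # expected: 2.0.0+cu117
print(transformers.__version__)   # expected: 4.31.0
print(torch.cuda.is_available())  # should be True for GPU inference
```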
## How to use models
### How to use pruned models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2  # target sparsity; the matching checkpoint is selected via the revision below
model_path = f'compressed-llm/{base_model}_{comp_method}'

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
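
To confirm that the pruned checkpoint actually carries the requested sparsity, you can count zero-valued weights after loading. This is a rough sketch using the `model` loaded above; layers such as embeddings and norms are typically left dense, so the measured ratio may not exactly match `comp_degree`.

```python
# Rough sparsity check for the pruned model loaded above (sketch only).
zeros, total = 0, 0
for name, param in model.named_parameters():
    if param.dim() == 2:  # weight matrices only
        zeros += (param == 0).sum().item()
        total += param.numel()
print(f'weight sparsity: {zeros / total:.2%}')  # roughly comp_degree for unstructured pruning
```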
### How to use Wanda + GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this, or disable_exllama below
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
### How to use GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this, or disable_exllama below
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # bit width and group size are selected via the revision
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
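
Different GPTQ bit widths are stored as separate revisions of the same repo; the example above loads `2bit_128g`. Below is a minimal sketch for switching bit width, assuming the other revisions follow the same `{bits}bit_128g` pattern (an assumption; check the revisions actually available on the hub):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumed revision name following the '2bit_128g' pattern above; verify on the hub.
revision = '4bit_128g'

model = AutoGPTQForCausalLM.from_quantized(
    'compressed-llm/vicuna-7b-v1.3_gptq',
    disable_exllama=True,
    device_map='auto',
    revision=revision,
)
tokenizer = AutoTokenizer.from_pretrained('lmsys/vicuna-7b-v1.3', trust_remote_code=True)
```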
## Citations

If you use the models in this hub, please consider citing our papers.
```bibtex
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
          and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
          and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
          and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
```
Some of the models were used in previous publications.

```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```
## Acknowledgement

Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help with preparing the models.

For any questions, please contact [Junyuan Hong](mailto:jyhong@utexas.edu).