---
title: README
emoji: 🐇
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
license: mit
short_description: Compressed Large Language Models
---

# Compressed Large Language Models

This repo contains the compressed LLMs used in the [Decoding Compressed Trust](https://decoding-comp-trust.github.io/) project. The models were prepared by the [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and the [Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).

License: [MIT License](https://opensource.org/license/mit/)

Simplified lists:

* Models: Llama-2 13B, Llama-2 Chat 13B, Vicuna 13B v1.3
* Compression methods:
  - Pruning: magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3, 4, 8 bits)

## Setup environment

```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # needed for GPTQ models
```

## How to use models

### Pruned models

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2
model_path = f'compressed-llm/{base_model}_{comp_method}'

# Each sparsity level is stored as a separate revision, e.g. 's0.2' for 20% sparsity.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

### Wanda + GPTQ models

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # Uncomment if your GPU or auto-gptq build lacks the fused-attention / exllama kernels:
    # inject_fused_attention=False,  # or
    # disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```

### GPTQ models

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # Uncomment if your GPU or auto-gptq build lacks the fused-attention / exllama kernels:
    # inject_fused_attention=False,  # or
    # disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # bit-width and group size are selected via the revision
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
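### AWQ models

AWQ is listed among the compression methods above but has no loading snippet here. Below is a minimal sketch using the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) library (`pip install autoawq`); the repo id `compressed-llm/vicuna-7b-v1.3_awq` is an assumption for illustration, and the checkpoint layout is not verified against this hub, so follow the instructions on the specific model card if they differ.

```python
# Hypothetical sketch: the repo id below is assumed, not confirmed for this hub.
from awq import AutoAWQForCausalLM  # pip install autoawq
from transformers import AutoTokenizer

model_path = 'compressed-llm/vicuna-7b-v1.3_awq'  # assumed repo id; check the model cards
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

# Load the AWQ-quantized weights; fuse_layers speeds up inference on supported GPUs.
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```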
## Citations

If you are using models in this hub, please consider citing our papers.

```bibtex
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
```

Some of the models were used in previous publications.

```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}

@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```

## Acknowledgement

Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help with some of the preparations.

For any questions, please contact [Junyuan Hong](mailto:jyhong@utexas.edu).