Compressed Large Language Models
This repo contains the compressed LLMs used in the Decoding Compressed Trust project. The models were prepared by the Visual Informatics Group at the University of Texas at Austin (VITA-group) and the Center for Applied Scientific Computing at LLNL.
License: MIT License
Simplified lists:
- Models: Llama-2 13b, Llama-2 chat 13b, Vicuna 13b v1.3
- Compression methods:
  - Pruning: Magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3, 4, 8 bits)
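The 2:4 semi-structured pattern mentioned above means that every group of 4 consecutive weights keeps at most 2 non-zero values. The following is a minimal sketch in plain Python for checking that property on one row of weights. The function name `satisfies_n_m_sparsity` and the sample rows are hypothetical and are not part of this repo; with real checkpoints one would apply the same check to each row of a weight matrix.

```python
from typing import Sequence


def satisfies_n_m_sparsity(row: Sequence[float], n: int = 2, m: int = 4) -> bool:
    """Return True if every group of m consecutive values has at least n zeros.

    Hypothetical helper for illustration; it is not shipped with this repo.
    """
    if len(row) % m != 0:
        raise ValueError(f"row length {len(row)} is not a multiple of m={m}")
    for start in range(0, len(row), m):
        group = row[start:start + m]
        zeros = sum(1 for w in group if w == 0)
        if zeros < n:
            return False
    return True


# Made-up rows: the first follows the 2:4 pattern, the second does not
# (its first group of 4 contains only one zero).
pruned_row = [0.0, 0.7, 0.0, -1.2, 0.3, 0.0, 0.0, 0.9]
dense_row = [0.1, 0.7, 0.0, -1.2, 0.3, 0.0, 0.0, 0.9]

print(satisfies_n_m_sparsity(pruned_row))  # True
print(satisfies_n_m_sparsity(dense_row))   # False
```

A row that passes this check has a sparsity of at least 50%, which is why 2:4 pruning is often described as 50% semi-structured sparsity.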
Setup environment
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq # for gptq
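To confirm that the pinned versions above are the ones actually installed, a small check like the one below may help. It is only a sketch: the helper name `check_versions` is hypothetical, the pins are copied from the pip commands above, and local build tags such as `+cu117` are ignored when comparing.

```python
from importlib import metadata

# Pins copied from the pip commands above.
EXPECTED = {
    "torch": "2.0.0",
    "torchvision": "0.15.1",
    "torchaudio": "2.0.1",
    "transformers": "4.31.0",
}


def check_versions(expected):
    """Map each package name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        # Drop a local build tag such as "+cu117" before comparing.
        base = installed.split("+", 1)[0]
        report[package] = "ok" if base == wanted else f"mismatch (found {installed})"
    return report


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
```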
How to use models
How to use pruned models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select the base model, the pruning method, and the compression degree.
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2
model_path = f'compressed-llm/{base_model}_{comp_method}'

# Each compression degree is stored as a separate revision named s<degree>.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# The tokenizer is loaded from the original (uncompressed) base model.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
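The example above builds the repo name and the revision from three variables. A minimal sketch that wraps this naming convention in one function is shown below. The helper `pruned_model_ref` is hypothetical, and the convention is inferred only from the example above; which methods and degrees are actually published is not listed here, so check the model hub before relying on a given combination.

```python
from typing import Tuple


def pruned_model_ref(base_model: str, comp_method: str, comp_degree: float) -> Tuple[str, str]:
    """Return (repo_id, revision) following the pattern used in the example above.

    Hypothetical helper; it only formats strings and does not check
    that the repo or revision exists on the hub.
    """
    if not 0 < comp_degree < 1:
        raise ValueError("comp_degree is expected to be a fraction between 0 and 1")
    repo_id = f"compressed-llm/{base_model}_{comp_method}"
    revision = f"s{comp_degree}"
    return repo_id, revision


repo_id, revision = pruned_model_ref("llama-2-7b", "magnitude_unstructured", 0.2)
print(repo_id)   # compressed-llm/llama-2-7b_magnitude_unstructured
print(revision)  # s0.2
```

The two returned strings can be passed as the first argument and the `revision` argument of `AutoModelForCausalLM.from_pretrained`.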
How to use wanda+gptq models
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # Either disable the ExLlama kernels as below,
    # or pass inject_fused_attention=False instead.
    disable_exllama=True,
    device_map='auto',
)

# The tokenizer is loaded from the original (uncompressed) base model.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
How to use gptq models
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Alternative repo (wanda+gptq), loaded the same way:
# model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
# tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # Either disable the ExLlama kernels as below,
    # or pass inject_fused_attention=False instead.
    disable_exllama=True,
    device_map='auto',
    # The revision selects the quantization configuration.
    revision='2bit_128g',
)

# The tokenizer is loaded from the original (uncompressed) base model.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
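Revision names such as `2bit_128g` appear to encode the bit width and the quantization group size. This reading is an assumption based on the examples above and on common GPTQ naming, not something this README states. The sketch below parses such a name and gives a rough estimate of weight storage. Both helper names are hypothetical, the 7e9 parameter count is a round figure assumed for a 7B model, and the estimate ignores per-group scales, zero points, and any layers kept in higher precision.

```python
import re
from typing import Tuple

# Assumed pattern: "<bits>bit_<group size>g", for example "2bit_128g".
_REVISION_RE = re.compile(r"^(\d+)bit_(\d+)g$")


def parse_gptq_revision(revision: str) -> Tuple[int, int]:
    """Split a revision name into (bits, group_size). Hypothetical helper."""
    match = _REVISION_RE.match(revision)
    if match is None:
        raise ValueError(f"unexpected revision format: {revision!r}")
    return int(match.group(1)), int(match.group(2))


def approx_weight_gb(num_params: float, bits: int) -> float:
    """Rough size of the quantized weights in GB (1e9 bytes), without overhead."""
    return num_params * bits / 8 / 1e9


bits, group_size = parse_gptq_revision("2bit_128g")
print(bits, group_size)             # 2 128
print(approx_weight_gb(7e9, bits))  # 1.75
```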
Citations
If you are using models in this hub, please consider citing our papers.
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
          and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
          and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
          and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
Some of the models were used in previous publications.
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
Acknowledgement
Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help with some of the preparation.
For any questions, please contact Junyuan Hong.