---
title: README
emoji: π
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
license: mit
short_description: Compressed Large Language Models
---
# Compressed Large Language Models

This repo contains compressed LLMs used in the [Decoding Compressed Trust](https://decoding-comp-trust.github.io/) project.
The models are prepared by the [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and the
[Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).

License: [MIT License](https://opensource.org/license/mit/)
At a glance:

* Models: Llama-2 13B, Llama-2 Chat 13B, Vicuna 13B v1.3
* Compression methods:
  - Pruning: magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3, 4, 8 bits)
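
The usage snippets below each load one specific checkpoint. As a rough guide to how the pieces above fit together, here is a minimal sketch of the assumed repo/revision naming for pruned models (inferred from the pruning example below: repos follow `compressed-llm/{base_model}_{method}` with one revision per sparsity level); the exact repo and revision names on the hub may differ, so verify them before use.

```python
# Sketch only: assumed naming scheme inferred from the examples in this README.
base_model = 'llama-2-13b'                 # also e.g. 'llama-2-13b-chat', 'vicuna-13b-v1.3' (assumed IDs)
pruning_method = 'magnitude_unstructured'  # Wanda / SparseGPT variants use analogous names (assumed)
sparsity = 0.2                             # revisions follow the 's{sparsity}' pattern used below

repo_id = f'compressed-llm/{base_model}_{pruning_method}'
revision = f's{sparsity}'
print(repo_id, revision)  # e.g. compressed-llm/llama-2-13b_magnitude_unstructured s0.2
```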
## Setup environment
```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # for gptq
```
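
A quick check that the pinned versions installed correctly and that a GPU is visible (a minimal sanity-check sketch, not part of the original instructions):

```python
# Environment sanity check (optional).
import torch
import transformers

print(torch.__version__)          # expected: 2.0.0+cu117
print(transformers.__version__)   # expected: 4.31.0
print(torch.cuda.is_available())  # should be True for GPU inference
```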
## How to use models
### How to use pruned models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2  # target sparsity; the matching checkpoint is selected via the revision below
model_path = f'compressed-llm/{base_model}_{comp_method}'

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
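
To confirm that the pruned checkpoint actually carries the requested sparsity, you can count zero-valued weights after loading. This is a rough sketch using the `model` loaded above; layers such as embeddings and norms are typically left dense, so the measured ratio may not exactly match `comp_degree`.

```python
# Rough sparsity check for the pruned model loaded above (sketch only).
zeros, total = 0, 0
for name, param in model.named_parameters():
    if param.dim() == 2:  # weight matrices only
        zeros += (param == 0).sum().item()
        total += param.numel()
print(f'weight sparsity: {zeros / total:.2%}')  # roughly comp_degree for unstructured pruning
```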
### How to use Wanda + GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this, or disable_exllama below
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
### How to use GPTQ models
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # either this, or disable_exllama below
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # bit width and group size are selected via the revision
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```
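
Different GPTQ bit widths are stored as separate revisions of the same repo; the example above loads `2bit_128g`. Below is a minimal sketch for switching bit width, assuming the other revisions follow the same `{bits}bit_128g` pattern (an assumption; check the revisions actually available on the hub):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumed revision name following the '2bit_128g' pattern above; verify on the hub.
revision = '4bit_128g'

model = AutoGPTQForCausalLM.from_quantized(
    'compressed-llm/vicuna-7b-v1.3_gptq',
    disable_exllama=True,
    device_map='auto',
    revision=revision,
)
tokenizer = AutoTokenizer.from_pretrained('lmsys/vicuna-7b-v1.3', trust_remote_code=True)
```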
## Citations

If you use the models in this hub, please consider citing our papers.
```bibtex
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
          and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
          and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
          and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
```
Some of the models were used in previous publications.

```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```
## Acknowledgement

Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help with preparing the models.

For any questions, please contact [Junyuan Hong](mailto:jyhong@utexas.edu).