
Compressed LLM Model Zone

The models are prepared by the Visual Informatics Group @ University of Texas at Austin (VITA-group). Credits to Ajay Jaiswal, Zhenyu Zhang, Zhangheng Li, Lu Yin, Shiwei Liu, and Junyuan Hong.

License: MIT License

Set up the environment

pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # for gptq
pip install sentencepiece
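
Optionally, a quick sanity check (not specific to these models) confirms that the pinned versions installed and that a CUDA device is visible:

import torch
import transformers

# Should report the pinned versions and an available CUDA device.
print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('transformers', transformers.__version__)  # expected: 4.31.0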

How to use pruned models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'  # or 'sparsegpt_unstructured', 'wanda_unstructured'
comp_degree = 0.2  # sparsity ratio; released revisions: s0.1, s0.2, s0.3, s0.5, s0.6
model_path = f'vita-group/{base_model}_{comp_method}'
model = AutoModelForCausalLM.from_pretrained(
        model_path, 
        revision=f's{comp_degree}',
        torch_dtype=torch.float16, 
        low_cpu_mem_usage=True, 
        device_map="auto"
    )
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
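
The same pattern covers every unstructured-pruning checkpoint listed in the table at the end of this card. A minimal sketch that loads each released combination in turn; the prompt and generation length are placeholders, and the loop assumes enough GPU memory to host one 7b model at a time:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
prompt_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()

for comp_method in ['magnitude_unstructured', 'sparsegpt_unstructured', 'wanda_unstructured']:
    for comp_degree in [0.1, 0.2, 0.3, 0.5, 0.6]:
        model = AutoModelForCausalLM.from_pretrained(
            f'vita-group/llama-2-7b_{comp_method}',
            revision=f's{comp_degree}',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map='auto',
        )
        outputs = model.generate(prompt_ids, max_new_tokens=32)
        print(comp_method, comp_degree, tokenizer.decode(outputs[0], skip_special_tokens=True))
        del model
        torch.cuda.empty_cache()  # free GPU memory before loading the next checkpoint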

How to use wanda+gptq models

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
model_path = 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        # either of the two options below can be used:
        # inject_fused_attention=False,
        disable_exllama=True,
        device_map='auto',
    )
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
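
The tokenize, generate, and decode steps above can be wrapped in a small helper that works for both the pruned and the quantized checkpoints. This is only an illustrative sketch; the function name and defaults are not part of the released models or of auto-gptq:

def chat_once(model, tokenizer, prompt, max_new_tokens=128):
    """Run a single prompt through a (compressed) causal LM and return the decoded text."""
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to('cuda')
    outputs = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# reuses the model and tokenizer loaded above
print(chat_once(model, tokenizer, 'Hello! I am a VITA-compressed-LLM chatbot!'))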

How to use gptq models

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
# The wanda+gptq model from the previous section can be loaded the same way:
# model_path = 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'
# tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model_path = 'vita-group/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        # either of the two options below can be used:
        # inject_fused_attention=False,
        disable_exllama=True,
        device_map='auto',
        revision='2bit_128g',  # released revisions range from 2bit_128g to 14bit_128g (see the table below)
    )
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
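
To compare quantization levels, the same call can be repeated over the bit widths released for vicuna-7b-v1.3 (see the table below). A sketch, assuming each checkpoint fits on the available GPU:

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

tokenizer = AutoTokenizer.from_pretrained('lmsys/vicuna-7b-v1.3', trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')

for bits in [2, 3, 4, 6, 8, 10, 12, 14]:
    model = AutoGPTQForCausalLM.from_quantized(
        'vita-group/vicuna-7b-v1.3_gptq',
        disable_exllama=True,
        device_map='auto',
        revision=f'{bits}bit_128g',
    )
    outputs = model.generate(input_ids=input_ids, max_length=128)
    print(f'--- {bits}bit_128g ---')
    print(tokenizer.decode(outputs[0]))
    del model
    torch.cuda.empty_cache()  # free GPU memory before loading the next bit width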

Available models

| Base Model | Model Size | Compression Method | Compression Degree |
|---|---|---|---|
| Llama-2 | 7b | magnitude_unstructured | s0.1 |
| Llama-2 | 7b | magnitude_unstructured | s0.2 |
| Llama-2 | 7b | magnitude_unstructured | s0.3 |
| Llama-2 | 7b | magnitude_unstructured | s0.5 |
| Llama-2 | 7b | magnitude_unstructured | s0.6 |
| Llama-2 | 7b | sparsegpt_unstructured | s0.1 |
| Llama-2 | 7b | sparsegpt_unstructured | s0.2 |
| Llama-2 | 7b | sparsegpt_unstructured | s0.3 |
| Llama-2 | 7b | sparsegpt_unstructured | s0.5 |
| Llama-2 | 7b | sparsegpt_unstructured | s0.6 |
| Llama-2 | 7b | wanda_gptq | 4bit_128g |
| Llama-2 | 7b | wanda_unstructured | s0.1 |
| Llama-2 | 7b | wanda_unstructured | s0.2 |
| Llama-2 | 7b | wanda_unstructured | s0.3 |
| Llama-2 | 7b | wanda_unstructured | s0.5 |
| Llama-2 | 7b | wanda_unstructured | s0.6 |
| vicuna-v1.3 | 13b | gptq | 10bit_128g |
| vicuna-v1.3 | 13b | gptq | 12bit_128g |
| vicuna-v1.3 | 13b | gptq | 14bit_128g |
| vicuna-v1.3 | 13b | gptq | 2bit_128g |
| vicuna-v1.3 | 13b | gptq | 3bit_128g |
| vicuna-v1.3 | 13b | gptq | 4bit_128g |
| vicuna-v1.3 | 13b | gptq | 6bit_128g |
| vicuna-v1.3 | 13b | gptq | 8bit_128g |
| vicuna-v1.3 | 7b | gptq | 10bit_128g |
| vicuna-v1.3 | 7b | gptq | 12bit_128g |
| vicuna-v1.3 | 7b | gptq | 14bit_128g |
| vicuna-v1.3 | 7b | gptq | 2bit_128g |
| vicuna-v1.3 | 7b | gptq | 3bit_128g |
| vicuna-v1.3 | 7b | gptq | 4bit_128g |
| vicuna-v1.3 | 7b | gptq | 6bit_128g |
| vicuna-v1.3 | 7b | gptq | 8bit_128g |