How to improve inference runtime performance?

#67
by redraptor - opened

I've attempted several methods, including https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47 and https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert, but it seems BetterTransformer doesn't work with mpt-7b yet. Has anyone here had success with these, or have other suggestions for improving inference speed? Thanks

import torch
import transformers
from transformers import AutoModelForCausalLM

name = 'mosaicml/mpt-7b-instruct'

# Load the MPT config first so device and sequence length can be set before the weights load
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:6'
config.max_seq_len = 512

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Tokenizer for the same checkpoint (use_fast belongs here, not in the pipeline call)
tokenizer = transformers.AutoTokenizer.from_pretrained(name, use_fast=True)

generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    stopping_criteria=stopping_criteria,  # defined elsewhere
    temperature=0.5,
    top_p=0,
    top_k=0,
    max_new_tokens=1250,
    repetition_penalty=1.0,
)
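
For completeness, a minimal call against this pipeline would look something like the sketch below (the prompt is just an illustration, not from my actual workload):

prompt = "Explain the difference between a list and a tuple in Python."
result = generate_text(prompt)
print(result[0]['generated_text'])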

Yeah, I'm facing the same issue. Generation rates in Google Colab with a 15 GB GPU are only about 1 token per second. That's really terrible. I'm using 4-bit quantisation, which means the model at least fits in memory, but generation is still very slow.

I think it may be partly because I'm not using the Triton attention implementation:

config.attn_config['attn_impl'] = 'triton'

However, using triton fails when I try - see here.
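
For anyone else who wants to try it, here is a rough sketch of how the Triton path is typically enabled for MPT (loosely following the model card; it also needs the triton/flash-attn dependencies installed, and I haven't got it working myself):

import torch
import transformers
from transformers import AutoModelForCausalLM

# Sketch: switch on the Triton attention kernel before loading weights
name = 'mosaicml/mpt-7b-instruct'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # the Triton path is typically run in bf16/fp16, not fp32
    trust_remote_code=True,
)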

BTW, here is the config that is giving me 1 tok/s:

import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'mosaicml/mpt-7b-instruct'  # the checkpoint discussed in this thread

# Load the model in 4-bit to allow it to fit in a free Google Colab runtime with a CPU and T4 GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = 'cuda:0'  # Unclear whether this really helps or how it interacts with device_map.
config.max_seq_len = 1024

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    quantization_config=bnb_config,
    device_map='auto',  # for inference use 'auto'; for training use device_map={"": 0}
    trust_remote_code=True,
    cache_dir=cache_dir,  # cache_dir is set elsewhere
)
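
As a quick sanity check, this is roughly how I'd measure tokens/sec with that setup (the prompt and token count are arbitrary, just for timing):

import time

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain what 4-bit quantisation does.", return_tensors='pt').to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
print(f"{new_tokens / elapsed:.2f} tok/s")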

On the other hand, a back-of-the-envelope estimate: a T4 has about 8 TFLOPS of compute, and we need roughly 1,000 prompt tokens x 7B params x 2 (for multiply + add) x ~1/2 (for the quantisation benefit) = 7T floating-point operations per output token. So maybe 1 tok/s is about right? I'd be interested in whether Triton helps more (quantisation down to 4-bit should give a 4x improvement, not the 2x assumed above).
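
The same estimate written out in Python (all numbers are the rough assumptions above, not measurements):

prompt_tokens = 1_000
params = 7e9           # MPT-7B parameter count
flops_per_param = 2    # one multiply + one add
quant_benefit = 0.5    # assumed ~2x speedup from quantisation
t4_flops = 8e12        # ~8 TFLOPS on a T4

flops_per_output_token = prompt_tokens * params * flops_per_param * quant_benefit
print(flops_per_output_token)              # ~7e12 FLOPs, i.e. ~7T per output token
print(t4_flops / flops_per_output_token)   # ~1.1, so ~1 tok/s is plausible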

Do you know why my model goes crazy after using config.attn_config['attn_impl'] = 'triton'? My output is ��� exceedsельельителельителしているしている性しているしているしているしているâしているしている性

@sam-mosaic any tips here? Appreciate it, Ronan
