How to quantize and accelerate this model

I tried FasterTransformer but it failed. Any ideas?

Yeah, any ideas on how to accelerate the model? Translation in Google Colab (both CPU and GPU) is extremely slow, longer than it would take me to translate manually.

Hi @gembird

Currently the only way to accelerate inference on CPU & GPU is to use the BetterTransformer API for the encoder part of MBart. I believe this will not speed up translation much, though, as most of the bottleneck is on the decoder side.
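
For reference, a minimal sketch of that route, assuming the optimum package is installed (it provides the BetterTransformer backend that model.to_bettertransformer() relies on):

from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# converts the supported attention layers to the BetterTransformer fast path;
# generate() is then called exactly as usual
model = model.to_bettertransformer()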

On GPU, if you are running your translations with batch_size=1, you can try quantization and the fast kernels from bitsandbytes, making sure you load your model with bnb_4bit_compute_dtype=torch.float16:

import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

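# load the model with the 4-bit config above (requires a CUDA GPU and bitsandbytes installed)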
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", quantization_config=quantization_config)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_hi,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire dans la Syrie."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ar,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

More generally, we are currently migrating attention layers in transformers core to torch.scaled_dot_product_attention, which should lead to much faster inference. Please have a look at https://github.com/huggingface/transformers/pull/26572 for further details; once support is added for most architectures, including MBart, you will be able to test that feature directly.
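
Once that support lands, opting into the SDPA path should look roughly like the sketch below; the attn_implementation argument is what that work exposes, so whether MBart accepts it depends on your installed version of transformers:

import torch
from transformers import MBartForConditionalGeneration

# asks transformers to route attention through torch.scaled_dot_product_attention;
# assumes a transformers version where SDPA support for MBart has been merged
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")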

Hi @ybelkada , I gave the quantization example you provided a shot and I'm getting a weird result. GPU inference without quantization works fine, but when I add the quantization config I now get something like ['okay', 'okay'] when I run inference on a sample sentence. It seems to be just random tokens, so I'm wondering if there's an issue with the quantization configuration.
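
For reference, a minimal side-by-side comparison along these lines, with a placeholder sentence and language pair, is:

import torch
from transformers import BitsAndBytesConfig, MBart50TokenizerFast, MBartForConditionalGeneration

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="hi_IN")
inputs = tokenizer("संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है", return_tensors="pt")

# same checkpoint loaded twice: plain fp16 vs the 4-bit bitsandbytes config
fp16_model = MBartForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")
bnb_model = MBartForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

for name, model in [("fp16", fp16_model), ("4bit", bnb_model)]:
    generated = model.generate(
        **{k: v.to(model.device) for k, v in inputs.items()},
        forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
    )
    print(name, tokenizer.batch_decode(generated, skip_special_tokens=True))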
