How to quantize and accelerate this model

I tried FasterTransformer but it failed. Any ideas?

Yeah, any ideas on how to accelerate the model? Translation in Google Colab (both CPU and GPU) is extremely slow, longer than it would take me to translate manually.

Hi @gembird

Currently the only way to accelerate inference on CPU & GPU is to use the BetterTransformer API for the encoder part of MBart. I believe this will not speed up translation much, though, as most of the bottleneck is on the decoder side.
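
For reference, a minimal sketch of that route, assuming the optimum package is installed (it provides the BetterTransformer backend that model.to_bettertransformer() relies on):

from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# converts the supported attention layers to the BetterTransformer fast path;
# generate() is then called exactly as usual
model = model.to_bettertransformer()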

On GPU, if you are running your translations with batch_size=1, you can try quantization and the fast kernels from bitsandbytes, making sure you load your model with bnb_4bit_compute_dtype=torch.float16:

import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

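# load the model with the 4-bit config above (requires a CUDA GPU and bitsandbytes installed)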
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", quantization_config=quantization_config)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_hi,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire dans la Syrie."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ar,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

More generally, we are currently migrating attention layers in transformers core to torch.scaled_dot_product_attention, which should lead to much faster inference. Please have a look at https://github.com/huggingface/transformers/pull/26572 for further details; once support is added for most architectures, including MBart, you will be able to test that feature directly.
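
Once that support lands, opting into the SDPA path should look roughly like the sketch below; the attn_implementation argument is what that work exposes, so whether MBart accepts it depends on your installed version of transformers:

import torch
from transformers import MBartForConditionalGeneration

# asks transformers to route attention through torch.scaled_dot_product_attention;
# assumes a transformers version where SDPA support for MBart has been merged
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")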

Hi @ybelkada , I gave the quantization example you provided a shot and I'm getting a weird result. GPU inference without quantization works fine, but when I add the quantization config I now get something like ['okay', 'okay'] when I run inference on a sample sentence. It seems to be just random tokens, so I'm wondering if there's an issue with the quantization configuration.
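
For reference, a minimal side-by-side comparison along these lines, with a placeholder sentence and language pair, is:

import torch
from transformers import BitsAndBytesConfig, MBart50TokenizerFast, MBartForConditionalGeneration

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="hi_IN")
inputs = tokenizer("संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है", return_tensors="pt")

# same checkpoint loaded twice: plain fp16 vs the 4-bit bitsandbytes config
fp16_model = MBartForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")
bnb_model = MBartForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

for name, model in [("fp16", fp16_model), ("4bit", bnb_model)]:
    generated = model.generate(
        **{k: v.to(model.device) for k, v in inputs.items()},
        forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
    )
    print(name, tokenizer.batch_decode(generated, skip_special_tokens=True))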
