---
license: mit
language:
- wo
- fr
metrics:
- bleu
pipeline_tag: translation
tags:
- text-generation-inference
---

# Model Documentation: Wolof to French Translation with NLLB-200

## Model Overview

This document describes a machine translation model fine-tuned from Meta's NLLB-200 for translating from Wolof to French. The model, hosted at `cifope/nllb-200-wo-fr-distilled-600M`, is based on the distilled 600M version of NLLB-200 and has been fine-tuned for translation between Wolof and French.

## Dependencies

The model requires the `transformers` library by Hugging Face. Install it with:

```bash
pip install transformers
```

The examples below also need PyTorch, because the tokenizer is called with `return_tensors='pt'`, and `sentencepiece`, which the slow `NllbTokenizer` depends on.

## Setup

Import the necessary classes from the `transformers` library:

```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
```

Initialize the model and tokenizer. The model weights come from the fine-tuned checkpoint, while the tokenizer is loaded from the base NLLB-200 checkpoint:

```python
model = AutoModelForSeq2SeqLM.from_pretrained('cifope/nllb-200-wo-fr-distilled-600M')
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')
```

## Tokenizer Customization

To register a new language code such as `wol_Wol` in the tokenizer, use the `fix_tokenizer` function. Call it each time the tokenizer is initialized:

```python
def fix_tokenizer(tokenizer, new_lang='wol_Wol'):
    """Register `new_lang` as a language code in an NllbTokenizer.

    Note: this relies on internal attributes of NllbTokenizer, which
    may differ between versions of `transformers`.
    """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len - 1
    tokenizer.id_to_lang_code[old_len - 1] = new_lang
    # Keep "<mask>" in the last position of the vocabulary.
    tokenizer.fairseq_tokens_to_ids["<mask>"] = (
        len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
    )

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # Clear the added-token tables so the new code is not duplicated there.
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

fix_tokenizer(tokenizer)
```

## Translation Functions

### Translate from French to Wolof

The `translate`
function translates text from French to Wolof:

```python
def translate(text, src_lang='fra_Latn', tgt_lang='wol_Wol', a=16, b=1.5, max_input_length=1024, **kwargs):
    # Set the language codes used to tag the source and target text.
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start with the target language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Cap the output length as a linear function of the input length.
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
```

### Translate from Wolof to French

The `reversed_translate` function translates text from Wolof to French. It is identical to `translate` except that the default source and target languages are swapped:

```python
def reversed_translate(text, src_lang='wol_Wol', tgt_lang='fra_Latn', a=16, b=1.5, max_input_length=1024, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
```

## Usage

To translate text, call `translate` or `reversed_translate` with the input text and any optional parameters. Both functions return a list of strings, one per input. For example:

```python
french_text = "L'argent peut être échangé à la seule banque des îles située à Stanley"
wolof_translation = translate(french_text)
print(wolof_translation)

wolof_text = "alkaati yi tàmbali nañu xàll léegi kilifa gi ñów"
french_translation = reversed_translate(wolof_text)
print(french_translation)
```
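
Both functions cap the generated length with `max_new_tokens=int(a + b * n)`, where `n` is `inputs.input_ids.shape[1]`, the token length of the padded input batch. The sketch below only reproduces that arithmetic so the effect of `a` and `b` is easier to see. The helper name `output_budget` is hypothetical and not part of the model's code.

```python
def output_budget(input_len: int, a: float = 16, b: float = 1.5) -> int:
    """Return the max_new_tokens cap used above: int(a + b * input_len).

    `output_budget` is an illustrative helper, not part of the model card.
    """
    return int(a + b * input_len)


# Show how the cap grows with the input length in tokens.
for n in (1, 10, 100, 1024):
    print(f"input tokens: {n:5d} -> max_new_tokens: {output_budget(n)}")
```

With the defaults, a 10-token input allows up to 31 new tokens. Raising `b` leaves more room when the target language tends to produce longer token sequences than the source, while `a` sets a fixed floor for very short inputs.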