[QUESTION] How to skip the translation of some tokens?

#7
by hkad98 - opened

Hello community,
is there a way to skip the translation of selected tokens? Say my input sequence contains numbers written as digits, and I would like to keep them as digits in the output sequence. Unfortunately, that is not what happens with the following setup:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

text = "I have 3 dogs."

translator = pipeline("translation", model=model, tokenizer=tokenizer, src_lang="eng_Latn", tgt_lang="deu_Latn")
translator(text)
# [{'translation_text': 'Ich habe drei Hunde.'}]

As long as it does not introduce grammatical errors, you can first replace the tokens you want to keep with special placeholder tokens such as 1_1_1_1, and then, after translation, replace the placeholders with the original words.
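A minimal sketch of that placeholder idea, assuming the model copies the placeholders through unchanged (which is not guaranteed and should be checked on your own data). The helper names `mask_numbers`, `restore_numbers`, and `translate_keeping_numbers` are hypothetical, and the `translate` argument stands in for any function that maps a source string to a translated string:

```python
import re
from typing import Callable, List, Tuple

# Digits, optionally with "." or "," separators inside (e.g. 3, 1,000, 2.5).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Placeholders of the form i_i_i_i, e.g. 1_1_1_1 or 2_2_2_2 (assumed format).
PLACEHOLDER_RE = re.compile(r"(\d+)_\1_\1_\1")


def mask_numbers(text: str) -> Tuple[str, List[str]]:
    """Replace each number with a placeholder and remember the originals."""
    originals: List[str] = []

    def _mask(match: "re.Match[str]") -> str:
        originals.append(match.group(0))
        index = len(originals)  # 1-based index of this number
        return "_".join([str(index)] * 4)

    return NUMBER_RE.sub(_mask, text), originals


def restore_numbers(text: str, originals: List[str]) -> str:
    """Put the original numbers back in place of their placeholders."""

    def _restore(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        if 1 <= index <= len(originals):
            return originals[index - 1]
        return match.group(0)  # unknown placeholder: leave it untouched

    return PLACEHOLDER_RE.sub(_restore, text)


def translate_keeping_numbers(text: str, translate: Callable[[str], str]) -> str:
    """Mask numbers, translate the masked text, then restore the numbers."""
    masked, originals = mask_numbers(text)
    return restore_numbers(translate(masked), originals)
```

With the pipeline from the question, `translate` could be something like `lambda s: translator(s)[0]["translation_text"]`. If the model rewrites or drops a placeholder, the corresponding number is not restored, so it is worth checking the output for leftover placeholders.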

Did you find a solution?