How to skip the translation of some tokens?

#2
by deepak-llm-art - opened

I want some tokens to be left untranslated in the output.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "utrobinmv/t5_translate_en_ru_zh_large_1024"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "I have 3 PDF files."
# The target language is chosen by the text prefix, so the prompt (not the raw text) must be passed in.
prompt = f"translate to ru: {text}"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated = model.generate(input_ids)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

Output:

У меня есть 3 ПДФ-файла.

I would like it to look like this instead:

У меня есть 3 PDF файла.

From the point of view of the Russian language, this translation is also correct. To get your preferred style, you would need to fine-tune the model on examples written in that style. There may be other ways. Can someone else suggest another solution?
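As a rough starting point for the fine-tuning route, here is a minimal sketch of how the training pairs could be prepared. It assumes the same "translate to ru: " prefix used in the question. The helper name `build_finetune_examples` and the field names `input_text` / `target_text` are hypothetical, not part of the model or of any library, and the single pair is taken from this thread purely as an illustration.

```python
# Hypothetical helper: the function name and record field names are
# illustrative assumptions, not an API of transformers or of this model.
def build_finetune_examples(pairs, target_lang="ru"):
    """Turn (source, preferred_translation) pairs into prefixed training records."""
    prefix = f"translate to {target_lang}: "
    return [
        {"input_text": prefix + source, "target_text": target}
        for source, target in pairs
    ]


# Each pair holds the English source and the translation in the desired style
# (here, keeping "PDF" in Latin script).
style_pairs = [
    ("I have 3 PDF files.", "У меня есть 3 PDF файла."),
]

examples = build_finetune_examples(style_pairs)
print(examples[0]["input_text"])
print(examples[0]["target_text"])
```

In practice you would need many such pairs covering the terms you want to keep, and the records would then be tokenized and passed to whatever training loop you use.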
