The model and the tokenizer are based on facebook/nllb-200-distilled-600M.

We trained the model to use one sentence of context and a single terminology-constraint term. The context is prepended to the input sentence, separated from it by the sep_token. The term must be in the target language and is appended to the input sentence, again separated by the sep_token. If there is no terminology constraint, the sep_token must still be added after the input sentence. We used a subset of the OpenSubtitles2018 dataset for training, interleaved across all translation directions between the following languages: English, German, Dutch, Spanish, Italian, and Greek.
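The input layout described above can be captured in a small helper. This is a minimal sketch and not part of the released code: the function name `build_input` and its parameters are hypothetical, and the caller is expected to pass `tokenizer.sep_token` as `sep_token`.

```python
from typing import Optional


def build_input(sentence: str, sep_token: str,
                context: Optional[str] = None,
                term: Optional[str] = None) -> str:
    """Assemble the model input (hypothetical helper, not from the model repo).

    Layout: [context sep] sentence sep [term]
    The separator after the sentence is always present, even without a term.
    """
    parts = []
    if context:
        # Optional context sentence, followed by a separator.
        parts.extend([context, sep_token])
    # The sentence to translate, always followed by a separator.
    parts.extend([sentence, sep_token])
    if term:
        # Optional target-language term, placed last.
        parts.append(term)
    return ' '.join(parts)
```

Calling `build_input(sentence_text, tokenizer.sep_token, context=context_text, term=target_term)` should yield the same string as the hand-written f-strings in the usage code, and leaving out `context` or `term` covers the other three cases.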

The tokenizer of the base model was not changed. For the language codes, see the base model.
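For convenience, the six training languages can be mapped to their language codes as sketched below. The codes follow the FLORES-200 convention used by the base model; the base model card remains the authoritative list. The names `LANG_CODES` and `DIRECTIONS` are illustrative and do not come from the model repository.

```python
from itertools import permutations

# Language codes for the six languages the model was trained on
# (FLORES-200 convention, as used by the base model).
LANG_CODES = {
    'English': 'eng_Latn',
    'German': 'deu_Latn',
    'Dutch': 'nld_Latn',
    'Spanish': 'spa_Latn',
    'Italian': 'ita_Latn',
    'Greek': 'ell_Grek',
}

# Every ordered (source, target) pair between two different languages.
DIRECTIONS = list(permutations(LANG_CODES.values(), 2))

print(len(DIRECTIONS))  # 6 * 5 = 30 translation directions
```

Any pair from `DIRECTIONS` can be used as `src_lang` and `tgt_lang` in the usage code.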

Use this code for translation:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'voxreality/src_ctx_and_term_nllb_600M'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

max_length = 100
src_lang = 'eng_Latn'
tgt_lang = 'deu_Latn'
context_text = 'This is an optional context sentence.'
target_term = 'text'  # term to be used in the target language
sentence_text = 'Text to be translated.'

# if a context and a term are provided use the following:
input_text = f'{context_text} {tokenizer.sep_token} {sentence_text} {tokenizer.sep_token} {target_term}'
# if no context but a term is provided use the following:
# input_text = f'{sentence_text} {tokenizer.sep_token} {target_term}'
# if a context is provided but no term use the following:
# input_text = f'{context_text} {tokenizer.sep_token} {sentence_text} {tokenizer.sep_token}'
# if neither a context nor a term is provided use the following:
# input_text = f'{sentence_text} {tokenizer.sep_token}'

tokenizer.src_lang = src_lang
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
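# convert_tokens_to_ids is used here because lang_code_to_id is deprecated in recent transformers releases
model_output = model.generate(**inputs,
                              forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
                              max_length=max_length)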
output_text = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]

print(output_text)

You can also use the translation pipeline:

from transformers import pipeline

model_name = 'voxreality/src_ctx_and_term_nllb_600M'
translation_pipeline = pipeline("translation", model=model_name)
sep_token = translation_pipeline.tokenizer.sep_token
src_lang = 'eng_Latn'
tgt_lang = 'deu_Latn'
context_text = 'This is an optional context sentence.'
target_term = 'text'  # term to be used in the target language
sentence_text = 'Text to be translated.'


# if a context and a term are provided use the following:
input_texts = [f'{context_text} {sep_token} {sentence_text} {sep_token} {target_term}']
# if no context but a term is provided use the following:
# input_texts = [f'{sentence_text} {sep_token} {target_term}']
# if a context is provided but no term use the following:
# input_texts = [f'{context_text} {sep_token} {sentence_text} {sep_token}']
# if neither a context nor a term is provided use the following:
# input_texts = [f'{sentence_text} {sep_token}']

pipeline_output = translation_pipeline(input_texts, src_lang=src_lang, tgt_lang=tgt_lang)

print(pipeline_output[0]['translation_text'])