MarianMT¶

Bugs: If you see something strange, file a GitHub issue and assign @patrickvonplaten.

Translations should be similar, but not identical to output in the test set linked to in each model card.

Implementation Notes¶

  • Each model is about 298 MB on disk; there are more than 1,000 models.

  • The list of supported language pairs can be found here.

  • Models were originally trained by Jörg Tiedemann using the Marian C++ library, which supports fast training and translation.

  • All models are transformer encoder-decoders with 6 layers in each component. Each model’s performance is documented in a model card.

  • The 80 opus models that require BPE preprocessing are not supported.

  • The modeling code is the same as BartForConditionalGeneration with a few minor modifications:

    • static (sinusoid) positional embeddings (MarianConfig.static_position_embeddings=True)

    • a new final_logits_bias (MarianConfig.add_bias_logits=True)

    • no layernorm_embedding (MarianConfig.normalize_embedding=False)

    • the model starts generating with pad_token_id (which has 0 as a token_embedding) as the prefix (Bart uses </s>); see the config sketch after this list.

  • Code to bulk convert models can be found in convert_marian_to_pytorch.py.
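
As a quick, unofficial check, these settings can be read off a converted checkpoint's configuration, e.g. the Helsinki-NLP/opus-mt-en-de model used in the tokenizer example below:

from transformers import MarianConfig

# Inspect the Marian-specific flags described in the list above
config = MarianConfig.from_pretrained('Helsinki-NLP/opus-mt-en-de')
print(config.static_position_embeddings)  # static (sinusoid) positional embeddings
print(config.add_bias_logits)             # final_logits_bias flag
print(config.normalize_embedding)         # whether layernorm_embedding is used
print(config.pad_token_id)                # generation starts with this token id as the prefix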

Naming¶

  • All model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt}

  • The language codes used to name models are inconsistent. Two-digit codes can usually be found here; three-digit codes require googling “language code {code}”.

  • Codes formatted like es_AR are usually code_{region}. That one is Spanish from Argentina.

  • The models were converted in two stages. The first 1,000 models use ISO-639-2 codes to identify languages; the second group uses a combination of ISO-639-5 and ISO-639-2 codes.

Examples¶

Multilingual Models¶

  • All model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt}:

  • If a model can output multiple languages, you should specify a language code by prepending the desired output language to the src_text.

  • You can see a model’s supported language codes in its model card, under target constituents, like in opus-mt-en-roa.

  • Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.

New multi-lingual models from the Tatoeba-Challenge repo require 3-character language codes:

from transformers import MarianMTModel, MarianTokenizer
src_text = [
    '>>fra<< this is a sentence in english that we want to translate to french',
    '>>por<< This should go to portuguese',
    '>>esp<< And this to Spanish'
]

model_name = 'Helsinki-NLP/opus-mt-en-roa'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']

Code to see available pretrained models:

from transformers.hf_api import HfApi

# List every model on the hub and keep the ones under the Helsinki-NLP organization
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
# Old-style multilingual models use upper-case language-group names (e.g. en-ROMANCE)
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]

Old Style Multi-Lingual Models¶

These are the old-style multi-lingual models ported from the OPUS-MT-Train repo, along with the members of each language group:

['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}

Example of translating English to many Romance languages, using old-style two-character language codes.
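
A sketch that mirrors the new-style example above, assuming the Helsinki-NLP/opus-mt-en-ROMANCE checkpoint and the two-character codes listed under GROUP_MEMBERS['ROMANCE']:

from transformers import MarianMTModel, MarianTokenizer
src_text = [
    '>>fr<< this is a sentence in english that we want to translate to french',
    '>>pt<< This should go to portuguese',
    '>>es<< And this to Spanish'
]

model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]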

MarianConfig¶

class transformers.MarianConfig(activation_dropout=0.0, extra_pos_embeddings=2, activation_function='gelu', vocab_size=50265, d_model=1024, encoder_ffn_dim=4096, encoder_layers=12, encoder_attention_heads=16, decoder_ffn_dim=4096, decoder_layers=12, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=1024, init_std=0.02, classifier_dropout=0.0, num_labels=3, is_encoder_decoder=True, normalize_before=False, add_final_layer_norm=False, do_blenderbot_90_layernorm=False, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, force_bos_token_to_be_generated=False, use_cache=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **common_kwargs)[source]¶

This is the configuration class to store the configuration of a MarianMTModel. It is used to instantiate a Marian model according to the specified arguments, defining the model architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information. A brief usage sketch follows the parameter list below.

Parameters
  • vocab_size (int, optional, defaults to 58101) – Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MarianMTModel.

  • d_model (int, optional, defaults to 512) – Dimensionality of the layers and the pooler layer.

  • encoder_layers (int, optional, defaults to 6) – Number of encoder layers.

  • decoder_layers (int, optional, defaults to 6) – Number of decoder layers.

  • encoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.

  • decoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer decoder.

  • decoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the decoder.

  • encoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the encoder.

  • activation_function (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

  • dropout (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • activation_dropout (float, optional, defaults to 0.0) – The dropout ratio for activations inside the fully connected layer.

  • classifier_dropout (float, optional, defaults to 0.0) – The dropout ratio for the classifier.

  • max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • init_std (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • add_bias_logits (bool, optional, defaults to False) – Whether to add a final_logits_bias term to the output logits; specific to Marian.

  • normalize_before (bool, optional, defaults to False) – Call layernorm before attention ops.

  • normalize_embedding (bool, optional, defaults to False) – Call layernorm after embeddings.

  • static_position_embeddings (bool, optional, defaults to True) – Use sinusoidal positional embeddings instead of learned ones.

  • add_final_layer_norm (bool, optional, defaults to False) – Whether to apply a layer norm after the last encoder and decoder layers.

  • scale_embedding (bool, optional, defaults to False) – Scale the embeddings by sqrt(d_model).

  • eos_token_id (int, optional, defaults to 2) – End of stream token id.

  • pad_token_id (int, optional, defaults to 1) – Padding token id.

  • bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.

  • encoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

  • decoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

  • extra_pos_embeddings (int, optional, defaults to 2) – How many extra learned positional embeddings to use.

  • is_encoder_decoder (bool, optional, defaults to True) – Whether this is an encoder/decoder model.

  • force_bos_token_to_be_generated (bool, optional, defaults to False) – Whether or not to force BOS token to be generated at step 1 (after decoder_start_token_id).
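
A brief usage sketch (not an official example); the values below simply mirror the documented defaults, and the resulting model is randomly initialized:

from transformers import MarianConfig, MarianMTModel

# Build a configuration with Marian-style settings (see the parameter list above)
config = MarianConfig(
    vocab_size=58101,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    static_position_embeddings=True,
    normalize_embedding=False,
)

# Instantiate a model from the configuration (random weights; use from_pretrained for trained checkpoints)
model = MarianMTModel(config)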

MarianTokenizer¶

class transformers.MarianTokenizer(vocab, source_spm, target_spm, source_lang=None, target_lang=None, unk_token='<unk>', eos_token='</s>', pad_token='<pad>', model_max_length=512, **kwargs)[source]¶

Construct a Marian tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • source_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the source language.

  • target_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the target language.

  • source_lang (str, optional) – A string representing the source language.

  • target_lang (str, optional) – A string representing the target language.

  • unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • eos_token (str, optional, defaults to "</s>") – The end of sequence token.

  • pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.

  • model_max_length (int, optional, defaults to 512) – The maximum sentence length the model accepts.

  • additional_special_tokens (List[str], optional, defaults to ["<eop>", "<eod>"]) – Additional special tokens used by the tokenizer.

Examples:

>>> from transformers import MarianTokenizer
>>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
>>> src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."]
>>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]  # optional
>>> batch_enc: BatchEncoding = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")
>>> # keys  [input_ids, attention_mask, labels].
>>> # model(**batch_enc) should work
prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, return_tensors: Optional[str] = None, truncation=True, padding='longest', **unused) → transformers.tokenization_utils_base.BatchEncoding[source]¶

Prepare model inputs for translation. For best performance, translate one sentence at a time.

Parameters
  • src_texts (List[str]) – List of documents to summarize or source language texts.

  • tgt_texts (list, optional) – List of summaries or target language texts.

  • max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.

  • max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set to None, this will use the max_length value.

  • padding (bool, str or PaddingStrategy, optional, defaults to 'longest') –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).

    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.

    • False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).

  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of lists of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.

    • 'pt': Return PyTorch torch.Tensor objects.

    • 'np': Return Numpy np.ndarray objects.

  • truncation (bool, str or TruncationStrategy, optional, defaults to True) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.

    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.

    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.

    • False or 'do_not_truncate': No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).

  • **kwargs – Additional keyword arguments passed along to self.__call__.

Returns

A BatchEncoding with the following fields:

  • input_ids – List of token ids to be fed to the encoder.

  • attention_mask – List of indices specifying which tokens should be attended to by the model.

  • labels – List of token ids for tgt_texts.

The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys, as the sketch below demonstrates.

Return type

BatchEncoding
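
A short sketch (reusing the Helsinki-NLP/opus-mt-en-de tokenizer from the example above) showing which keys are returned with and without tgt_texts:

from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')

# Without tgt_texts only the encoder-side inputs are returned
enc_only = tok.prepare_seq2seq_batch(["I am a small frog."], return_tensors="pt")
print(list(enc_only.keys()))  # ['input_ids', 'attention_mask']

# With tgt_texts, labels for the decoder are added as well
full = tok.prepare_seq2seq_batch(
    ["I am a small frog."],
    tgt_texts=["Ich bin ein kleiner Frosch."],
    return_tensors="pt",
)
print(list(full.keys()))  # ['input_ids', 'attention_mask', 'labels']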

MarianMTModel¶

class transformers.MarianMTModel(config: transformers.models.bart.configuration_bart.BartConfig)[source]¶

PyTorch version of marian-nmt’s transformer.h (C++). Designed for the OPUS-NMT translation checkpoints. Available models are listed here.

This class overrides BartForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.

Examples:

>>> from transformers import MarianTokenizer, MarianMTModel
>>> from typing import List
>>> src = 'fr'  # source language
>>> trg = 'en'  # target language
>>> sample_text = "où est l'arrêt de bus ?"
>>> mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

>>> model = MarianMTModel.from_pretrained(mname)
>>> tok = MarianTokenizer.from_pretrained(mname)
>>> batch = tok.prepare_seq2seq_batch(src_texts=[sample_text], return_tensors="pt")  # don't need tgt_text for inference
>>> gen = model.generate(**batch)  # for forward pass: model(**batch)
>>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the bus stop ?"

TFMarianMTModel¶

class transformers.TFMarianMTModel(*args, **kwargs)[source]¶

Marian model for machine translation. This model inherits from TFBartForConditionalGeneration. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

Note

TF 2.0 models accept two formats as inputs:

  • having all inputs as keyword arguments (like PyTorch models), or

  • having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument (see the sketch at the end of this section):

  • a single Tensor with input_ids only and nothing else: model(input_ids)

  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

  • a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

Parameters

config (MarianConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
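
A brief sketch of two of the calling conventions described in the note above, assuming the Helsinki-NLP/opus-mt-en-de checkpoint used elsewhere on this page (add from_pt=True to from_pretrained if the checkpoint only hosts PyTorch weights):

import tensorflow as tf
from transformers import MarianTokenizer, TFMarianMTModel

model_name = 'Helsinki-NLP/opus-mt-en-de'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name)

batch = tokenizer.prepare_seq2seq_batch(["I am a small frog."], return_tensors="tf")

# Keyword arguments, like PyTorch models
gen = model.generate(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
print(tokenizer.batch_decode(gen, skip_special_tokens=True))

# A single dict in the first positional argument (handy with tf.keras.Model.fit());
# Marian starts decoding from pad_token_id, so it is used here as the decoder prefix
decoder_input_ids = tf.fill((1, 1), model.config.pad_token_id)
outputs = model({"input_ids": batch["input_ids"],
                 "attention_mask": batch["attention_mask"],
                 "decoder_input_ids": decoder_input_ids})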