MarianMT¶
Bugs: If you see something strange, file a GitHub Issue and assign @patrickvonplaten.
Translations should be similar, but not identical to output in the test set linked to in each model card.
Implementation Notes¶
Each model is about 298 MB on disk, and there are more than 1,000 models.
The list of supported language pairs can be found here.
Models were originally trained by Jörg Tiedemann using the Marian C++ library, which supports fast training and translation.
All models are transformer encoder-decoders with 6 layers in each component. Each model’s performance is documented in a model card.
The 80 opus models that require BPE preprocessing are not supported.
The modeling code is the same as BartForConditionalGeneration with a few minor modifications:
static (sinusoid) positional embeddings (MarianConfig.static_position_embeddings=True)
a new final_logits_bias (MarianConfig.add_bias_logits=True)
no layernorm_embedding (MarianConfig.normalize_embedding=False)
the model starts generating with pad_token_id (which has 0 as a token_embedding) as the prefix (Bart uses <s/>)
Code to bulk convert models can be found in convert_marian_to_pytorch.py.
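To make the first modification concrete, here is a minimal sketch of a fixed (non-learned) sinusoidal position table in plain Python. The function name sinusoidal_positions and the interleaved sine/cosine column layout are assumptions for illustration; the library may arrange the columns differently.

```python
import math
from typing import List


def sinusoidal_positions(n_positions: int, dim: int) -> List[List[float]]:
    """Build an n_positions x dim table of fixed sinusoidal position encodings.

    Assumption: even columns hold sines and odd columns hold cosines
    (the common interleaved layout). Nothing in this table is trained.
    """
    table = []
    for pos in range(n_positions):
        row = []
        for j in range(dim):
            # Each pair of columns shares one frequency.
            angle = pos / (10000 ** (2 * (j // 2) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


table = sinusoidal_positions(n_positions=4, dim=8)
```

Because the table depends only on the position and the dimension, it can be computed once and reused for every checkpoint with the same d_model.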
Naming¶
All model names use the following format:
Helsinki-NLP/opus-mt-{src}-{tgt}
The language codes used to name models are inconsistent. Two-character codes can usually be found here; three-character codes require googling “language code {code}”.
Codes formatted like es_AR are usually code_{region}. That one is Spanish from Argentina.
The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages; the second group uses a combination of ISO-639-5 codes and ISO-639-2 codes.
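As a small illustration of the naming scheme, the sketch below builds a model name from a language pair and splits a code such as es_AR into its language and region parts. Both helper names are hypothetical and not part of the library, and the split only holds for codes that really follow the code_{region} pattern.

```python
from typing import Optional, Tuple


def marian_model_name(src: str, tgt: str) -> str:
    """Return the hub name for a language pair, following the documented format."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def split_language_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'es_AR' into ('es', 'AR'); codes without a region give (code, None)."""
    language, _, region = code.partition("_")
    return language, (region or None)


name = marian_model_name("en", "de")
language, region = split_language_code("es_AR")
```

Not every pair has a published checkpoint, so a generated name still has to be checked against the list of supported language pairs.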
Examples¶
Since Marian models are smaller than many other translation models available in the library, they can be useful for fine-tuning experiments and integration tests.
Multilingual Models¶
Multilingual models use the same naming format, Helsinki-NLP/opus-mt-{src}-{tgt}.
If a model can output multiple languages, you should specify a language code by prepending the desired output language to the src_text.
You can see a model’s supported language codes in its model card, under target constituents, like in opus-mt-en-roa.
Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.
New multi-lingual models from the Tatoeba-Challenge repo require 3 character language codes:
from transformers import MarianMTModel, MarianTokenizer
src_text = [
'>>fra<< this is a sentence in english that we want to translate to french',
'>>por<< This should go to portuguese',
'>>esp<< And this to Spanish'
]
model_name = 'Helsinki-NLP/opus-mt-en-roa'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']
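The prefixing step in the example above can be factored into a small helper. This is a sketch only: add_target_language is a hypothetical name, and the optional check assumes the supported codes are supplied in the same >>xxx<< form used in src_text.

```python
from typing import Iterable, List, Optional


def add_target_language(
    texts: Iterable[str],
    code: str,
    supported: Optional[Iterable[str]] = None,
) -> List[str]:
    """Prepend the target-language token (e.g. '>>fra<<') to every source text."""
    token = f">>{code}<<"
    # Optional sanity check against a list of supported tokens (assumed format: '>>xxx<<').
    if supported is not None and token not in set(supported):
        raise ValueError(f"{token} is not a supported language code")
    return [f"{token} {text}" for text in texts]


prefixed = add_target_language(["This should go to portuguese"], "por")
```

Passing tokenizer.supported_language_codes as the supported argument is one way to catch a mistyped code before any text reaches the model.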
Code to see available pretrained models:
from transformers.hf_api import HfApi
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
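The filter on the last line keeps every name whose suffix contains an uppercase character. It can be checked offline on a short hand-written list; the list below is illustrative, and the entry from another organization is made up.

```python
org = "Helsinki-NLP"

# A hand-written sample standing in for the ids returned by the hub.
model_ids = [
    "Helsinki-NLP/opus-mt-en-de",
    "Helsinki-NLP/opus-mt-en-ROMANCE",
    "Helsinki-NLP/opus-mt-fi-ZH",
    "some-other-org/opus-mt-en-fr",  # hypothetical id, dropped by the org filter
]

helsinki_ids = [m for m in model_ids if m.startswith(org)]
suffix = [m.split("/")[1] for m in helsinki_ids]

# A suffix that differs from its lowercase form contains an uppercase group name.
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
```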
Old Style Multi-Lingual Models¶
These are the old style multi-lingual models ported from the OPUS-MT-Train repo, followed by the members of each language group:
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
'Helsinki-NLP/opus-mt-ROMANCE-en',
'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
'Helsinki-NLP/opus-mt-de-ZH',
'Helsinki-NLP/opus-mt-en-CELTIC',
'Helsinki-NLP/opus-mt-en-ROMANCE',
'Helsinki-NLP/opus-mt-es-NORWAY',
'Helsinki-NLP/opus-mt-fi-NORWAY',
'Helsinki-NLP/opus-mt-fi-ZH',
'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
'Helsinki-NLP/opus-mt-sv-NORWAY',
'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
Example of translating English to many Romance languages, using old-style 2 character language codes
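The example this sentence introduces does not appear on this page. The sketch below is modelled on the multilingual example earlier in this section, swapping in the en-ROMANCE checkpoint from the list above and 2 character codes taken from the ROMANCE group. The translate wrapper is a hypothetical helper; calling it downloads the checkpoint and needs transformers, torch and network access, so the call is left commented out.

```python
from typing import List

MODEL_NAME = "Helsinki-NLP/opus-mt-en-ROMANCE"

# Old-style models take 2 character codes; 'fr', 'pt' and 'es' are all
# listed under 'ROMANCE' in GROUP_MEMBERS.
src_text = [
    ">>fr<< this is a sentence in english that we want to translate to french",
    ">>pt<< This should go to portuguese",
    ">>es<< And this to Spanish",
]


def translate(texts: List[str], model_name: str = MODEL_NAME) -> List[str]:
    """Download the checkpoint and translate the given texts."""
    # Imported lazily so that the rest of the snippet runs without the library.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer.prepare_seq2seq_batch(texts, return_tensors="pt")
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]


# tgt_text = translate(src_text)  # uncomment to download the model and translate
```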
MarianConfig¶
class transformers.MarianConfig(activation_dropout=0.0, extra_pos_embeddings=2, activation_function='gelu', vocab_size=50265, d_model=1024, encoder_ffn_dim=4096, encoder_layers=12, encoder_attention_heads=16, decoder_ffn_dim=4096, decoder_layers=12, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=1024, init_std=0.02, classifier_dropout=0.0, num_labels=3, is_encoder_decoder=True, normalize_before=False, add_final_layer_norm=False, do_blenderbot_90_layernorm=False, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, force_bos_token_to_be_generated=False, use_cache=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **common_kwargs)
This is the configuration class to store the configuration of a MarianMTModel. It is used to instantiate a Marian model according to the specified arguments, defining the model architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Parameters
vocab_size (int, optional, defaults to 58101) – Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the input_ids passed when calling MarianMTModel.
d_model (int, optional, defaults to 512) – Dimensionality of the layers and the pooler layer.
encoder_layers (int, optional, defaults to 6) – Number of encoder layers.
decoder_layers (int, optional, defaults to 6) – Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the decoder.
encoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the encoder.
activation_function (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
dropout (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
activation_dropout (float, optional, defaults to 0.0) – The dropout ratio for activations inside the fully connected layer.
classifier_dropout (float, optional, defaults to 0.0) – The dropout ratio for the classifier.
max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (bool, optional, defaults to False) – Whether to add a final_logits_bias to the output logits. Specific to Marian.
normalize_before (bool, optional, defaults to False) – Call layernorm before attention ops.
normalize_embedding (bool, optional, defaults to False) – Call layernorm after embeddings.
static_position_embeddings (bool, optional, defaults to True) – Don’t learn positional embeddings, use sinusoidal.
add_final_layer_norm (bool, optional, defaults to False) – Whether to add another layernorm.
scale_embedding (bool, optional, defaults to False) – Scale embeddings by multiplying by sqrt(d_model).
eos_token_id (int, optional, defaults to 2) – End of stream token id.
pad_token_id (int, optional, defaults to 1) – Padding token id.
bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.
encoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the encoder. See the LayerDrop paper for more details.
decoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the decoder. See the LayerDrop paper for more details.
extra_pos_embeddings (int, optional, defaults to 2) – How many extra learned positional embeddings to use.
is_encoder_decoder (bool, optional, defaults to True) – Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (bool, optional, defaults to False) – Whether or not to force BOS token to be generated at step 1 (after decoder_start_token_id).
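To show how the size parameters relate to each other, the sketch below collects the documented defaults from the parameter list into a plain dict and derives two quantities from them. The dict and the head_dim helper are illustrative stand-ins, not the library's config class.

```python
# Documented defaults from the parameter list, held in a plain dict
# (an illustrative stand-in, not a MarianConfig instance).
marian_defaults = {
    "vocab_size": 58101,
    "d_model": 512,
    "encoder_layers": 6,
    "decoder_layers": 6,
    "encoder_attention_heads": 8,
    "decoder_attention_heads": 8,
    "encoder_ffn_dim": 2048,
    "decoder_ffn_dim": 2048,
    "max_position_embeddings": 512,
}


def head_dim(d_model: int, num_heads: int) -> int:
    """Per-head width; d_model must split evenly across the attention heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


encoder_head_dim = head_dim(
    marian_defaults["d_model"], marian_defaults["encoder_attention_heads"]
)
# Number of weights in a single vocab_size x d_model token embedding matrix.
embedding_weights = marian_defaults["vocab_size"] * marian_defaults["d_model"]
```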
MarianTokenizer¶
class transformers.MarianTokenizer(vocab, source_spm, target_spm, source_lang=None, target_lang=None, unk_token='<unk>', eos_token='</s>', pad_token='<pad>', model_max_length=512, **kwargs)
Construct a Marian tokenizer. Based on SentencePiece.
This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Parameters
source_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the source language.
target_spm (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary for the target language.
source_lang (str, optional) – A string representing the source language.
target_lang (str, optional) – A string representing the target language.
unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
eos_token (str, optional, defaults to "</s>") – The end of sequence token.
pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.
model_max_length (int, optional, defaults to 512) – The maximum sentence length the model accepts.
additional_special_tokens (List[str], optional, defaults to ["<eop>", "<eod>"]) – Additional special tokens used by the tokenizer.
Examples:
>>> from transformers import BatchEncoding, MarianTokenizer
>>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
>>> src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
>>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]  # optional
>>> batch_enc: BatchEncoding = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")
>>> # keys [input_ids, attention_mask, labels].
>>> # model(**batch_enc) should work
prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, return_tensors: Optional[str] = None, truncation=True, padding='longest', **unused) → transformers.tokenization_utils_base.BatchEncoding
Prepare model inputs for translation. For best performance, translate one sentence at a time.
Parameters
src_texts (List[str]) – List of documents to summarize or source language texts.
tgt_texts (list, optional) – List of summaries or target language texts.
max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set to None, this will use the max_length value.
padding (bool, str or PaddingStrategy, optional, defaults to 'longest') – Activates and controls padding. Accepts the following values:
    True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
return_tensors (str or TensorType, optional) – If set, will return tensors instead of list of python integers. Acceptable values are:
    'tf': Return TensorFlow tf.constant objects.
    'pt': Return PyTorch torch.Tensor objects.
    'np': Return Numpy np.ndarray objects.
truncation (bool, str or TruncationStrategy, optional, defaults to True) – Activates and controls truncation. Accepts the following values:
    True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    False or 'do_not_truncate': No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
**kwargs – Additional keyword arguments passed along to self.__call__.
Returns
A BatchEncoding with the following fields:
input_ids – List of token ids to be fed to the encoder.
attention_mask – List of indices specifying which tokens should be attended to by the model.
labels – List of token ids for tgt_texts.
The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys.
Return type
BatchEncoding
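The effect of padding='longest' and of the attention_mask field can be illustrated with a toy function over lists of token ids. This is a sketch, not the tokenizer's implementation: pad_batch is a hypothetical helper, the ids are made up, and the truncation here simply cuts each sequence, without any of the special-token handling a real tokenizer performs.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[int]],
    pad_token_id: int,
    max_length: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Truncate to max_length (if given), then right-pad to the longest sequence."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8]], pad_token_id=0)
```

With a single sequence in the batch, the longest sequence is the sequence itself, so no padding is added.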
MarianMTModel¶
class transformers.MarianMTModel(config: transformers.models.bart.configuration_bart.BartConfig)
PyTorch version of marian-nmt’s transformer.h (C++). Designed for the OPUS-NMT translation checkpoints. Available models are listed here.
This class overrides BartForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.
Examples:
>>> from transformers import MarianTokenizer, MarianMTModel
>>> from typing import List
>>> src = 'fr'  # source language
>>> trg = 'en'  # target language
>>> sample_text = "où est l'arrêt de bus ?"
>>> mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
>>> model = MarianMTModel.from_pretrained(mname)
>>> tok = MarianTokenizer.from_pretrained(mname)
>>> batch = tok.prepare_seq2seq_batch(src_texts=[sample_text], return_tensors="pt")  # don't need tgt_text for inference
>>> gen = model.generate(**batch)  # for forward pass: model(**batch)
>>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the bus stop ?"
TFMarianMTModel¶
class transformers.TFMarianMTModel(*args, **kwargs)
Marian model for machine translation.
This model inherits from TFBartForConditionalGeneration. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
Note
TF 2.0 models accept two formats as inputs:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
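The three ways of packing inputs into the first positional argument can be pictured with a toy normalizer that maps each form to the same name-to-value dict. This is a sketch of the idea only, not how the library parses inputs: gather_inputs is a hypothetical function, plain lists stand in for tensors, and the name order is taken from the list example above.

```python
from typing import Any, Dict

# Order assumed from the example model([input_ids, attention_mask, token_type_ids]).
INPUT_NAMES = ["input_ids", "attention_mask", "token_type_ids"]


def gather_inputs(inputs: Any = None, **kwargs: Any) -> Dict[str, Any]:
    """Normalize keyword, single-value, list/tuple, or dict inputs into one dict."""
    if inputs is None:
        gathered: Dict[str, Any] = {}
    elif isinstance(inputs, dict):
        gathered = dict(inputs)
    elif isinstance(inputs, (list, tuple)):
        # Positional packing: values are matched to names in the documented order.
        gathered = dict(zip(INPUT_NAMES, inputs))
    else:
        # A single value is treated as input_ids only.
        gathered = {"input_ids": inputs}
    gathered.update(kwargs)
    return gathered


ids, mask = "IDS", "MASK"  # placeholders standing in for tensors
from_kwargs = gather_inputs(input_ids=ids, attention_mask=mask)
from_list = gather_inputs([ids, mask])
from_dict = gather_inputs({"input_ids": ids, "attention_mask": mask})
```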
Parameters
config (MarianConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.