Migrating from previous packages

Migrating from transformers v3.x to v4.x

A couple of changes were introduced when the switch from version 3 to version 4 was done. Below is a summary of the expected changes:

1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.

The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set.

This introduces two breaking changes:

  • The handling of overflowing tokens between the python and rust tokenizers is different.

  • The rust tokenizers do not accept integers in the encoding methods.

How to obtain the same behavior as v3.x in v4.x

In version v3.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

to obtain the same in version v4.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

2. SentencePiece is removed from the required dependencies

The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

This includes the slow versions of:

  • XLNetTokenizer

  • AlbertTokenizer

  • CamembertTokenizer

  • MBartTokenizer

  • PegasusTokenizer

  • T5Tokenizer

  • ReformerTokenizer

  • XLMRobertaTokenizer

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should install sentencepiece additionally:

In version v3.x:

pip install transformers

to obtain the same in version v4.x:

pip install transformers[sentencepiece]

or

pip install transformers sentencepiece

3. The architecture of the repo has been updated so that each model resides in its folder

The past and foreseeable addition of new models means that the number of files in the directory src/transformers keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.

This is a breaking change as importing intermediary layers using a model’s module directly needs to be done via a different path.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.

In version v3.x:

from transformers.modeling_bert import BertLayer

to obtain the same in version v4.x:

from transformers.models.bert.modeling_bert import BertLayer

4. Switching the return_dict argument to True by default

The return_dict argument enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice.

This is a breaking change as the limitation of that tuple is that it cannot be unpacked: value0, value1 = outputs will not work.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should specify the return_dict argument to False, either in the model configuration or during the forward pass.

In version v3.x:

model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs)

to obtain the same in version v4.x:

model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs, return_dict=False)

or

model = BertModel.from_pretrained("bert-base-cased", return_dict=False)
outputs = model(**inputs)

5. Removed some deprecated attributes

Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.

Here is a list of these attributes/methods/arguments and what their replacements should be:

In several models, the labels become consistent with the other models:

  • masked_lm_labels becomes labels in AlbertForMaskedLM and AlbertForPreTraining.

  • masked_lm_labels becomes labels in BertForMaskedLM and BertForPreTraining.

  • masked_lm_labels becomes labels in DistilBertForMaskedLM.

  • masked_lm_labels becomes labels in ElectraForMaskedLM.

  • masked_lm_labels becomes labels in LongformerForMaskedLM.

  • masked_lm_labels becomes labels in MobileBertForMaskedLM.

  • masked_lm_labels becomes labels in RobertaForMaskedLM.

  • lm_labels becomes labels in BartForConditionalGeneration.

  • lm_labels becomes labels in GPT2DoubleHeadsModel.

  • lm_labels becomes labels in OpenAIGPTDoubleHeadsModel.

  • lm_labels becomes labels in T5ForConditionalGeneration.

In several models, the caching mechanism becomes consistent with the other models:

  • decoder_cached_states becomes past_key_values in all BART-like, FSMT and T5 models.

  • decoder_past_key_values becomes past_key_values in all BART-like, FSMT and T5 models.

  • past becomes past_key_values in all CTRL models.

  • past becomes past_key_values in all GPT-2 models.

Regarding the tokenizer classes:

  • The tokenizer attribute max_len becomes model_max_length.

  • The tokenizer attribute return_lengths becomes return_length.

  • The tokenizer encoding argument is_pretokenized becomes is_split_into_words.

Regarding the Trainer class:

  • The Trainer argument tb_writer is removed in favor of the callback TensorBoardCallback(tb_writer=...).

  • The Trainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.

  • The Trainer attribute data_collator should be a callable.

  • The Trainer method _log is deprecated in favor of log.

  • The Trainer method _training_step is deprecated in favor of training_step.

  • The Trainer method _prediction_loop is deprecated in favor of prediction_loop.

  • The Trainer method is_local_master is deprecated in favor of is_local_process_zero.

  • The Trainer method is_world_master is deprecated in favor of is_world_process_zero.

Regarding the TFTrainer class:

  • The TFTrainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.

  • The Trainer method _log is deprecated in favor of log.

  • The TFTrainer method _prediction_loop is deprecated in favor of prediction_loop.

  • The TFTrainer method _setup_wandb is deprecated in favor of setup_wandb.

  • The TFTrainer method _run_model is deprecated in favor of run_model.

Regarding the TrainingArguments class:

  • The TrainingArguments argument evaluate_during_training is deprecated in favor of evaluation_strategy.

Regarding the Transfo-XL model:

  • The Transfo-XL configuration attribute tie_weight becomes tie_words_embeddings.

  • The Transfo-XL modeling method reset_length becomes reset_memory_length.

Regarding pipelines:

  • The FillMaskPipeline argument topk becomes top_k.

Migrating from pytorch-transformers to 🤗 Transformers

Here is a quick summary of what you should take care of when migrating from pytorch-transformers to 🤗 Transformers.

Positional order of some models’ keywords inputs (attention_mask, token_type_ids…) changed

To be able to use Torchscript (see #1010, #1204 and #1195) the specific order of some models keywords inputs (attention_mask, token_type_ids…) has been changed.

If you used to call the models with keyword names for keyword arguments, e.g. model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids), this should not cause any change.

If you used to call the models with positional inputs for keyword arguments, e.g. model(inputs_ids, attention_mask, token_type_ids), you may have to double check the exact order of input arguments.

Migrating from pytorch-pretrained-bert

Here is a quick summary of what you should take care of when migrating from pytorch-pretrained-bert to 🤗 Transformers

Models always output tuples

The main breaking change when migrating from pytorch-pretrained-bert to 🤗 Transformers is that the models forward method always outputs a tuple with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model are detailed in the models’ docstrings and the documentation.

In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in pytorch-pretrained-bert.

Here is a pytorch-pretrained-bert to 🤗 Transformers conversion example for a BertForSequenceClassification classification model:

# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in 🤗 Transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

# In 🤗 Transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs

Serialization

Breaking change in the from_pretrained()method:

  1. Models are now set in evaluation mode by default when instantiated with the from_pretrained() method. To train them don’t forget to set them back in training mode (model.train()) to activate the dropout modules.

  2. The additional *inputs and **kwargs arguments supplied to the from_pretrained() method used to be directly passed to the underlying model’s class __init__() method. They are now used to update the model configuration attribute first which can break derived model classes build based on the previous BertForSequenceClassification examples. More precisely, the positional arguments *inputs provided to from_pretrained() are directly forwarded the model __init__() method while the keyword arguments **kwargs (i) which match configuration class attributes are used to update said attributes (ii) which don’t match any configuration class attributes are forwarded to the model __init__() method.

Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method save_pretrained(save_directory) if you were using any other serialization method before.

Here is an example:

### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
# Train our model
train(model)

### Now let's save our model and tokenizer to a directory
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')

Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, BertAdam and OpenAIAdam, have been replaced by a single AdamW optimizer which has a few differences:

  • it only implements weights decay correction,

  • schedules are now externals (see below),

  • gradient clipping is now also external (see below).

The new optimizer AdamW matches PyTorch Adam optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping.

The schedules are now standard PyTorch learning rate schedulers and not part of the optimizer anymore.

Here is a conversion examples from BertAdam with a linear warmup and decay schedule to AdamW and the same schedule:

# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_training_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

### Previously BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, num_training_steps=num_training_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()