Migrating from previous packages
Migrating from transformers v3.x to v4.x
A few changes were introduced when switching from version 3 to version 4. Below is a summary of the expected changes:
1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.
The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set.
This introduces two breaking changes:
- The handling of overflowing tokens between the python and rust tokenizers is different.
- The rust tokenizers do not accept integers in the encoding methods.
How to obtain the same behavior as v3.x in v4.x
- The pipelines now contain additional features out of the box. See the token-classification pipeline with the `grouped_entities` flag (a short sketch follows the tokenizer example below).
- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the `use_fast` flag by setting it to `False`:
In version v3.x:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

to obtain the same in version v4.x:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```
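As an example of the new out-of-the-box pipeline features, here is a minimal sketch of the token-classification pipeline with `grouped_entities` (the default model is picked automatically here; pin a specific model in real code):

```python
from transformers import pipeline

# grouped_entities=True merges sub-word predictions (e.g. "Hu", "##gging")
# into whole entity spans instead of returning one result per sub-token.
ner = pipeline("ner", grouped_entities=True)
print(ner("Hugging Face is based in New York City."))
```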
2. SentencePiece is removed from the required dependencies
The requirement on the SentencePiece dependency has been lifted from `setup.py`. This was done so that we may have a channel on anaconda cloud without relying on `conda-forge`. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard `transformers` installation.
This includes the slow versions of:
- `XLNetTokenizer`
- `AlbertTokenizer`
- `CamembertTokenizer`
- `MBartTokenizer`
- `PegasusTokenizer`
- `T5Tokenizer`
- `ReformerTokenizer`
- `XLMRobertaTokenizer`
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should additionally install `sentencepiece`:

In version v3.x:

```bash
pip install transformers
```

to obtain the same in version v4.x:

```bash
pip install transformers[sentencepiece]
```

or

```bash
pip install transformers sentencepiece
```
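With `sentencepiece` installed, the slow tokenizers listed above work again; a minimal check, using `T5Tokenizer` as an example:

```python
# Requires `pip install transformers[sentencepiece]`; without the sentencepiece
# package, instantiating this slow tokenizer raises an ImportError in v4.x.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Hello world"))
```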
3. The architecture of the repo has been updated so that each model resides in its own folder
The past and foreseeable addition of new models means that the number of files in the directory `src/transformers` keeps growing, making the directory harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.
This is a breaking change, as importing intermediary layers directly from a model's module now needs to be done via a different path.
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.

In version v3.x:

```python
from transformers.modeling_bert import BertLayer
```

to obtain the same in version v4.x:

```python
from transformers.models.bert.modeling_bert import BertLayer
```
4. Switching the `return_dict` argument to `True` by default
The `return_dict` argument enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented, as keys can be used to retrieve values, while it also behaves like a tuple, since users may retrieve objects by index or by slice.
This is a breaking change because, unlike a tuple, this object cannot be unpacked directly: `value0, value1 = outputs` will not work.
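A minimal sketch of how the new output object can be read (the attribute names follow `BertModel`'s documented outputs):

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
inputs = tokenizer("Hello world", return_tensors="pt")

outputs = model(**inputs)               # dict-like ModelOutput by default in v4.x
hidden = outputs["last_hidden_state"]   # access by key...
hidden = outputs.last_hidden_state      # ...by attribute...
hidden = outputs[0]                     # ...or by index, as with the old tuples
```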
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should set the `return_dict` argument to `False`, either in the model configuration or during the forward pass.

In version v3.x:

```python
model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs)
```

to obtain the same in version v4.x:

```python
model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs, return_dict=False)
```

or

```python
model = BertModel.from_pretrained("bert-base-cased", return_dict=False)
outputs = model(**inputs)
```
5. Removed some deprecated attributes
Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.
Here is a list of these attributes/methods/arguments and what their replacements should be:
In several models, the labels become consistent with the other models (a short sketch follows this list):
- `masked_lm_labels` becomes `labels` in `AlbertForMaskedLM` and `AlbertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `BertForMaskedLM` and `BertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `DistilBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `ElectraForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `LongformerForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `MobileBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `RobertaForMaskedLM`.
- `lm_labels` becomes `labels` in `BartForConditionalGeneration`.
- `lm_labels` becomes `labels` in `GPT2DoubleHeadsModel`.
- `lm_labels` becomes `labels` in `OpenAIGPTDoubleHeadsModel`.
- `lm_labels` becomes `labels` in `T5ForConditionalGeneration`.
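For instance, a masked-language-modeling call is renamed as follows (a minimal sketch; `input_ids` and `labels` are assumed to be already-prepared tensors):

```python
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-cased")

# v3.x:
# outputs = model(input_ids, masked_lm_labels=labels)
# v4.x, with the unified argument name:
outputs = model(input_ids, labels=labels)
```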
In several models, the caching mechanism becomes consistent with the other models (a short sketch follows this list):
- `decoder_cached_states` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `decoder_past_key_values` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `past` becomes `past_key_values` in all CTRL models.
- `past` becomes `past_key_values` in all GPT-2 models.
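A minimal sketch of the rename in a manual GPT-2 decoding step (`return_dict=False` is used so the tuple shapes mirror v3.x; `input_ids` and `next_token_ids` are assumed to be prepared tensors):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# v3.x:
# logits, past = model(input_ids, past=past)
# v4.x:
logits, past_key_values = model(input_ids, return_dict=False)
logits, past_key_values = model(
    next_token_ids, past_key_values=past_key_values, return_dict=False
)
```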
Regarding the tokenizer classes (a short sketch follows this list):
- The tokenizer attribute `max_len` becomes `model_max_length`.
- The tokenizer attribute `return_lengths` becomes `return_length`.
- The tokenizer encoding argument `is_pretokenized` becomes `is_split_into_words`.
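A minimal sketch of the tokenizer renames:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

tokenizer.model_max_length  # was tokenizer.max_len in v3.x
# was tokenizer(..., is_pretokenized=True) in v3.x:
encoding = tokenizer(["Hello", "world"], is_split_into_words=True)
```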
Regarding the `Trainer` class (a sketch of the `tb_writer` change follows this list):
- The `Trainer` argument `tb_writer` is removed in favor of the callback `TensorBoardCallback(tb_writer=...)`.
- The `Trainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `Trainer` attribute `data_collator` should be a callable.
- The `Trainer` method `_log` is deprecated in favor of `log`.
- The `Trainer` method `_training_step` is deprecated in favor of `training_step`.
- The `Trainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `Trainer` method `is_local_master` is deprecated in favor of `is_local_process_zero`.
- The `Trainer` method `is_world_master` is deprecated in favor of `is_world_process_zero`.
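A minimal sketch of moving a TensorBoard writer to the callback (`model`, `args`, and `tb_writer` are assumed to exist already):

```python
from transformers import Trainer
from transformers.integrations import TensorBoardCallback

# v3.x:
# trainer = Trainer(model=model, args=args, tb_writer=tb_writer)
# v4.x, via the callbacks argument:
trainer = Trainer(
    model=model, args=args, callbacks=[TensorBoardCallback(tb_writer=tb_writer)]
)
```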
Regarding the `TFTrainer` class:
- The `TFTrainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `TFTrainer` method `_log` is deprecated in favor of `log`.
- The `TFTrainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `TFTrainer` method `_setup_wandb` is deprecated in favor of `setup_wandb`.
- The `TFTrainer` method `_run_model` is deprecated in favor of `run_model`.
Regarding the `TrainingArguments` class (a short sketch follows this list):
- The `TrainingArguments` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`.
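A minimal sketch of the rename (the output directory name is arbitrary):

```python
from transformers import TrainingArguments

# v3.x:
# args = TrainingArguments(output_dir="out", evaluate_during_training=True)
# v4.x:
args = TrainingArguments(output_dir="out", evaluation_strategy="steps")
```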
Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute `tie_weight` becomes `tie_words_embeddings`.
- The Transfo-XL modeling method `reset_length` becomes `reset_memory_length`.
Regarding pipelines (a short sketch follows this list):
- The `FillMaskPipeline` argument `topk` becomes `top_k`.
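A minimal sketch of the pipeline rename (the model is pinned only for illustration):

```python
from transformers import pipeline

# v3.x: pipeline("fill-mask", topk=3)
# v4.x:
fill_mask = pipeline("fill-mask", model="bert-base-cased", top_k=3)
print(fill_mask("Paris is the [MASK] of France."))
```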
Migrating from pytorch-transformers to 🤗 Transformers
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to 🤗 Transformers.
Positional order of some models' keyword inputs (`attention_mask`, `token_type_ids`...) changed
To be able to use Torchscript (see #1010, #1204 and #1195), the specific order of some models' keyword inputs (`attention_mask`, `token_type_ids`...) has been changed.
If you used to call the models with keyword names for keyword arguments, e.g. `model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.
If you used to call the models with positional inputs for keyword arguments, e.g. `model(input_ids, attention_mask, token_type_ids)`, you may have to double-check the exact order of input arguments.
Migrating from pytorch-pretrained-bert
Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to 🤗 Transformers.
Models always output tuples
The main breaking change when migrating from `pytorch-pretrained-bert` to 🤗 Transformers is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
The exact content of the tuples for each model is detailed in the models' docstrings and the documentation.
In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
Here is a `pytorch-pretrained-bert` to 🤗 Transformers conversion example for a `BertForSequenceClassification` classification model:
```python
from transformers import BertForSequenceClassification

# Let's load our model (input_ids and labels are assumed to be prepared tensors)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in 🤗 Transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

# In 🤗 Transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them
# (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
```
Serialization
Breaking change in the `from_pretrained()` method:

- Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
- The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be passed directly to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute first, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are forwarded directly to the model `__init__()` method, while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update said attributes, and (ii) which don't match any configuration class attribute are forwarded to the model `__init__()` method (a short sketch follows this list).
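A minimal sketch of the `**kwargs` dispatch described above (the attribute names follow BERT's configuration):

```python
from transformers import BertForSequenceClassification

# num_labels and output_attentions match configuration attributes, so they
# update the model config instead of being forwarded to the model __init__().
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4,
    output_attentions=True,
)
print(model.config.num_labels)  # 4
```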
Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
Here is an example:
### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"])
model.resize_token_embeddings(len(tokenizer))
# Train our model
train(model)
### Now let's save our model and tokenizer to a directory
model.save_pretrained("./my_saved_model_directory/")
tokenizer.save_pretrained("./my_saved_model_directory/")
### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/")
tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/")
Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer, which has a few differences:
- it only implements the weight decay correction,
- schedules are now external (see below),
- gradient clipping is now also external (see below).

The new `AdamW` optimizer matches the PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.
The schedules are now standard PyTorch learning rate schedulers and not part of the optimizer anymore.
Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule:
```python
import torch
from transformers import AdamW, get_linear_schedule_with_warmup

# (BertAdam comes from the old pytorch-pretrained-bert package)

# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_training_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

### Previously, the BertAdam optimizer was instantiated like this:
optimizer = BertAdam(
    model.parameters(),
    lr=lr,
    schedule="warmup_linear",
    warmup=warmup_proportion,
    num_training_steps=num_training_steps,
)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(
    model.parameters(), lr=lr, correct_bias=False
)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)  # PyTorch scheduler

### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_grad_norm
    )  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
```