DISCLAIMER: If you see something strange, file a GitHub issue.


The Blender chatbot model was proposed in "Recipes for building an open-domain chatbot" by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston on 30 Apr 2020.

The abstract of the paper is the following:

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

The authors’ code can be found here.

Implementation Notes


Here is an example of model usage:

>>> from transformers import BlenderbotSmallTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-90M'
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
>>> reply_ids = model.generate(**inputs)
>>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in reply_ids])

Here is how you can check out config values:

>>> from transformers import BlenderbotConfig
>>> config_90 = BlenderbotConfig.from_pretrained("facebook/blenderbot-90M")
>>> config_90.to_diff_dict()  # show only the values that differ from the defaults
>>> configuration_3B = BlenderbotConfig.from_pretrained("facebook/blenderbot-3B")
>>> configuration_3B.to_diff_dict()
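The idea behind to_diff_dict() is that attributes still holding their default values are dropped, so only the interesting settings remain. The following is a minimal stand-alone sketch of that idea, not the library's implementation: diff_dict is a hypothetical helper, the default values are copied from the BlenderbotConfig signature in this document, and the custom values are invented for illustration.

```python
from typing import Any, Dict


def diff_dict(config: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only entries that are missing from `defaults` or differ from them.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    return {
        key: value
        for key, value in config.items()
        if key not in defaults or defaults[key] != value
    }


# Defaults copied from the BlenderbotConfig signature in this document.
defaults = {"d_model": 512, "encoder_layers": 8, "decoder_layers": 8, "dropout": 0.1}

# Invented custom configuration that overrides two of the defaults.
custom = {**defaults, "d_model": 1024, "dropout": 0.0}

changed = diff_dict(custom, defaults)
print(changed)  # only d_model and dropout remain
```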


class transformers.BlenderbotConfig(activation_dropout=0.0, extra_pos_embeddings=0, activation_function='gelu', vocab_size=54944, d_model=512, encoder_ffn_dim=2048, encoder_layers=8, encoder_attention_heads=16, decoder_ffn_dim=2048, decoder_layers=8, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=512, classifier_dropout=0.0, is_encoder_decoder=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, normalize_before=False, add_final_layer_norm=False, do_blenderbot_90_layernorm=True, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, force_bos_token_to_be_generated=False, **common_kwargs)[source]

This is the configuration class to store the configuration of a BlenderbotForConditionalGeneration. It inherits from BartConfig and has the same signature with different defaults.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

  • vocab_size (int, optional, defaults to 54944) – Vocabulary size of the Blenderbot model. Defines the number of different tokens that can be represented by the input_ids passed when calling BlenderbotForConditionalGeneration.

  • d_model (int, optional, defaults to 512) – Dimensionality of the layers and the pooler layer.

  • encoder_layers (int, optional, defaults to 8) – Number of encoder layers, 6 are used for the blenderbot-90M model.

  • decoder_layers (int, optional, defaults to 8) – Number of decoder layers, 6 are used for the blenderbot-90M model.

  • encoder_attention_heads (int, optional, defaults to 16) – Number of attention heads for each attention layer in the Transformer encoder.

  • decoder_attention_heads (int, optional, defaults to 16) – Number of attention heads for each attention layer in the Transformer decoder.

  • decoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.

  • encoder_ffn_dim (int, optional, defaults to 2048) – Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.

  • activation_function (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

  • dropout (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • activation_dropout (float, optional, defaults to 0.0) – The dropout ratio for activations inside the fully connected layer.

  • classifier_dropout (float, optional, defaults to 0.0) – The dropout ratio for classifier.

  • max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • init_std (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • add_bias_logits (bool, optional, defaults to False) – This should be completed, specific to marian.

  • normalize_before (bool, optional, defaults to False) – Call layernorm before attention ops.

  • normalize_embedding (bool, optional, defaults to True) – Call layernorm after embeddings.

  • static_position_embeddings (bool, optional, defaults to False) – Don’t learn positional embeddings, use sinusoidal.

  • add_final_layer_norm (bool, optional, defaults to False) – Why not add another layernorm?

  • do_blenderbot_90_layernorm (bool, optional, defaults to True) – Blenderbot-90m checkpoint uses layernorm_embedding one line earlier in the decoder.

  • scale_embedding (bool, optional, defaults to False) – Scale embeddings by multiplying by sqrt(d_model).

  • eos_token_id (int, optional, defaults to 2) – End of stream token id.

  • pad_token_id (int, optional, defaults to 1) – Padding token id.

  • bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.

  • encoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the encoder. See the LayerDrop paper for more details.

  • decoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the decoder. See the LayerDrop paper for more details.

  • extra_pos_embeddings (int, optional, defaults to 2) – How many extra learned positional embeddings to use. Should be set to pad_token_id+1.

  • is_encoder_decoder (bool, optional, defaults to True) – Whether this is an encoder/decoder model.

  • force_bos_token_to_be_generated (bool, optional, defaults to False) – Whether or not to force BOS token to be generated at step 1 (after decoder_start_token_id).
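To make the encoder_layerdrop and decoder_layerdrop parameters more concrete, here is a minimal sketch of LayerDrop. It assumes the common formulation in which each layer is skipped independently with the given probability during training and never skipped at inference. The run_layers function and the toy layers are hypothetical stand-ins, not the library's code.

```python
import random
from typing import Callable, List, Optional


def run_layers(
    x: float,
    layers: List[Callable[[float], float]],
    layerdrop: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply `layers` in order, skipping each with probability `layerdrop` while training."""
    rng = rng or random.Random()
    for layer in layers:
        # Assumption: layers are only dropped in training mode.
        if training and rng.random() < layerdrop:
            continue  # skip this layer entirely for this forward pass
        x = layer(x)
    return x


# Eight toy "layers" that each add 1, mirroring the default of 8 encoder layers.
toy_layers = [lambda v: v + 1 for _ in range(8)]

print(run_layers(0, toy_layers, layerdrop=0.0, training=True))   # every layer runs
print(run_layers(0, toy_layers, layerdrop=0.5, training=False))  # eval mode: every layer runs
```

With the default of 0.0, no layer is ever skipped, so the setting only matters when it is raised above zero for training.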


class transformers.BlenderbotTokenizer(vocab_file, merges_file, errors='replace', bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', add_prefix_space=False, **kwargs)[source]

Construct a Blenderbot tokenizer.

BlenderbotTokenizer is nearly identical to RobertaTokenizer and runs end-to-end tokenization: punctuation splitting and wordpiece. The only difference is that it does not add a BOS token to the beginning of sequences.

Refer to superclass RobertaTokenizer for usage examples and documentation concerning parameters.

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] = None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A Blenderbot sequence has the following format:

  • single sequence: X </s>

  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added

  • token_ids_1 (List[int], optional) – Will be ignored


Returns

List of input IDs with the appropriate special tokens.

Return type

List[int]

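The format described above (no BOS token, a single </s> appended, second sequence ignored) can be sketched in plain Python. This is a stand-alone illustration rather than the tokenizer's actual source, and the EOS id of 2 is an assumption taken from the eos_token_id default in BlenderbotConfig.

```python
from typing import List, Optional

EOS_TOKEN_ID = 2  # assumption: matches the eos_token_id default shown in BlenderbotConfig


def build_inputs_with_special_tokens(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    eos_token_id: int = EOS_TOKEN_ID,
) -> List[int]:
    """Return `X </s>`: the input IDs followed by the end-of-sequence ID."""
    # token_ids_1 is accepted to mirror the documented signature but is ignored.
    return token_ids_0 + [eos_token_id]


print(build_inputs_with_special_tokens([10, 11, 12]))
```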

class transformers.BlenderbotSmallTokenizer(vocab_file, merges_file, bos_token='__start__', eos_token='__end__', unk_token='__unk__', pad_token='__null__', **kwargs)[source]

Constructs a Blenderbot-90M tokenizer based on BPE (Byte-Pair-Encoding)

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to the superclass for more information regarding methods.

  • vocab_file (str) – File containing the vocabulary.

  • merges_file (str) – Path to the merges file.

  • bos_token (str, optional, defaults to "__start__") – The beginning of sentence token.

  • eos_token (str, optional, defaults to "__end__") – The end of sentence token.

  • unk_token (str, optional, defaults to "__unk__") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "__null__") – The token used for padding, for example when batching sequences of different lengths.

  • **kwargs – Additional keyword arguments passed along to PreTrainedTokenizer

convert_tokens_to_string(tokens: List[str]) → str[source]

Converts a sequence of tokens into a single string.

get_vocab() → Dict[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.


Returns

The vocabulary.

Return type

Dict[str, int]
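The equivalence between get_vocab() and convert_tokens_to_ids() can be illustrated with a toy tokenizer. ToyTokenizer and its four-entry vocabulary are hypothetical stand-ins invented for this sketch; only the special token names are taken from the BlenderbotSmallTokenizer signature.

```python
from typing import Dict


class ToyTokenizer:
    """Hypothetical stand-in that mimics the two documented methods."""

    def __init__(self, vocab: Dict[str, int], unk_token: str = "__unk__") -> None:
        self._vocab = dict(vocab)
        self._unk_token = unk_token

    def get_vocab(self) -> Dict[str, int]:
        # Return a copy so callers cannot mutate the tokenizer's state.
        return dict(self._vocab)

    def convert_tokens_to_ids(self, token: str) -> int:
        # Unknown tokens fall back to the ID of the unknown token.
        return self._vocab.get(token, self._vocab[self._unk_token])


toy = ToyTokenizer({"__null__": 0, "__unk__": 1, "hello": 2, "world": 3})

# For an in-vocabulary token, both lookups agree.
print(toy.get_vocab()["hello"] == toy.convert_tokens_to_ids("hello"))
```

The equivalence only holds for tokens that are in the vocabulary: a dictionary lookup on a missing token raises KeyError, while the conversion method falls back to the unknown token.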

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use _save_pretrained() to save the whole state of the tokenizer.

  • save_directory (str) – The directory in which to save the vocabulary.

  • filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.


Returns

Paths to the files saved.

Return type

Tuple[str]

property vocab_size

Size of the base vocabulary (without the added tokens).




See transformers.BartForConditionalGeneration for arguments to forward and generate

class transformers.BlenderbotForConditionalGeneration(config: transformers.configuration_bart.BartConfig)[source]

The BART Model with a language modeling head. Can be used for summarization.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

This class overrides BartForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.

adjust_logits_during_generation(logits, cur_len, max_length)[source]

Implement in subclasses of PreTrainedModel for custom behavior to adjust the logits in the generate method.


alias of transformers.configuration_blenderbot.BlenderbotConfig


See transformers.TFBartForConditionalGeneration for arguments to forward and generate

class transformers.TFBlenderbotForConditionalGeneration(*args, **kwargs)[source]

Blenderbot model for open domain dialogue

This model inherits from TFBartForConditionalGeneration. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.


TF 2.0 models accept two formats as inputs:

  • having all inputs as keyword arguments (like PyTorch models), or

  • having all inputs as a list, tuple or dict in the first positional arguments.

This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

  • a single Tensor with input_ids only and nothing else: model(inputs_ids)

  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

  • a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
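The three possibilities above can be pictured as one normalization step that maps whatever arrives in the first positional argument to named inputs. The sketch below is purely illustrative and does not reproduce the library's input handling: gather_inputs, FakeTensor, and INPUT_NAMES are hypothetical, and the name order is taken from the model([input_ids, attention_mask, token_type_ids]) example in the list.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Assumed docstring order, taken from the list example above.
INPUT_NAMES = ("input_ids", "attention_mask", "token_type_ids")


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor."""

    values: List[int]


def gather_inputs(inputs: Any) -> Dict[str, Any]:
    """Map a tensor, a list/tuple of tensors, or a dict of tensors to named inputs."""
    if isinstance(inputs, dict):
        return dict(inputs)  # names are given explicitly
    if isinstance(inputs, (list, tuple)):
        if len(inputs) > len(INPUT_NAMES):
            raise ValueError("more tensors than known input names")
        return dict(zip(INPUT_NAMES, inputs))  # names are assigned by position
    return {"input_ids": inputs}  # a single tensor is treated as input_ids


ids = FakeTensor([5, 6, 2])
mask = FakeTensor([1, 1, 1])

print(gather_inputs(ids))
print(gather_inputs([ids, mask]))
print(gather_inputs({"input_ids": ids, "attention_mask": mask}))
```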


config (BlenderbotConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

adjust_logits_during_generation(logits, cur_len, max_length)[source]

Never predict pad_token_id. Predict </s> when max_length is reached.
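The rule stated above can be sketched on a plain list of scores for a single sequence. This is a minimal illustration under stated assumptions, not the model's TensorFlow code: adjust_logits is a hypothetical helper, the pad and EOS ids are the BlenderbotConfig defaults (1 and 2), and "max_length is reached" is assumed to mean the final generation step, cur_len == max_length - 1.

```python
import math
from typing import List


def adjust_logits(
    logits: List[float],
    cur_len: int,
    max_length: int,
    pad_token_id: int = 1,  # assumption: BlenderbotConfig default
    eos_token_id: int = 2,  # assumption: BlenderbotConfig default
) -> List[float]:
    """Mask the pad token, and on the last step mask everything except EOS."""
    adjusted = list(logits)  # copy so the caller's scores are left untouched
    adjusted[pad_token_id] = -math.inf  # never predict the padding token
    if cur_len == max_length - 1:
        # Final step: only the end-of-sequence token keeps a finite score.
        adjusted = [
            score if index == eos_token_id else -math.inf
            for index, score in enumerate(adjusted)
        ]
    return adjusted


scores = [0.5, 0.1, 0.2, 0.9]
print(adjust_logits(scores, cur_len=3, max_length=10))  # only the pad score is masked
print(adjust_logits(scores, cur_len=9, max_length=10))  # only the EOS score survives
```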


alias of transformers.configuration_blenderbot.BlenderbotConfig