CTRL

Note: if you fine-tune a CTRL model with the original Salesforce code (https://github.com/salesforce/ctrl), you can convert the resulting TF checkpoint to the HuggingFace/Transformers format using the convert_tf_to_huggingface_pytorch.py script (see issue #1654).

CTRLConfig

class transformers.CTRLConfig(vocab_size_or_config_json_file=246534, n_positions=256, n_ctx=256, n_embd=1280, dff=8192, n_layer=48, n_head=16, resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1, layer_norm_epsilon=1e-06, initializer_range=0.02, num_labels=1, summary_type='cls_index', summary_use_proj=True, summary_activation=None, summary_proj_to_labels=True, summary_first_dropout=0.1, **kwargs)[source]

Configuration class to store the configuration of a CTRLModel.

Parameters
  • vocab_size_or_config_json_file – Vocabulary size of input_ids in CTRLModel, or the path to a configuration JSON file.

  • n_positions – Number of positional embeddings.

  • n_ctx – Size of the causal mask (usually same as n_positions).

  • dff – Size of the inner dimension of the FFN.

  • n_embd – Dimensionality of the embeddings and hidden states.

  • n_layer – Number of hidden layers in the Transformer encoder.

  • n_head – Number of attention heads for each attention layer in the Transformer encoder.

  • layer_norm_epsilon – The epsilon to use in the layer normalization layers.

  • resid_pdrop – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • attn_pdrop – The dropout ratio for the attention probabilities.

  • embd_pdrop – The dropout ratio for the embeddings.

  • initializer_range – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
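
A minimal usage sketch of the configuration class. The default arguments correspond to the released ctrl checkpoint; the smaller hyper-parameter values below are arbitrary choices for a quick local test, not the released sizes:

from transformers import CTRLConfig, CTRLModel

# Default arguments reproduce the configuration of the released 'ctrl' checkpoint.
config = CTRLConfig()

# Hyper-parameters can be overridden; these small values are only for a quick test.
tiny_config = CTRLConfig(n_layer=2, n_head=4, n_embd=128, dff=512, n_positions=64, n_ctx=64)
model = CTRLModel(tiny_config)  # randomly initialized; no pre-trained weights are loaded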

CTRLTokenizer

class transformers.CTRLTokenizer(vocab_file, merges_file, unk_token='<unk>', **kwargs)[source]
CTRL BPE tokenizer. Peculiarities:
  • Byte-Pair-Encoding

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.

save_vocabulary(save_directory)[source]

Save the tokenizer vocabulary and merge files to a directory.

property vocab_size

Size of the base vocabulary (without the added tokens)
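
A minimal sketch of the tokenizer API documented above (assumes the 'ctrl' vocabulary files can be downloaded; the ./ctrl-vocab output directory is a hypothetical path that must already exist):

from transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.vocab_size)                        # size of the base vocabulary

tokens = tokenizer.tokenize("Links Hello world")   # BPE tokens; 'Links' is a CTRL control code
ids = tokenizer.convert_tokens_to_ids(tokens)
text = tokenizer.convert_tokens_to_string(tokens)  # back to a single string

vocab_file, merges_file = tokenizer.save_vocabulary('./ctrl-vocab')  # directory must exist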

CTRLModel

class transformers.CTRLModel(config)[source]

The bare CTRL Model transformer outputting raw hidden-states without any specific head on top. CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (CTRLConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Inputs:
input_ids: torch.LongTensor of shape (batch_size, sequence_length):

Indices of input sequence tokens in the vocabulary. CTRL is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left. Indices can be obtained using transformers.CTRLTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

past:

list of torch.FloatTensor (one for each layer) that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding (see the sketch after the usage example below). The token ids which have their past given to this model should not be passed as input ids, as they have already been computed.

attention_mask: (optional) torch.FloatTensor of shape (batch_size, sequence_length):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

token_type_ids: (optional) torch.LongTensor of shape (batch_size, sequence_length):

A parallel sequence of tokens (can be used to indicate various portions of the inputs). The embeddings from these tokens will be summed with the respective token embeddings. Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).

position_ids: (optional) torch.LongTensor of shape (batch_size, sequence_length):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

head_mask: (optional) torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads):

Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

inputs_embeds: (optional) torch.FloatTensor of shape (batch_size, sequence_length, embedding_dim):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the last layer of the model.

past:

list of torch.FloatTensor (one for each layer) of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) that contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids, as they have already been computed.

hidden_states: (optional, returned when config.output_hidden_states=True)

list of torch.FloatTensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size): Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True)

list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Examples:

import torch
from transformers import CTRLTokenizer, CTRLModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = CTRLModel.from_pretrained('ctrl')
input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
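
As mentioned for the past input above, cached key/value states can be fed back in so that only new tokens are processed. A minimal sketch (the exact tuple layout of past may vary between library versions):

import torch
from transformers import CTRLTokenizer, CTRLModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = CTRLModel.from_pretrained('ctrl')

context = torch.tensor([tokenizer.encode("Links Hello, my dog is")])
hidden, past = model(context)[:2]                # full pass over the context; `past` caches keys/values

next_token = torch.tensor([[tokenizer.encode("cute")[0]]])
hidden, past = model(next_token, past=past)[:2]  # only the new token is passed as input ids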
forward(input_ids=None, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_embeddings()[source]

Get model’s input embeddings

set_input_embeddings(new_embeddings)[source]

Set model’s input embeddings
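
A brief sketch of the two embedding accessors above (the tiny random configuration is only illustrative):

from transformers import CTRLConfig, CTRLModel

model = CTRLModel(CTRLConfig(n_layer=2, n_head=2, n_embd=64, dff=128))  # tiny random model
embeddings = model.get_input_embeddings()    # the token embedding module (a torch.nn.Embedding)
print(embeddings.weight.shape)               # (vocab_size, n_embd)
model.set_input_embeddings(embeddings)       # pass a compatible replacement module to swap it out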

CTRLLMHeadModel

class transformers.CTRLLMHeadModel(config)[source]

The CTRL Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings). The CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config (CTRLConfig): Model configuration class with all the parameters of the model.

Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Inputs:
input_ids: torch.LongTensor of shape (batch_size, sequence_length):

Indices of input sequence tokens in the vocabulary. CTRL is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left. Indices can be obtained using transformers.CTRLTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

past:

list of torch.FloatTensor (one for each layer) that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids as they have already been computed.

attention_mask: (optional) torch.FloatTensor of shape (batch_size, sequence_length):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

token_type_ids: (optional) torch.LongTensor of shape (batch_size, sequence_length):

A parallel sequence of tokens (can be used to indicate various portions of the inputs). The embeddings from these tokens will be summed with the respective token embeddings. Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).

position_ids: (optional) torch.LongTensor of shape (batch_size, sequence_length):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

head_mask: (optional) torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads):

Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

inputs_embeds: (optional) torch.FloatTensor of shape (batch_size, sequence_length, embedding_dim):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

labels: (optional) torch.LongTensor of shape (batch_size, sequence_length):

Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can simply set labels = input_ids (see the sketch after the example below). Indices are selected in [-1, 0, ..., config.vocab_size]. All labels set to -1 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
loss: (optional, returned when labels is provided) torch.FloatTensor of shape (1,):

Language modeling loss.

prediction_scores: torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past:

list of torch.FloatTensor (one for each layer) of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) that contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids, as they have already been computed.

hidden_states: (optional, returned when config.output_hidden_states=True)

list of torch.FloatTensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size): Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True)

list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Examples:

import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = CTRLLMHeadModel.from_pretrained('ctrl')

input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=input_ids)
loss, logits = outputs[:2]
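
To make the note on label shifting concrete: the returned loss corresponds to a cross-entropy over logits and labels shifted by one position. A sketch continuing from the example above (the ignore index used internally may differ between library versions):

import torch.nn.functional as F

# `logits` and `input_ids` come from the example above.
shift_logits = logits[:, :-1, :].contiguous()    # predictions for positions 0 .. n-2
shift_labels = input_ids[:, 1:].contiguous()     # targets are the next tokens
manual_loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                              shift_labels.view(-1), ignore_index=-1)
# manual_loss should closely match the `loss` returned by the model above.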
forward(input_ids=None, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_output_embeddings()[source]

Get the model’s output embeddings. Returns None if the model doesn’t have output embeddings.
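
A brief sketch of this accessor, illustrating the weight tying mentioned in the class description (the tiny random configuration is only illustrative):

from transformers import CTRLConfig, CTRLLMHeadModel

model = CTRLLMHeadModel(CTRLConfig(n_layer=2, n_head=2, n_embd=64, dff=128))  # tiny random model
lm_head = model.get_output_embeddings()      # a torch.nn.Linear projecting hidden states to vocab logits
print(lm_head.weight.shape)                  # (vocab_size, n_embd)
print(lm_head.weight is model.get_input_embeddings().weight)  # True when weights are tied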

TFCTRLModel

class transformers.TFCTRLModel(config, *inputs, **kwargs)[source]

The bare CTRL Model transformer outputting raw hidden-states without any specific head on top. CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

This model is a TensorFlow tf.keras.Model sub-class. Use it as a regular TF 2.0 Keras Model and refer to the TensorFlow documentation for all matters related to general usage and behavior.

Parameters

config (CTRLConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Inputs:
input_ids: Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Indices of input sequence tokens in the vocabulary. CTRL is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left. Indices can be obtained using transformers.CTRLTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

past:

list of Numpy array or tf.Tensor (one for each layer) that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding.

attention_mask: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

token_type_ids: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

A parallel sequence of tokens (can be used to indicate various portions of the inputs). The embeddings from these tokens will be summed with the respective token embeddings. Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).

position_ids: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

head_mask: (optional) Numpy array or tf.Tensor of shape (num_heads,) or (num_layers, num_heads):

Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

inputs_embeds: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length, embedding_dim):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
last_hidden_state: tf.Tensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the last layer of the model.

past:

list of tf.Tensor (one for each layer) of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) that contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding.

hidden_states: (optional, returned when config.output_hidden_states=True)

list of tf.Tensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size): Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True)

list of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Examples:

import tensorflow as tf
from transformers import CTRLTokenizer, TFCTRLModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = TFCTRLModel.from_pretrained('ctrl')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
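
The cached past states can be fed back in the same way as for the PyTorch model. A minimal sketch (passing past as a keyword argument follows the generic TF 2.0 model call convention of this library and is an assumption here):

import tensorflow as tf
from transformers import CTRLTokenizer, TFCTRLModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = TFCTRLModel.from_pretrained('ctrl')

context = tf.constant([tokenizer.encode("Links Hello, my dog is")])
hidden, past = model(context)[:2]                  # hidden states plus cached key/value states

next_token = tf.constant([[tokenizer.encode("cute")[0]]])
hidden, past = model(next_token, past=past)[:2]    # only the new token is passed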
call(inputs, **kwargs)[source]

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there are more than one outputs.
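
The inputs argument of call() can take several equivalent forms. A minimal sketch (the dictionary and keyword-argument styles follow the generic TF 2.0 model conventions of this library and are assumptions here):

import tensorflow as tf
from transformers import CTRLTokenizer, TFCTRLModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = TFCTRLModel.from_pretrained('ctrl')

input_ids = tf.constant([tokenizer.encode("Links Hello, my dog is cute")])
attention_mask = tf.ones_like(input_ids)

out1 = model(input_ids)                                        # a single tensor: input_ids only
out2 = model({'input_ids': input_ids,
              'attention_mask': attention_mask})               # a dict keyed by input name
out3 = model(input_ids, attention_mask=attention_mask)         # input_ids plus keyword arguments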

TFCTRLLMHeadModel

class transformers.TFCTRLLMHeadModel(config, *inputs, **kwargs)[source]

The CTRL Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings). The CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

This model is a TensorFlow tf.keras.Model sub-class. Use it as a regular TF 2.0 Keras Model and refer to the TensorFlow documentation for all matters related to general usage and behavior.

Parameters:
config (CTRLConfig): Model configuration class with all the parameters of the model.

Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Inputs:
input_ids: Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Indices of input sequence tokens in the vocabulary. CTRL is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left. Indices can be obtained using transformers.CTRLTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

past:

list of Numpy array or tf.Tensor (one for each layer) that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see past output below). Can be used to speed up sequential decoding.

attention_mask: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

token_type_ids: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

A parallel sequence of tokens (can be used to indicate various portions of the inputs). The embeddings from these tokens will be summed with the respective token embeddings. Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).

position_ids: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

head_mask: (optional) Numpy array or tf.Tensor of shape (num_heads,) or (num_layers, num_heads):

Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

inputs_embeds: (optional) Numpy array or tf.Tensor of shape (batch_size, sequence_length, embedding_dim):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
prediction_scores: tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past:

list of tf.Tensor (one for each layer) of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) that contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see past input) to speed up sequential decoding.

hidden_states: (optional, returned when config.output_hidden_states=True)

list of tf.Tensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size): Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True)

list of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Examples:

import tensorflow as tf
from transformers import CTRLTokenizer, TFCTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
model = TFCTRLLMHeadModel.from_pretrained('ctrl')

input_ids = tf.constant(tokenizer.encode("Links Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
logits = outputs[0]  # prediction scores of the language modeling head (before SoftMax)
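
Since this head returns logits without computing a loss, a language-modeling loss can be computed outside the model by shifting logits and labels by one position. A sketch continuing from the example above:

import tensorflow as tf

# `logits` and `input_ids` come from the example above.
shift_logits = logits[:, :-1, :]     # predictions for positions 0 .. n-2
shift_labels = input_ids[:, 1:]      # targets are the next tokens
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss = loss_fn(shift_labels, shift_logits)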
call(inputs, **kwargs)[source]

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there are more than one outputs.

get_output_embeddings()[source]

Get the model’s output embeddings. Returns None if the model doesn’t have output embeddings.