PARALLELIZE_DOCSTRING = r"""
    This is an experimental feature and is subject to change at a moment's notice.

    Uses a device map to distribute attention modules of the model across several devices. If no device map is given,
    it will evenly distribute blocks across all devices.

    Args:
        device_map (`Dict[int, list]`, *optional*, defaults to `None`):
            A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always
            automatically mapped to the first device (for esoteric reasons). That means that the first device should
            have fewer attention modules mapped to it than other devices. For reference, the mt5 models have the
            following number of attention modules:

                - mt5-small: 6
                - mt5-base: 12
                - mt5-large: 24
                - mt5-xl: 24
                - mt5-xxl: 24

    Example:

    ```python
    # Here is an example of a device map on a machine with 4 GPUs using mt5-xl, which has a total of 24 attention modules:
    model = MT5ForConditionalGeneration.from_pretrained("mt5-xl")
    device_map = {
        0: [0, 1, 2],
        1: [3, 4, 5, 6, 7, 8, 9],
        2: [10, 11, 12, 13, 14, 15, 16],
        3: [17, 18, 19, 20, 21, 22, 23],
    }
    model.parallelize(device_map)
    ```
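
    If no `device_map` is passed, the attention modules are spread evenly across all visible devices, as described
    above. A minimal sketch of that default behavior:

    ```python
    model = MT5ForConditionalGeneration.from_pretrained("mt5-xl")
    model.parallelize()  # no device map given: blocks are distributed evenly across the available GPUs
    ```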
""" | |
DEPARALLELIZE_DOCSTRING = r"""
    Moves the model to cpu from a model parallel state.

    Example:

    ```python
    # On a 4 GPU machine with mt5-xl:
    model = MT5ForConditionalGeneration.from_pretrained("mt5-xl")
    device_map = {
        0: [0, 1, 2],
        1: [3, 4, 5, 6, 7, 8, 9],
        2: [10, 11, 12, 13, 14, 15, 16],
        3: [17, 18, 19, 20, 21, 22, 23],
    }
    model.parallelize(device_map)  # Splits the model across several devices
    model.deparallelize()  # Puts the model back on cpu and cleans memory by calling torch.cuda.empty_cache()
    ```
"""
MT5_START_DOCSTRING = r"""
    The MT5 model was proposed in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text
    Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan
    Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder-decoder transformer pre-trained in a
    text-to-text denoising generative setting.

    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning
    heads, etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
    and behavior.

    Parameters:
        config ([`MT5Config`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
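
    A minimal sketch of the distinction between initializing from a configuration and loading pretrained weights (the
    `"google/mt5-small"` checkpoint name is only an illustrative choice):

    ```python
    from transformers import MT5Config, MT5ForConditionalGeneration

    config = MT5Config()
    model = MT5ForConditionalGeneration(config)  # randomly initialized weights, configuration only

    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")  # loads the pretrained weights
    ```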
""" | |
MT5_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. MT5 is a model with relative position embeddings so
            you should be able to pad the inputs on both the right and the left.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)

            To know more on how to prepare `input_ids` for pretraining, take a look at [MT5 Training](./mt5#training).
        attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
            Indices of decoder input sequence tokens in the vocabulary.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are decoder input IDs?](../glossary#decoder-input-ids)

            MT5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values`
            is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`).

            To know more on how to prepare `decoder_input_ids` for pretraining, take a look at [MT5
            Training](./mt5#training).
        decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
            Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
            be used by default.
        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
            Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in
            `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        decoder_head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
            Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in
            `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        cross_attn_head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
            Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in
            `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        encoder_outputs (`tuple(tuple(torch.FloatTensor))`, *optional*):
            Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`).
            `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` is a sequence of hidden states
            at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
        past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
            input (see `past_key_values`). This is useful if you want more control over how to convert
            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.

            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
            of `inputs_embeds`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
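
    Example (a minimal usage sketch tying these arguments together; the checkpoint name and example sentences are
    illustrative assumptions, and `labels` belongs to the conditional-generation head rather than the list above):

    ```python
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

    article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
    summary = "Weiter Verhandlung in Syrien."
    inputs = tokenizer(article, text_target=summary, return_tensors="pt")

    # `labels` come from the tokenized target; `decoder_input_ids` are then created internally by shifting the
    # labels to the right, starting from `pad_token_id` as described above.
    outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=inputs.labels)
    loss = outputs.loss
    logits = outputs.logits
    ```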
""" | |
MT5_ENCODER_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. MT5 is a model with relative position embeddings so
            you should be able to pad the inputs on both the right and the left.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            To know more on how to prepare `input_ids` for pretraining, take a look at [MT5 Training](./mt5#training).
        attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
            Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
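
    Example (a minimal encoder-only sketch; the checkpoint name and input sentence are illustrative assumptions):

    ```python
    from transformers import AutoTokenizer, MT5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = MT5EncoderModel.from_pretrained("google/mt5-small")

    inputs = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt")
    outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
    last_hidden_state = outputs.last_hidden_state  # shape (batch_size, sequence_length, hidden_size)
    ```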
""" | |
# Warning message for FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
__HEAD_MASK_WARNING_MSG = """
The input argument `head_mask` was split into two arguments `head_mask` and `decoder_head_mask`. Currently,
`decoder_head_mask` is set to copy `head_mask`, but this feature is deprecated and will be removed in future versions.
If you do not want to use any `decoder_head_mask` now, please set `decoder_head_mask = torch.ones(num_layers,
num_heads)`.
"""