ByT5
Overview
The ByT5 model was presented in ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
The abstract from the paper is the following:
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
This model was contributed by patrickvonplaten. The original code can be found here.
ByT5’s architecture is based on the T5v1.1 model, so one can refer to T5v1.1’s documentation page. They only differ in how inputs should be prepared for the model, see the code examples below.
Since ByT5 was pre-trained unsupervisedly, there’s no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
Example
ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
from transformers import T5ForConditionalGeneration
import torch
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3 # add 3 for special tokens
labels = (
torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3
) # add 3 for special tokens
loss = model(input_ids, labels=labels).loss # forward pass
For batched inference and training it is however recommended to make use of the tokenizer:
from transformers import T5ForConditionalGeneration, AutoTokenizer
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model_inputs = tokenizer(
["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
)
labels = tokenizer(
["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
).input_ids
loss = model(**model_inputs, labels=labels).loss # forward pass
ByT5Tokenizer
( eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' extra_ids = 125 additional_special_tokens = None **kwargs )
Parameters
-
eos_token (
str
, optional, defaults to"</s>"
) — The end of sequence token.When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the
sep_token
. -
unk_token (
str
, optional, defaults to"<unk>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. -
pad_token (
str
, optional, defaults to"<pad>"
) — The token used for padding, for example when batching sequences of different lengths. -
extra_ids (
int
, optional, defaults to 100) — Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as “id{%d}>” where ”{%d}” is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to beginning (“ ” is the last token in the vocabulary like in ByT5 preprocessing see here). -
additional_special_tokens (
List[str]
, optional) — Additional special tokens used by the tokenizer.
Construct a ByT5 tokenizer. ByT5 simply uses raw bytes utf-8 encoding.
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
(
token_ids_0: typing.List[int]
token_ids_1: typing.Optional[typing.List[int]] = None
)
→
List[int]
Parameters
-
token_ids_0 (
List[int]
) — List of IDs to which the special tokens will be added. -
token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:
- single sequence:
X </s>
- pair of sequences:
A </s> B </s>
Converts a sequence of tokens (string) in a single string.
(
token_ids_0: typing.List[int]
token_ids_1: typing.Optional[typing.List[int]] = None
)
→
List[int]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not make use of token type ids, therefore a list of zeros is returned.
(
token_ids_0: typing.List[int]
token_ids_1: typing.Optional[typing.List[int]] = None
already_has_special_tokens: bool = False
)
→
List[int]
Parameters
-
token_ids_0 (
List[int]
) — List of IDs. -
token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs. -
already_has_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
method.
See ByT5Tokenizer for all details.