ByT5

Overview

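The ByT5 model was presented in ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

The abstract from the paper is the following:

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

This model was contributed by patrickvonplaten. The original code can be found here.

ByT5's architecture is based on the T5v1.1 model. For the API reference, see the T5v1.1 documentation page. The two models differ only in how the model inputs are prepared; see the code examples below.

Because ByT5 was pre-trained in an unsupervised manner, there is no real advantage to using a task prefix during single-task fine-tuning. For multi-task fine-tuning, you should use a prefix.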
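
As a minimal sketch of the multi-task case, the snippet below prepends a task prefix to each input string before it is encoded. The prefix strings and the `add_task_prefix` helper are hypothetical and chosen only for illustration; any consistent, distinct prefix per task would serve the same purpose.

```python
# Hypothetical task prefixes; the exact strings are an assumption for illustration.
TASK_PREFIXES = {
    "translate": "translate English to French: ",
    "summarize": "summarize: ",
}


def add_task_prefix(task: str, text: str) -> str:
    """Prepend the prefix that tells a multi-task model which task to perform."""
    return TASK_PREFIXES[task] + text


examples = [
    ("translate", "Life is like a box of chocolates."),
    ("summarize", "ByT5 operates directly on UTF-8 bytes."),
]
prefixed = [add_task_prefix(task, text) for task, text in examples]
for line in prefixed:
    print(line)
```

The prefixed strings can then be encoded exactly like any other input, since the prefix is just additional bytes.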

Usage Examples

ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:

>>> from transformers import T5ForConditionalGeneration
>>> import torch

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

>>> num_special_tokens = 3
>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.

>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens

>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

>>> loss = model(input_ids, labels=labels).loss
>>> loss.item()
2.66
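
The shift by three described in the comments above can be isolated in plain Python. This is a sketch rather than the library's implementation: `text_to_ids` and `ids_to_text` are hypothetical helper names, and the mapping of ids 0, 1, 2 to `<pad>`, `</s>`, `<unk>` follows the tokenizer defaults listed further down this page.

```python
NUM_SPECIAL_TOKENS = 3  # ids 0, 1, 2 are reserved for <pad>, </s>, <unk>
NUM_BYTE_VALUES = 2**8  # a byte can take 256 distinct values


def text_to_ids(text: str) -> list[int]:
    """Map each UTF-8 byte of `text` to an input id (byte value + 3)."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")]


def ids_to_text(ids: list[int]) -> str:
    """Invert `text_to_ids`, skipping ids outside the byte range."""
    low, high = NUM_SPECIAL_TOKENS, NUM_SPECIAL_TOKENS + NUM_BYTE_VALUES
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if low <= i < high)
    # Incomplete or invalid byte sequences are dropped rather than raising.
    return raw.decode("utf-8", errors="ignore")


ids = text_to_ids("boîte")
print(ids)  # "î" takes two bytes in UTF-8, so 5 characters give 6 ids
print(ids_to_text(ids))
```

Non-ASCII characters expand to several ids, which is why byte-level sequences are longer than their character counts suggest.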

For batched inference and training, however, it is recommended to use the tokenizer:

>>> from transformers import T5ForConditionalGeneration, AutoTokenizer

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

>>> model_inputs = tokenizer(
...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
... )
>>> labels_dict = tokenizer(
...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
... )
>>> labels = labels_dict.input_ids

>>> loss = model(**model_inputs, labels=labels).loss
>>> loss.item()
17.9
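
To make concrete what `padding="longest"` contributes in a batch, here is a hand-rolled sketch in plain Python. It assumes the id layout used on this page (pad id 0, end-of-sequence id 1, byte values shifted by 3); `encode` and `pad_batch` are hypothetical helpers, not part of the library.

```python
PAD_ID, EOS_ID, OFFSET = 0, 1, 3  # assumed ids for <pad>, </s>, and the byte shift


def encode(text: str) -> list[int]:
    """Shift UTF-8 bytes by OFFSET and append the end-of-sequence id."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def pad_batch(texts: list[str]) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest one and build attention masks."""
    encoded = [encode(t) for t in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(
    ["Life is like a box of chocolates.", "Today is Monday."]
)
print([len(row) for row in input_ids])
print(attention_mask[1])
```

The attention mask marks real positions with 1 and padding with 0, so the padded positions of the shorter sentence can be ignored by the model.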

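Similar to T5, ByT5 was trained on the span-mask denoising task. However, since the model works directly on characters, the pretraining task is a bit different. Let's corrupt some characters of the input sentence "The dog chases a ball in the park." and ask ByT5 to predict them for us.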

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")

>>> input_ids_prompt = "The dog chases a ball in the park."
>>> input_ids = tokenizer(input_ids_prompt).input_ids

>>> # Note that we cannot add "{extra_id_...}" to the string directly
>>> # as the Byte tokenizer would incorrectly merge the tokens
>>> # For ByT5, we need to work directly on the character level
>>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
>>> # uses final utf character ids.
>>> # Each UTF-8 byte takes one of 2**8 = 256 values and ByT5 has 3 special tokens.
>>> # => There are 2**8+3 = 259 input ids (0 to 258) and mask tokens count down from index 258.
>>> # => mask to "The dog [258] a ball[257] park."

>>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
>>> input_ids
tensor([[ 87, 107, 104,  35, 103, 114, 106,  35, 258,  35, 100,  35, 101, 100, 111, 111, 257,  35, 115, 100, 117, 110,  49,   1]])

>>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
>>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
>>> output_ids
[0, 258, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118, 257,  35, 108, 113,  35, 119, 107, 104,  35, 103, 108, 118, 102, 114, 256, 108, 113,  35, 119, 107, 104, 35, 115, 100, 117, 110,  49,  35,  87, 107, 104,  35, 103, 114, 106, 35, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118,  35, 100,  35, 101, 100, 111, 111,  35, 108, 113, 255,  35, 108, 113,  35, 119, 107, 104,  35, 115, 100, 117, 110,  49]

>>> # ^- Note how 258 descends to 257, 256, 255

>>> # Now we need to split on the sentinel tokens, let's write a short loop for this
>>> output_ids_list = []
>>> start_token = 0
>>> sentinel_token = 258
>>> while sentinel_token in output_ids:
...     split_idx = output_ids.index(sentinel_token)
...     output_ids_list.append(output_ids[start_token:split_idx])
...     start_token = split_idx
...     sentinel_token -= 1

>>> output_ids_list.append(output_ids[start_token:])
>>> output_string = tokenizer.batch_decode(output_ids_list)
>>> output_string
['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']
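
The masking and splitting steps above can be reproduced without loading a model. The sketch below is plain Python under the same assumptions as the example (byte shift of 3, end-of-sequence id 1, mask ids counting down from 258); `mask_spans`, `split_on_sentinels`, and the hand-built `generated` list are hypothetical and stand in for real model output. Unlike the loop above, the split here drops the sentinel id itself from each piece.

```python
OFFSET, EOS_ID, FIRST_SENTINEL = 3, 1, 258  # assumed id layout from the example


def encode(text: str) -> list[int]:
    """Shift UTF-8 bytes by OFFSET and append the end-of-sequence id."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Turn ids back into text; special ids and invalid bytes are dropped."""
    raw = bytes(i - OFFSET for i in ids if OFFSET <= i < OFFSET + 256)
    return raw.decode("utf-8", errors="ignore")


def mask_spans(ids: list[int], spans: list[tuple[int, int]]) -> list[int]:
    """Replace each (start, end) span with one mask id, counting down from 258."""
    out, cursor, sentinel = [], 0, FIRST_SENTINEL
    for start, end in spans:
        out += ids[cursor:start] + [sentinel]
        cursor, sentinel = end, sentinel - 1
    return out + ids[cursor:]


def split_on_sentinels(output_ids: list[int]) -> list[list[int]]:
    """Cut a generated sequence at 258, 257, ... and drop the mask ids."""
    pieces, start, sentinel = [], 0, FIRST_SENTINEL
    while sentinel in output_ids:
        idx = output_ids.index(sentinel)
        pieces.append(output_ids[start:idx])
        start, sentinel = idx + 1, sentinel - 1
    pieces.append(output_ids[start:])
    return pieces


ids = encode("The dog chases a ball in the park.")
masked = mask_spans(ids, [(8, 14), (21, 28)])  # hides "chases" and " in the"
print(masked)

# Hand-built stand-in for a model output, not an actual generation.
generated = [0, 258] + encode("is the one who does")[:-1] + [257] + encode(" in the disco")[:-1]
print([decode(piece) for piece in split_on_sentinels(generated)])
```

`masked` matches the `input_ids` tensor shown in the example, which confirms that the slicing there removes "chases" and " in the" and leaves the neighbouring spaces in place.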

ByT5Tokenizer

class transformers.ByT5Tokenizer

( eos_token = '</s>', unk_token = '<unk>', pad_token = '<pad>', extra_ids = 125, additional_special_tokens = None, **kwargs )

Parameters

eos_token (str, optional, defaults to "</s>") — The end of sequence token.

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
extra_ids (int, optional, defaults to 125) — Number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to the beginning ("<extra_id_0>" is the last token in the vocabulary, as in ByT5 preprocessing; see here).
additional_special_tokens (list[str], optional) — Additional special tokens used by the tokenizer.

Construct a ByT5 tokenizer. ByT5 simply uses raw UTF-8 byte encoding.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None ) → list[int]

Parameters

token_ids_0 (list[int]) — List of IDs to which the special tokens will be added.
token_ids_1 (list[int], optional) — Optional second list of IDs for sequence pairs.

Returns

list[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:

single sequence: X </s>
pair of sequences: A </s> B </s>
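
The two formats above amount to appending the end-of-sequence id after each sequence. The function below is a simplified sketch, not the library's code: `build_inputs_sketch` is a hypothetical name, and the id 1 for `</s>` is assumed from the examples on this page.

```python
from typing import Optional

EOS_ID = 1  # assumed id of "</s>"


def build_inputs_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return `X </s>` for one sequence or `A </s> B </s>` for a pair."""
    if token_ids_1 is None:
        return token_ids_0 + [EOS_ID]
    return token_ids_0 + [EOS_ID] + token_ids_1 + [EOS_ID]


single = build_inputs_sketch([75, 108])
pair = build_inputs_sketch([75, 108], [82, 114])
print(single)
print(pair)
```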

convert_tokens_to_string

( tokens )

Converts a sequence of tokens (strings) into a single string.
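
A rough sketch of what this conversion involves for a byte-level vocabulary, assuming each token is a one-character string standing for a single byte value. That token representation is an assumption made for illustration, special tokens are left out, and both function names are hypothetical.

```python
def tokenize_sketch(text: str) -> list[str]:
    """Split text into one token per UTF-8 byte (assumed token representation)."""
    return [chr(b) for b in text.encode("utf-8")]


def tokens_to_string_sketch(tokens: list[str]) -> str:
    """Join byte tokens back into bytes, then decode them as UTF-8."""
    raw = b"".join(bytes([ord(token)]) for token in tokens)
    # Incomplete multi-byte sequences are dropped rather than raising.
    return raw.decode("utf-8", errors="ignore")


tokens = tokenize_sketch("boîte")
print(len(tokens))  # one token per byte, and "î" needs two bytes
print(tokens_to_string_sketch(tokens))
```

Decoding has to happen on the joined bytes rather than token by token, because a single character may span several tokens.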

create_token_type_ids_from_sequences

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None ) → list[int]

Parameters

token_ids_0 (list[int]) — List of IDs.
token_ids_1 (list[int], optional) — Optional second list of IDs for sequence pairs.

Returns

list[int]

List of zeros.

Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not make use of token type ids, therefore a list of zeros is returned.
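
As a small illustration of the behaviour described above, the sketch below returns one zero per position of the final input, counting the `</s>` appended after each sequence. `token_type_ids_sketch` is a hypothetical name, and sizing the result to include the end-of-sequence positions is an assumption based on the sequence format shown earlier.

```python
from typing import Optional

EOS_ID = 1  # assumed id of "</s>"


def token_type_ids_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return all zeros, one per position of `X </s>` or `A </s> B </s>`."""
    eos = [EOS_ID]
    if token_ids_1 is None:
        return [0] * len(token_ids_0 + eos)
    return [0] * len(token_ids_0 + eos + token_ids_1 + eos)


print(token_type_ids_sketch([75, 108]))
print(token_type_ids_sketch([75, 108], [82, 114, 120]))
```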

get_special_tokens_mask

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None, already_has_special_tokens: bool = False ) → list[int]

Parameters

token_ids_0 (list[int]) — List of IDs.
token_ids_1 (list[int], optional) — Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

list[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
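
For inputs that do not yet contain special tokens, the mask can be sketched as follows: a 0 for every sequence token and a 1 at each position where `</s>` will be appended. `special_tokens_mask_sketch` is a hypothetical helper covering only the `already_has_special_tokens=False` case.

```python
from typing import Optional


def special_tokens_mask_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Mark the `</s>` slot after each sequence with 1 and everything else with 0."""
    mask = [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(special_tokens_mask_sketch([75, 108, 105]))
print(special_tokens_mask_sketch([75, 108, 105], [82, 114]))
```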

See ByT5Tokenizer for all details.
