GPTSAN-japanese
Overview
The GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama).
GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM in the T5 paper, and supports both Text Generation and Masked Language Modeling tasks. These basic tasks can likewise be fine-tuned for translation or summarization.
Usage example
The generate() method can be used to generate text using the GPTSAN-japanese model.
>>> from transformers import AutoModel, AutoTokenizer
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").cuda()
>>> x_tok = tokenizer("は、", prefix_text="織田信長", return_tensors="pt")
>>> torch.manual_seed(0)
>>> gen_tok = model.generate(x_tok.input_ids.cuda(), token_type_ids=x_tok.token_type_ids.cuda(), max_new_tokens=20)
>>> tokenizer.decode(gen_tok[0])
'織田信長は、2004年に『戦国BASARA』のために、豊臣秀吉'
GPTSAN Features
GPTSAN has some unique features. It has the model structure of a Prefix-LM: it works as a shifted Masked Language Model for prefix input tokens, while un-prefixed inputs behave like a normal generative model. The Spout vector is a GPTSAN-specific input. Spout is pre-trained with random inputs, but you can specify a class of text or an arbitrary vector during fine-tuning, which allows you to indicate the tendency of the generated text. GPTSAN has a sparse Feed Forward based on Switch Transformer. You can also add other layers and train them partially. See the original GPTSAN repository for details.
Prefix-LM Model
GPTSAN has the structure of the model named Prefix-LM in the T5 paper (the original GPTSAN repository calls it hybrid).
In GPTSAN, the Prefix part of the Prefix-LM, that is, the input positions that can be referenced by tokens both before and after them, can be specified with any length, and the length can also differ for each example in a batch.
This length applies to the text entered in prefix_text for the tokenizer.
The tokenizer returns the mask of the Prefix part of the Prefix-LM as token_type_ids.
The model treats the positions where token_type_ids is 1 as the Prefix part, that is, those inputs can refer to tokens both before and after them.
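As a quick illustration of the points above, here is a minimal sketch: the prefix length simply follows prefix_text, and a batch of (prefix, text) pairs lets each example carry its own prefix length. The token_type_ids values shown mirror the tokenizer examples further down this page.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> # "アイ" becomes the Prefix part, "ウエ" the ordinary (causal) part
>>> enc = tokenizer("ウエ", prefix_text="アイ")
>>> enc["token_type_ids"]  # 1 marks the Prefix positions (including the start token)
[1, 1, 1, 0, 0, 0]

>>> # Prefixes of different lengths can be mixed in one batch as (prefix, text) pairs
>>> batch = tokenizer([["アイ", "ウエ"], ["アイウ", "エ"]], padding=True)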
Usage tips
Specifying the Prefix part is done with a mask passed to self-attention. When token_type_ids=None or all zeros, it is equivalent to a regular causal mask.
For example:
x_token = tokenizer("アイウエ")

input_ids:      | SOT | SEG | ア | イ | ウ | エ |
token_type_ids: | 1   | 0   | 0  | 0  | 0  | 0  |

prefix_lm_mask:
SOT | 1 0 0 0 0 0 |
SEG | 1 1 0 0 0 0 |
ア  | 1 1 1 0 0 0 |
イ  | 1 1 1 1 0 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 1 |

x_token = tokenizer("", prefix_text="アイウエ")

input_ids:      | SOT | ア | イ | ウ | エ | SEG |
token_type_ids: | 1   | 1  | 1  | 1  | 1  | 0   |

prefix_lm_mask:
SOT | 1 1 1 1 1 0 |
ア  | 1 1 1 1 1 0 |
イ  | 1 1 1 1 1 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 0 |
SEG | 1 1 1 1 1 1 |

x_token = tokenizer("ウエ", prefix_text="アイ")

input_ids:      | SOT | ア | イ | SEG | ウ | エ |
token_type_ids: | 1   | 1  | 1  | 0   | 0  | 0  |

prefix_lm_mask:
SOT | 1 1 1 0 0 0 |
ア  | 1 1 1 0 0 0 |
イ  | 1 1 1 0 0 0 |
SEG | 1 1 1 1 0 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 1 |
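The tables above imply a simple construction: start from a causal mask and additionally let every position attend to the Prefix positions marked in token_type_ids. A minimal PyTorch sketch of that idea (not the model's internal code):

>>> import torch

>>> token_type_ids = torch.tensor([1, 1, 1, 0, 0, 0])  # SOT | ア | イ | SEG | ウ | エ
>>> seq_len = token_type_ids.shape[0]
>>> causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))
>>> # every position may additionally attend to the Prefix positions
>>> prefix_lm_mask = causal | token_type_ids.unsqueeze(0)
>>> prefix_lm_mask
tensor([[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])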
Spout Vector
A Spout Vector is a special vector for controlling text generation.
This vector is treated as the first embedding in self-attention and supplies external attention to the generated tokens.
In the pre-trained model published from Tanrei/GPTSAN-japanese, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers in the model and is projected into the space acting as external attention.
The Spout Vector projected by the fully connected layers is split and passed to all self-attention layers.
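A Spout Vector can be passed explicitly through the spout argument of the model's forward pass, with shape (batch_size, config.d_spout). A minimal sketch, using an all-zero vector purely as a placeholder; during fine-tuning you would supply a class-conditioned or learned vector instead:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_tok = tokenizer("は、", prefix_text="織田信長", return_tensors="pt")
>>> # placeholder spout of shape (batch_size, config.d_spout); zeros carry no conditioning signal
>>> spout = torch.zeros(1, model.config.d_spout)
>>> out = model(
...     input_ids=x_tok.input_ids,
...     token_type_ids=x_tok.token_type_ids,
...     spout=spout,
... )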
GPTSanJapaneseConfig
class transformers.GPTSanJapaneseConfig
< source >( vocab_size = 36000 max_position_embeddings = 1280 d_model = 1024 d_ff = 8192 d_ext = 4096 d_spout = 128 num_switch_layers = 10 num_ext_layers = 0 num_heads = 16 num_experts = 16 expert_capacity = 128 dropout_rate = 0.0 layer_norm_epsilon = 1e-05 router_bias = False router_jitter_noise = 0.0 router_dtype = 'float32' router_ignore_padding_tokens = False output_hidden_states = False output_attentions = False initializer_factor = 0.002 output_router_logits = False use_cache = True separator_token_id = 35998 pad_token_id = 35995 eos_token_id = 35999 **kwargs )
Parameters
- vocab_size (int, optional, defaults to 36000) — Vocabulary size of the GPTSANJapanese model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTSanJapaneseModel.
- max_position_embeddings (int, optional, defaults to 1280) — The maximum sequence length that this model might ever be used with. Defaults set this to 1280.
- d_model (int, optional, defaults to 1024) — Size of the encoder layers and the pooler layer.
- d_ff (int, optional, defaults to 8192) — Size of the intermediate feed forward layer in each SwitchTransformersBlock.
- d_ext (int, optional, defaults to 4096) — Size of the intermediate feed forward layer in each Extra-layer.
- d_spout (int, optional, defaults to 128) — Size of the spout vector.
- num_switch_layers (int, optional, defaults to 10) — Number of layers in the Switch Transformer layer.
- num_ext_layers (int, optional, defaults to 0) — Number of layers in the Extra-layers.
- num_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
- num_experts (int, optional, defaults to 16) — Number of experts for each SwitchTransformer layer.
- expert_capacity (int, optional, defaults to 128) — Number of tokens that can be stored in each expert. If set to 1, the model will behave like a regular Transformer.
- dropout_rate (float, optional, defaults to 0.0) — The ratio for all dropout layers.
- layer_norm_epsilon (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
- router_bias (bool, optional, defaults to False) — Whether to add a bias to the router.
- router_jitter_noise (float, optional, defaults to 0.0) — Amount of noise to add to the router. Set it to 0.0 during prediction, or to a small value (usually 1e-2) during training.
- router_dtype (str, optional, defaults to "float32") — The dtype used for the routers. It is preferable to keep the dtype as "float32", as specified in the selective precision discussion in the paper.
- router_ignore_padding_tokens (bool, optional, defaults to False) — Whether to ignore padding tokens when routing.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers.
- initializer_factor (float, optional, defaults to 0.002) — A factor for initializing all weight matrices.
- output_router_logits (bool, optional, defaults to False) — Whether or not to return the router logits of all experts.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/value attentions (not used by all models).
This is the configuration class to store the configuration of a GPTSanJapaneseModel. It is used to instantiate a GPTSANJapanese model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GPTSANJapanese Tanrei/GPTSAN-japanese architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
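Following the usual Transformers configuration pattern, a configuration object can be created with default values (which mirror the Tanrei/GPTSAN-japanese architecture) and used to instantiate a randomly initialized model; a short sketch:

>>> from transformers import GPTSanJapaneseConfig, GPTSanJapaneseModel

>>> # Initializing a GPTSanJapanese configuration with the default values
>>> configuration = GPTSanJapaneseConfig()

>>> # Initializing a model (with random weights) from that configuration
>>> model = GPTSanJapaneseModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config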
GPTSanJapaneseTokenizer
class transformers.GPTSanJapaneseTokenizer
< source >( vocab_file emoji_file unk_token = '<|nottoken|>' pad_token = '<|separator|>' bos_token = '<|startoftext|>' eos_token = '<|endoftext|>' sep_token = '<|segmenter|>' do_clean_text = False **kwargs )
Parameters
- vocab_file (str) — File containing the vocabulary.
- emoji_file (str) — File containing the emoji.
- unk_token (str, optional, defaults to "<|nottoken|>") — The token used for unknown characters.
- pad_token (str, optional, defaults to "<|separator|>") — The token used for padding.
- bos_token (str, optional, defaults to "<|startoftext|>") — The beginning of sequence token.
- eos_token (str, optional, defaults to "<|endoftext|>") — The end of sequence token.
- sep_token (str, optional, defaults to "<|segmenter|>") — A special token that separates the prefix part from the general input part.
- do_clean_text (bool, optional, defaults to False) — Whether or not to clean text for URL, EMAIL, TEL, Japanese DATE and Japanese PRICE.
This tokenizer is based on GPTNeoXJapaneseTokenizer and has the following modifications:
- Decodes byte0~byte255 tokens correctly.
- Adds bagofword token handling.
- Returns token_type_ids for the Prefix-LM model.

The bagofword token represents a repetition of the previous token and is converted to 3 consecutive tokens when decoding. In addition, the original Japanese special Sub-Word-Encoding has been released in this repository (https://github.com/tanreinama/Japanese-BPEEncoder_V2). The token_type_ids is a mask indicating the prefix input positions of the Prefix-LM model. To specify a prefix position, specify a prefix input for prefix_text, or specify a sentence of the prefix part and the part after it as a text pair of batch input.
Example:
>>> from transformers import GPTSanJapaneseTokenizer
>>> tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> # You can confirm both 慶応 and 慶應 are encoded to 17750
>>> tokenizer("吾輩は猫である🐯。実は慶応(慶應)大学出身")["input_ids"]
[35993, 35998, 34347, 31459, 30647, 31448, 25, 30659, 35729, 35676, 32417, 30647, 17750, 35589, 17750, 35590, 321, 1281]
>>> # Both 慶応 and 慶應 are decoded to 慶応
>>> tokenizer.decode(tokenizer("吾輩は猫である🐯。実は慶応(慶應)大学出身")["input_ids"])
'吾輩は猫である🐯。実は慶応(慶応)大学出身'
Example for Prefix-LM:
>>> from transformers import GPTSanJapaneseTokenizer
>>> tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> tokenizer("実は慶応(慶應)大学出身", prefix_text="吾輩は猫である🐯。")["input_ids"]
[35993, 34347, 31459, 30647, 31448, 25, 30659, 35729, 35676, 35998, 32417, 30647, 17750, 35589, 17750, 35590, 321, 1281]
>>> # Mask for Prefix-LM inputs
>>> tokenizer("実は慶応(慶應)大学出身", prefix_text="吾輩は猫である🐯。")["token_type_ids"]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Example for batch encode:
>>> from transformers import GPTSanJapaneseTokenizer
>>> tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> tokenizer([["武田信玄", "は、"], ["織田信長", "の配下の、"]], padding=True)["input_ids"]
[[35993, 8640, 25948, 35998, 30647, 35675, 35999, 35999], [35993, 10382, 9868, 35998, 30646, 9459, 30646, 35675]]
>>> # Mask for Prefix-LM inputs
>>> tokenizer([["武田信玄", "は、"], ["織田信長", "の配下の、"]], padding=True)["token_type_ids"]
[[1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0]]
>>> # Mask for padding
>>> tokenizer([["武田信玄", "は、"], ["織田信長", "の配下の、"]], padding=True)["attention_mask"]
[[1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
Converts a sequence of tokens (strings) into a single string.
The tokenizer returns token_type_ids to separate the Prefix part from the rest: token_type_ids is 1 for the Prefix part and 0 for the remaining tokens.
Example:
>>> from transformers import GPTSanJapaneseTokenizer
>>> tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("アイウエ")
>>> # input_ids: | SOT | SEG | ア | イ | ウ | エ |
>>> # token_type_ids: | 1 | 0 | 0 | 0 | 0 | 0 |
>>> x_token = tokenizer("", prefix_text="アイウエ")
>>> # input_ids: | SOT | ア | イ | ウ | エ | SEG |
>>> # token_type_ids: | 1 | 1 | 1 | 1 | 1 | 0 |
>>> x_token = tokenizer("ウエ", prefix_text="アイ")
>>> # input_ids: | SOT | ア | イ | SEG | ウ | エ |
>>> # token_type_ids: | 1 | 1 | 1 | 0 | 0 | 0 |
GPTSanJapaneseModel
class transformers.GPTSanJapaneseModel
< source >( config: GPTSanJapaneseConfig )
Parameters
- config (GPTSanJapaneseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GPTSAN-japanese Model transformer outputting raw hidden-states without any specific head on top.
The GPTSAN-japanese model was proposed in General-purpose Switch transformer based Japanese language model.
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None spout: Optional = None past_key_values: Optional = None head_mask: Optional = None use_cache: Optional = False inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None output_router_logits: Optional = None num_precontext: Optional = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. GPTSAN-japanese is a model that generates sentence continuations or predicts tokens at mask positions. Special tokens required for inputs to the model are automatically appended.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.FloatTensor of shape (batch_size, sequence_length), optional) — An input that masks the Prefix part in the Prefix-LM input. Mask values selected in [0, 1]:
  - 1 for tokens that are prefix input,
  - 0 for tokens that are not prefix input.
- spout (torch.Tensor of shape (batch_size, config.d_spout)) — This vector is transformed through an 8-layer FFN and can be used instead of past_key_values.
- past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
- num_precontext (torch.LongTensor of shape (batch_size, 1)) — Length of hybrid input tokens in the input. Tokens up to this length refer to both front and back like BERT; tokens after that refer only to the front like GPT. See also: https://github.com/tanreinama/GPTSAN/blob/main/report/model.md
The GPTSanJapaneseModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
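A minimal sketch of a bare-model forward pass, assuming the standard last_hidden_state field on the returned output; num_precontext (see above) is an alternative way to mark the prefix length when token_type_ids is not given:

>>> from transformers import AutoTokenizer, GPTSanJapaneseModel

>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> model = GPTSanJapaneseModel.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_tok = tokenizer("は、", prefix_text="織田信長", return_tensors="pt")
>>> # token_type_ids marks the Prefix part; the bare model returns raw hidden states (no head on top)
>>> outputs = model(input_ids=x_tok.input_ids, token_type_ids=x_tok.token_type_ids)
>>> hidden_states = outputs.last_hidden_state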
GPTSanJapaneseForConditionalGeneration
class transformers.GPTSanJapaneseForConditionalGeneration
< source >( config: GPTSanJapaneseConfig )
Parameters
- config (GPTSanJapaneseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The GPTSAN-japanese Model with a language modeling head on top.
The GPTSAN-japanese model was proposed in General-purpose Switch transformer based Japanese language model.
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None spout: Optional = None past_key_values: Optional = None head_mask: Optional = None use_cache: Optional = False inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None output_router_logits: Optional = None labels: Optional = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. GPTSAN-japanese is a model that generates sentence continuations or predicts tokens at mask positions. Special tokens required for inputs to the model are automatically appended.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.FloatTensor of shape (batch_size, sequence_length), optional) — An input that masks the Prefix part in the Prefix-LM input. Mask values selected in [0, 1]:
  - 1 for tokens that are prefix input,
  - 0 for tokens that are not prefix input.
- spout (torch.Tensor of shape (batch_size, config.d_spout)) — This vector is transformed through an 8-layer FFN and can be used instead of past_key_values.
- past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
The GPTSanJapaneseForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
Text Generation with regular LM Model
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> gen_token = model.generate(input_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・軍事の中枢まで掌握した政治家であり、日本史上類を見ない驚異的な軍事侵攻を続け..."
Text Generation with Prefix-LM Model
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("", prefix_text="織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> token_type_ids = x_token.token_type_ids.to(device)
>>> gen_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・外交で数々の戦果を上げるが、1568年からは、いわゆる本能寺の変で細川晴元に暗殺される..."
Simultaneous Text Generation and Masked Language Modeling
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> masked_sentence = "武田信玄は、<|inputmask|>時代ファンならぜひ押さえ<|inputmask|>きたい名将の一人。"
>>> x_token = tokenizer("", prefix_text=masked_sentence, return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> token_type_ids = x_token.token_type_ids.to(device)
>>> out_lm_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
>>> out_mlm_token = model(input_ids, token_type_ids=token_type_ids).logits.argmax(axis=-1)
>>> tokenizer.decode(out_mlm_token[0])
"武田信玄は、戦国時代ファンならぜひ押さえておきたい名将の一人。"
>>> tokenizer.decode(out_lm_token[0][input_ids.shape[1] :])
"武田氏の三代に渡った武田家のひとり\n甲斐市に住む、日本史上最大の戦国大名。..."