BertJapanese

Overview

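The BERT models described here were trained on Japanese text.

Models are available with two different tokenization methods:

  • Tokenize with MeCab and WordPiece. This requires an extra dependency, fugashi, which is a wrapper around MeCab.
  • Tokenize into characters.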
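
To make the character-level method concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function name `char_tokenize_sketch`, the toy vocabulary, and the choice to apply NFKC normalization first are all assumptions made for illustration.

```python
import unicodedata
from typing import List, Set


def char_tokenize_sketch(text: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split text into single characters (hypothetical helper, not the library API)."""
    # Assumption: normalize width/compatibility variants before splitting.
    text = unicodedata.normalize("NFKC", text)
    # Characters missing from the vocabulary are replaced by the unknown token.
    return [ch if ch in vocab else unk_token for ch in text]


# Toy vocabulary built from the example sentence (illustrative only).
vocab = set("吾輩は猫である。")

tokens = char_tokenize_sketch("吾輩は猫である。", vocab)
unknown = char_tokenize_sketch("犬である", vocab)
print(tokens)   # ['吾', '輩', 'は', '猫', 'で', 'あ', 'る', '。']
print(unknown)  # ['[UNK]', 'で', 'あ', 'る']
```
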

To use MecabTokenizer, install the dependencies with pip install transformers["ja"] (or pip install -e .["ja"] if you install from source).
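
Before loading a MeCab-based checkpoint, it can help to confirm that the extra dependency is importable. The following is a small sketch using only the standard library. The helper name `mecab_dependencies_available` is hypothetical, and it checks only for fugashi; a MeCab dictionary package may also be needed.

```python
import importlib.util


def mecab_dependencies_available() -> bool:
    """Return True if the fugashi package (the MeCab wrapper) can be found."""
    return importlib.util.find_spec("fugashi") is not None


if mecab_dependencies_available():
    print("fugashi found: MeCab word tokenization should be usable.")
else:
    print('fugashi not found: try pip install "transformers[ja]".')
```
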

See the cl-tohoku repository for details.

Example of using a model with MeCab and WordPiece tokenization:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)

Example of using a model with character tokenization:

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)

  • This implementation is the same as BERT, except for the tokenization method. Refer to the BERT documentation for more usage examples.

This model was contributed by cl-tohoku.

BertJapaneseTokenizer

class transformers.BertJapaneseTokenizer

( vocab_file spm_file = None do_lower_case = False do_word_tokenize = True do_subword_tokenize = True word_tokenizer_type = 'basic' subword_tokenizer_type = 'wordpiece' never_split = None unk_token = '[UNK]' sep_token = '[SEP]' pad_token = '[PAD]' cls_token = '[CLS]' mask_token = '[MASK]' mecab_kwargs = None sudachi_kwargs = None jumanpp_kwargs = None **kwargs )

Parameters

  • vocab_file (str) — Path to a one-wordpiece-per-line vocabulary file.
  • spm_file (str, optional) — Path to SentencePiece file (generally has a .spm or .model extension) that contains the vocabulary.
  • do_lower_case (bool, optional, defaults to False) — Whether to lower case the input. Only has an effect when do_word_tokenize=True.
  • do_word_tokenize (bool, optional, defaults to True) — Whether to do word tokenization.
  • do_subword_tokenize (bool, optional, defaults to True) — Whether to do subword tokenization.
  • word_tokenizer_type (str, optional, defaults to "basic") — Type of word tokenizer. Choose from [“basic”, “mecab”, “sudachi”, “jumanpp”].
  • subword_tokenizer_type (str, optional, defaults to "wordpiece") — Type of subword tokenizer. Choose from [“wordpiece”, “character”, “sentencepiece”].
  • mecab_kwargs (dict, optional) — Dictionary passed to the MecabTokenizer constructor.
  • sudachi_kwargs (dict, optional) — Dictionary passed to the SudachiTokenizer constructor.
  • jumanpp_kwargs (dict, optional) — Dictionary passed to the JumanppTokenizer constructor.

Construct a BERT tokenizer for Japanese text.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens

( token_ids_0: List token_ids_1: Optional = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

  • single sequence: [CLS] X [SEP]
  • pair of sequences: [CLS] A [SEP] B [SEP]
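
The two formats above can be sketched in plain Python. This is an illustration of the layout, not the library's code: `build_inputs_sketch` is a hypothetical name, and the `CLS_ID` and `SEP_ID` values are placeholders, since the real IDs depend on the vocabulary.

```python
from typing import List, Optional

# Placeholder special-token IDs (assumption); real values come from the vocabulary.
CLS_ID = 2
SEP_ID = 3


def build_inputs_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Wrap one or two ID lists as [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


single = build_inputs_sketch([10, 11, 12])
pair = build_inputs_sketch([10, 11], [20, 21, 22])
print(single)  # [2, 10, 11, 12, 3]
print(pair)    # [2, 10, 11, 3, 20, 21, 22, 3]
```
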

convert_tokens_to_string

( tokens )

Converts a sequence of tokens (strings) into a single string.
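
For WordPiece-style tokens, the conversion can be pictured as joining tokens with spaces and merging continuation pieces. The sketch below assumes the usual BERT convention that continuation pieces start with "##". The helper name `join_wordpieces` and the second token split are hypothetical, and SentencePiece vocabularies are not covered here.

```python
from typing import List


def join_wordpieces(tokens: List[str]) -> str:
    """Join tokens with spaces, gluing '##' continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


sentence = join_wordpieces(["吾輩", "は", "猫", "で", "ある", "。"])
merged = join_wordpieces(["トーク", "##ン", "化"])  # illustrative split, not real tokenizer output
print(sentence)  # 吾輩 は 猫 で ある 。
print(merged)    # トークン 化
```
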

create_token_type_ids_from_sequences

( token_ids_0: List token_ids_1: Optional = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of token type IDs according to the given sequence(s).

Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).
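
As a rough illustration of the mask layout, the sketch below assumes the first segment covers [CLS], the first sequence, and its [SEP], while the second segment covers the second sequence and the final [SEP]. The function name `token_type_ids_sketch` is hypothetical.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return 0s for [CLS] A [SEP] and 1s for B [SEP] (if a second sequence is given)."""
    # +2 accounts for the [CLS] and [SEP] tokens around the first sequence.
    first_segment = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first_segment
    # +1 accounts for the trailing [SEP] after the second sequence.
    return first_segment + [1] * (len(token_ids_1) + 1)


types_single = token_type_ids_sketch([10, 11, 12])
types_pair = token_type_ids_sketch([10, 11], [20, 21, 22])
print(types_single)  # [0, 0, 0, 0, 0]
print(types_pair)    # [0, 0, 0, 0, 1, 1, 1, 1]
```
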

get_special_tokens_mask

( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
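
The returned mask can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `special_tokens_mask_sketch` and its `special_ids` parameter are hypothetical, and the IDs 2 and 3 stand in for [CLS] and [SEP].

```python
from typing import FrozenSet, List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    special_ids: FrozenSet[int] = frozenset({2, 3}),  # placeholder [CLS]/[SEP] IDs
) -> List[int]:
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if already_has_special_tokens:
        # The input already contains special tokens, so mark them by ID lookup.
        return [1 if token_id in special_ids else 0 for token_id in token_ids_0]
    # Otherwise, describe where [CLS] and [SEP] would be inserted.
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask


mask_single = special_tokens_mask_sketch([10, 11, 12])
mask_pair = special_tokens_mask_sketch([10, 11], [20, 21, 22])
mask_existing = special_tokens_mask_sketch([2, 10, 11, 3], already_has_special_tokens=True)
print(mask_single)    # [1, 0, 0, 0, 1]
print(mask_pair)      # [1, 0, 0, 1, 0, 0, 0, 1]
print(mask_existing)  # [1, 0, 0, 1]
```
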
