Transformers documentation

BertJapanese

You are viewing v4.23.1 version. A newer version v4.46.3 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

BertJapanese

Overview

The BERT models trained on Japanese text.

There are models with two different tokenization methods:

  • Tokenize with MeCab and WordPiece. This requires some extra dependencies, fugashi which is a wrapper around MeCab.
  • Tokenize into characters.

To use MecabTokenizer, you should pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install dependencies.

See details on cl-tohoku repository.

Example of using a model with MeCab and WordPiece tokenization:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)

Example of using a model with Character tokenization:

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)

Tips:

  • This implementation is the same as BERT, except for tokenization method. Refer to the documentation of BERT for more usage examples.

This model was contributed by cl-tohoku.

BertJapaneseTokenizer

class transformers.BertJapaneseTokenizer

< >

( vocab_file do_lower_case = False do_word_tokenize = True do_subword_tokenize = True word_tokenizer_type = 'basic' subword_tokenizer_type = 'wordpiece' never_split = None unk_token = '[UNK]' sep_token = '[SEP]' pad_token = '[PAD]' cls_token = '[CLS]' mask_token = '[MASK]' mecab_kwargs = None sudachi_kwargs = None jumanpp_kwargs = None **kwargs )

Parameters

  • vocab_file (str) — Path to a one-wordpiece-per-line vocabulary file.
  • do_lower_case (bool, optional, defaults to True) — Whether to lower case the input. Only has an effect when do_basic_tokenize=True.
  • do_word_tokenize (bool, optional, defaults to True) — Whether to do word tokenization.
  • do_subword_tokenize (bool, optional, defaults to True) — Whether to do subword tokenization.
  • word_tokenizer_type (str, optional, defaults to "basic") — Type of word tokenizer. Choose from [“basic”, “mecab”, “sudachi”, “jumanpp”].
  • subword_tokenizer_type (str, optional, defaults to "wordpiece") — Type of subword tokenizer. Choose from [“wordpiece”, “character”].
  • mecab_kwargs (dict, optional) — Dictionary passed to the MecabTokenizer constructor.
  • sudachi_kwargs (dict, optional) — Dictionary passed to the SudachiTokenizer constructor.
  • jumanpp_kwargs (dict, optional) — Dictionary passed to the JumanppTokenizer constructor.

Construct a BERT tokenizer for Japanese text.