metadata
language:
  - en
  - ja
license: mit
datasets:
  - snow_simplified_japanese_corpus
tags:
  - ja
  - japanese
  - tokenizer
widget:
  - text: 誰が一番に着くか私には分かりません。

Japanese Dummy Tokenizer

This repository contains a dummy Japanese tokenizer trained on the snow_simplified_japanese_corpus dataset. The tokenizer was trained with Hugging Face datasets in streaming mode, so the corpus did not need to be downloaded in full before training.

Intended uses & limitations

You can use this tokenizer to tokenize Japanese sentences.

How to use it

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")
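
Once the tokenizer is loaded, it can be applied to a sentence. The sketch below is illustrative and not part of the repository: the helper name inspect_tokenization is hypothetical, and the example sentence is the one from the widget metadata. It relies only on standard transformers tokenizer methods (calling the tokenizer, convert_ids_to_tokens, decode).

```python
def inspect_tokenization(tokenizer, text):
    """Return the tokens, ids, and decoded round trip for one sentence.

    Hypothetical helper for illustration; works with any tokenizer object
    exposing the standard transformers interface.
    """
    encoding = tokenizer(text)
    input_ids = encoding["input_ids"]
    return {
        "tokens": tokenizer.convert_ids_to_tokens(input_ids),
        "input_ids": input_ids,
        "decoded": tokenizer.decode(input_ids, skip_special_tokens=True),
    }


def main():
    # Imported here so the helper above stays usable without transformers.
    from transformers import AutoTokenizer

    # Downloads the tokenizer files from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")
    result = inspect_tokenization(tokenizer, "誰が一番に着くか私には分かりません。")
    for key, value in result.items():
        print(key, value)
```

Calling main() requires network access to the Hub. The exact tokens depend on the trained vocabulary, so no expected output is shown.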

How to train the tokenizer

See the file tokenizer.py, which you can freely adapt to other datasets. This tokenizer is based on the tokenizer from csebuetnlp/mT5_multilingual_XLSum.
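
The contents of tokenizer.py are not reproduced here. As a rough sketch of what streaming training on top of an existing tokenizer can look like, the code below batches examples lazily and passes them to train_new_from_iterator. Several details are assumptions rather than facts from this repository: the function names, the text column "original_ja", the vocabulary size, and the batch size. Recent versions of datasets may also need extra arguments to load this corpus.

```python
from itertools import islice


def batch_iterator(examples, text_column, batch_size=1000):
    """Lazily yield lists of texts from an iterable of dict-like examples."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield [example[text_column] for example in batch]


def train_dummy_tokenizer(vocab_size=32000, text_column="original_ja"):
    """Hypothetical training routine; not the repository's actual script."""
    # Imported here so batch_iterator stays usable without these libraries.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # streaming=True returns an iterable dataset that is read on the fly.
    dataset = load_dataset(
        "snow_simplified_japanese_corpus", split="train", streaming=True
    )
    # Reuse the base tokenizer's pipeline and learn a new vocabulary.
    base = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
    return base.train_new_from_iterator(
        batch_iterator(dataset, text_column), vocab_size=vocab_size
    )
```

To adapt this to another dataset, change the dataset name and the text column. The returned tokenizer can then be written to disk with save_pretrained.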