---
language:
- en
- ja
license: mit
datasets:
- snow_simplified_japanese_corpus
tags:
- ja
- japanese
- tokenizer
widget:
- text: "誰が一番に着くか私には分かりません。"
---

# Japanese Dummy Tokenizer

This repository contains a dummy Japanese tokenizer trained on the `snow_simplified_japanese_corpus` dataset. The tokenizer was trained with the Hugging Face `datasets` library in streaming mode.

## Intended uses & limitations

You can use this tokenizer to tokenize Japanese sentences.

## How to use it

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")
```

## How to train the tokenizer

See the file `tokenizer.py`; you can freely adapt it to other datasets. This tokenizer is based on the tokenizer from `csebuetnlp/mT5_multilingual_XLSum`. A rough sketch of the training approach is shown below.
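The following is a minimal sketch of the streaming training approach, not the exact contents of `tokenizer.py`. The dataset config name (`snow_t15`), the `original_ja` column, and the `vocab_size` are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the corpus so it is never fully materialized on disk.
# Config name and column name are assumptions about the dataset schema.
dataset = load_dataset("snow_simplified_japanese_corpus", "snow_t15", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["original_ja"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Start from the mT5 XLSum tokenizer and retrain its vocabulary on the Japanese text.
base_tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
new_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)  # vocab_size is illustrative
new_tokenizer.save_pretrained("japanese-dummy-tokenizer")
```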
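As a quick check of the published checkpoint, a small tokenization example using the widget sentence above (the exact tokens depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")

text = "誰が一番に着くか私には分かりません。"
print(tokenizer.tokenize(text))        # subword tokens
print(tokenizer(text)["input_ids"])    # corresponding token ids
```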