Getok: Custom Tokenizer for Tibetan Buddhist Texts
This is a custom Byte Pair Encoding (BPE) tokenizer built specifically for Tibetan Buddhist texts and trained with the SentencePiece / Hugging Face Tokenizers library. It is designed to tokenize text efficiently for downstream NLP tasks, supports Unicode text in both Tibetan and English, and was trained on a domain-specific corpus of Tibetan Buddhist texts.
This model was developed as part of the MLotsawa project. More information can be found here.
Special thanks to Andres Montano for suggesting the name of this tokenizer.
Use Cases
- Preprocessing for text classification, translation, summarization, or language modeling tasks (see the sketch after this list)
- Training or fine-tuning language models for Tibetan Buddhism-related tasks
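For example, here is a minimal preprocessing sketch for a parallel batch. The batch contents are illustrative, and padding assumes the "[PAD]" token listed under Details is registered as the tokenizer's padding token:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Illustrative Tibetan/English pair; any list of strings works the same way
batch = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔', 'Homage to Manjushri, the youthful!']

# Pad to the longest sequence in the batch (requires a configured pad token)
encoded = tokenizer(batch, padding=True)
print(encoded['input_ids'])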
Details
- Tokenizer Type: BPE (Byte Pair Encoding)
- Vocabulary Size: 32,000
- Normalization: None
- Special Tokens: "[PAD]", "[UNK]", "[BOS]", "[EOS]"
- Tokenization Level: Subword
- Languages: Tibetan, English
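These properties can be verified after loading, using standard transformers attributes:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Should report the 32,000-entry vocabulary and the special tokens listed above
print(tokenizer.vocab_size)
print(tokenizer.all_special_tokens)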
Training Data
The tokenizer was trained on a corpus consisting of:
- 879,132 Tibetan sentences from Buddhist texts
- 879,132 English sentences from translations of those texts
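For reference, a comparable BPE tokenizer can be trained with the Hugging Face Tokenizers library along these lines. The file names, pre-tokenizer choice, and trainer settings below are illustrative assumptions, not the exact recipe used for Getok:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Plain BPE model with no normalizer, matching "Normalization: None" above
tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))

# Assumed pre-tokenization: split on whitespace before learning merges
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=['[PAD]', '[UNK]', '[BOS]', '[EOS]'],
)

# tibetan.txt and english.txt are placeholder names for the parallel corpus
tokenizer.train(files=['tibetan.txt', 'english.txt'], trainer=trainer)
tokenizer.save('getok.json')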
Usage
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Encode a Tibetan sentence into subword token ids
tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
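To inspect the subword pieces rather than ids, or to round-trip ids back to text, the standard PreTrainedTokenizerFast methods apply, continuing from the snippet above:

# Subword pieces produced by the BPE merges
print(tokenizer.tokenize('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔'))

# Decode token ids back to text
ids = tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
print(tokenizer.decode(ids))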
Limitations
The tokenizer currently supports Unicode text in Tibetan or Latin script. However, it was trained only on Tibetan and English texts and should not be expected to perform well on other languages that use those scripts (e.g., Dzongkha or French).
This tokenizer is not suitable for languages written in other scripts (e.g., Greek or Russian).
Fine-tuning a pretrained model with this tokenizer should be expected to take longer than fine-tuning with the model's own tokenizer, because the model must adapt to the new encodings.
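Concretely, the new vocabulary means the base model's embedding matrix must be resized, and the newly added rows start untrained. A minimal sketch, where the T5 checkpoint is only an example and not a recommendation:

from transformers import AutoModelForSeq2SeqLM, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')
model = AutoModelForSeq2SeqLM.from_pretrained('google-t5/t5-small')  # example checkpoint only

# Resize the embedding matrix to the tokenizer's vocabulary; new rows are
# randomly initialized and must be learned, which is why training takes longer
model.resize_token_embeddings(len(tokenizer))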
Author & Contact
Author: billingsmoore
Contact: billingsmoore [at] gmail [dot] com