Getok: Custom Tokenizer for Tibetan Buddhist Texts
This is a custom Byte Pair Encoding (BPE) tokenizer built specifically for Tibetan Buddhist texts and trained with the SentencePiece / Hugging Face Tokenizers library. It is designed to tokenize text efficiently for downstream NLP tasks, supports Unicode text in both Tibetan and English, and was trained on a domain-specific corpus of Tibetan Buddhist texts.
This model was developed as part of the MLotsawa project. More information can be found here.
Special thanks to Andres Montano for suggesting the name of this tokenizer.
Use Cases
- Preprocessing for text classification, translation, summarization, or language modeling tasks (see the sketch after this list)
- Training or fine-tuning language models for Tibetan Buddhism-related tasks
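For example, here is a minimal preprocessing sketch for a parallel batch. The batch contents are illustrative, and padding assumes the "[PAD]" token listed under Details is registered as the tokenizer's padding token:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Illustrative Tibetan/English pair; any list of strings works the same way
batch = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔', 'Homage to Manjushri, the youthful!']

# Pad to the longest sequence in the batch (requires a configured pad token)
encoded = tokenizer(batch, padding=True)
print(encoded['input_ids'])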
Details
- Tokenizer Type: BPE (Byte Pair Encoding)
- Vocabulary Size: 32,000
- Normalization: None
- Special Tokens: "[PAD]", "[UNK]", "[BOS]", "[EOS]"
- Tokenization Level: Subword
- Languages: Tibetan, English
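These properties can be verified after loading, using standard transformers attributes:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Should report the 32,000-entry vocabulary and the special tokens listed above
print(tokenizer.vocab_size)
print(tokenizer.all_special_tokens)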
Training Data
The tokenizer was trained on a corpus consisting of:
- 879,132 Tibetan sentences from Buddhist texts
- 879,132 English sentences from translations of those texts
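For reference, a comparable BPE tokenizer can be trained with the Hugging Face Tokenizers library along these lines. The file names, pre-tokenizer choice, and trainer settings below are illustrative assumptions, not the exact recipe used for Getok:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Plain BPE model with no normalizer, matching "Normalization: None" above
tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))

# Assumed pre-tokenization: split on whitespace before learning merges
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=['[PAD]', '[UNK]', '[BOS]', '[EOS]'],
)

# tibetan.txt and english.txt are placeholder names for the parallel corpus
tokenizer.train(files=['tibetan.txt', 'english.txt'], trainer=trainer)
tokenizer.save('getok.json')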
Usage
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

# Encode a Tibetan sentence into subword token ids
tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
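To inspect the subword pieces rather than ids, or to round-trip ids back to text, the standard PreTrainedTokenizerFast methods apply, continuing from the snippet above:

# Subword pieces produced by the BPE merges
print(tokenizer.tokenize('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔'))

# Decode token ids back to text
ids = tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
print(tokenizer.decode(ids))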
Limitations
The tokenizer currently supports Unicode text in Tibetan or Latin script. However, it was trained only on Tibetan and English texts and should not be expected to perform well on other languages that use those scripts (e.g., Dzongkha or French).
This tokenizer is not suitable for languages written in other scripts (e.g., Greek or Russian).
Fine-tuning a pretrained model with this tokenizer should be expected to take longer than fine-tuning with the model's own tokenizer, because the model must adapt to the new encodings.
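Concretely, the new vocabulary means the base model's embedding matrix must be resized, and the newly added rows start untrained. A minimal sketch, where the T5 checkpoint is only an example and not a recommendation:

from transformers import AutoModelForSeq2SeqLM, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')
model = AutoModelForSeq2SeqLM.from_pretrained('google-t5/t5-small')  # example checkpoint only

# Resize the embedding matrix to the tokenizer's vocabulary; new rows are
# randomly initialized and must be learned, which is why training takes longer
model.resize_token_embeddings(len(tokenizer))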
Author & Contact
Author: billingsmoore
Contact: billingsmoore [at] gmail [dot] com