Amharic BPE Tokenizer
This repo contains a Byte-Pair Encoding (BPE) tokenizer trained on the Amharic subset of the OSCAR dataset. It uses the same design as the GPT-2 tokenizer but is trained from scratch on an Amharic corpus, with a vocabulary size of 24,000.
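To illustrate what BPE training does, here is a minimal, self-contained sketch of the core merge loop: starting from individual characters, it repeatedly finds the most frequent adjacent pair and merges it into a new symbol. This is a toy illustration only, not the actual implementation used to train this tokenizer.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy training loop: start from characters, apply two merges.
tokens = list("abababcab")
for _ in range(2):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # ['abab', 'ab', 'c', 'ab']
```

A real BPE trainer records the order of merges so the same sequence can be replayed at tokenization time; the vocabulary size (here, 24,000) sets how many merges are learned.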
How to use
You can load the tokenizer from huggingface hub as follows.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")
tokenizer("አባይን ያላየ የፕሌን ቲኬት እችለዋለው።")