---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---

# Amharic BPE Tokenizer

This repo contains a **Byte-Pair Encoding** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It has the same design as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `24000`.

# How to use

You can load the tokenizer from the Hugging Face Hub as follows.

```python
from transformers import AutoTokenizer

# Download the tokenizer files from the Hub and load them
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# Tokenize an Amharic sentence
tokenizer("አባይን ያላየ የፕሌን ቲኬት እችለዋለው።")
```
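
For readers new to Byte-Pair Encoding, here is a minimal sketch of how BPE merges are learned: start from individual symbols and repeatedly merge the most frequent adjacent pair. This is a toy, character-level illustration, not the code used to train this tokenizer. The function name `learn_bpe_merges` and the sample words are hypothetical, chosen only for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: learn up to `num_merges` merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, replacing the chosen pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical sample corpus, for illustration only.
merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

A real tokenizer of this kind keeps merging until the vocabulary reaches its target size (`24000` here), which is why frequent Amharic character sequences end up as single tokens.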