---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---
# Amharic BPE Tokenizer
This repo contains a **Byte-Pair Encoding (BPE)** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It uses the same design as the GPT-2 tokenizer but was trained from scratch on Amharic text with a vocabulary size of `24000`.
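
The card doesn't include the training script, but a tokenizer like this can be retrained with the `train_new_from_iterator` method on the GPT-2 fast tokenizer. The sketch below is an assumption of how it might be done; the OSCAR config name (`unshuffled_deduplicated_am`) and batching details are illustrative, not taken from this repo.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Amharic subset of OSCAR (config name is an assumption;
# newer versions of `datasets` may require trust_remote_code=True).
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for the tokenizer trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from the GPT-2 tokenizer and retrain its BPE vocabulary
# from scratch on Amharic text with a 24k vocabulary.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=24000)
tokenizer.save_pretrained("gpt2-oscar-amharic-tokenizer")
```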

# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# Tokenize an example Amharic sentence
tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")
```
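
As a quick sanity check (not part of the original card), you can inspect the subword pieces produced for a sentence and confirm that decoding reproduces the input text:

```python
# Encode, inspect the pieces, and verify a decode round-trip.
encoding = tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")
print(encoding["input_ids"])                                   # token ids
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # subword strings
print(tokenizer.decode(encoding["input_ids"]))                 # should match the input
```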