---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---

# Amharic BPE Tokenizer

This repo contains a **Byte-Pair Encoding (BPE)** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It uses the same byte-level BPE scheme as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `24000`.

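A tokenizer of this kind can be reproduced with the `train_new_from_iterator` method of the fast GPT-2 tokenizer. This is a minimal sketch under stated assumptions, not the exact training script used for this repo; the tiny `corpus` list is a placeholder standing in for the Amharic OSCAR subset.

```python
from transformers import AutoTokenizer

# placeholder corpus; the real tokenizer was trained on the Amharic OSCAR subset
corpus = ["አማርኛ ሰዋስው", "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"]

# start from GPT-2's byte-level BPE and retrain its merges and vocabulary
base = AutoTokenizer.from_pretrained("gpt2")
amharic_tokenizer = base.train_new_from_iterator(corpus, vocab_size=24000)

# the retrained tokenizer is used like any other fast tokenizer
ids = amharic_tokenizer("አዲስ አበባ")["input_ids"]
```

With such a small placeholder corpus the trainer will stop well short of the `24000` vocabulary target; on the full OSCAR subset it fills the vocabulary with Amharic subwords.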
# How to use

You can load the tokenizer from the Hugging Face Hub as follows.

```python
from transformers import AutoTokenizer

# download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# tokenize a sample Amharic sentence
tokenizer("አማርኛ ቆንጆ ቋንቋ ነው።")
```
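To see what the tokenizer actually produces, you can inspect the subword pieces and decode the ids back to text. The sentence below is just an illustrative sample input, not taken from the training data.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# encode a sample sentence and look at the subword pieces
enc = tokenizer("አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።")
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(tokens)

# decoding the ids restores the original text
print(tokenizer.decode(enc["input_ids"]))
```

Because this is a byte-level BPE, the printed token strings are byte-pair representations rather than readable Amharic; `decode` maps them back to the original characters.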