---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---
# Amharic BPE Tokenizer
This repo contains a **Byte-Pair Encoding** tokenizer trained on the **Amharic** subset of the [OSCAR](https://huggingface.co/datasets/oscar) dataset. It uses the same byte-level BPE scheme as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `24000`.
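The training script is not included in this repo, but a tokenizer like this can be reproduced with the `train_new_from_iterator` method of a fast tokenizer. The sketch below assumes the `unshuffled_deduplicated_am` config of OSCAR; the variable names are illustrative.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Amharic subset of OSCAR (the exact config name is an assumption)
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Yield raw text in batches so the whole corpus never sits in memory at once
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Reuse GPT-2's byte-level BPE pipeline, but learn a new 24k vocabulary from scratch
base = AutoTokenizer.from_pretrained("gpt2")
amharic_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=24000)
amharic_tokenizer.save_pretrained("gpt2-oscar-amharic-tokenizer")
```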
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")
tokenizer("α α£αα α«αα¨ α¨ααα α²α¬α΅ α₯α½ααααα’")
``` |
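To inspect the learned segmentation, you can convert the input ids back to token strings and decode them. Note that, like GPT-2, the tokenizer stores tokens in a byte-level alphabet, so the raw token strings will not look like Ethiopic script even though decoding recovers the original text.
```python
enc = tokenizer("አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።")

# Raw token strings are in GPT-2's byte-to-unicode alphabet
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

# Decoding round-trips back to the original Amharic sentence
print(tokenizer.decode(enc["input_ids"]))
```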