---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---
# Amharic BPE Tokenizer
This repo contains a **Byte-Pair Encoding (BPE)** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It uses the same byte-level BPE algorithm as the GPT-2 tokenizer but was trained from scratch on Amharic text with a vocabulary size of `24000`.
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# Tokenize an Amharic sentence; returns a dict with input_ids and attention_mask
tokenizer("አባይን ያላየ የፕሌን ቲኬት እችለዋለው።")
```
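Since this is a byte-level BPE tokenizer, encoding and decoding are lossless. The sketch below (assuming the repo id above) shows the subword tokens the tokenizer produces and a round trip from text to token ids and back.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

text = "አባይን ያላየ የፕሌን ቲኬት እችለዋለው።"

# Inspect the subword tokens (byte-level BPE pieces)
print(tokenizer.tokenize(text))

# Encode to token ids, then decode back to a string
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(decoded)
```

Because byte-level BPE can represent any UTF-8 string, no Amharic character falls out of vocabulary; unseen sequences simply decompose into smaller byte pieces.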