---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---
# Amharic BPE Tokenizer
This repo contains a **Byte-Pair Encoding (BPE)** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It uses the same design as the GPT-2 tokenizer but was trained from scratch on Amharic text with a vocabulary size of `24000`.
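
The card doesn't include the training script, but a tokenizer like this can be retrained with the `train_new_from_iterator` method on the GPT-2 fast tokenizer. The sketch below is an assumption of how it might be done; the OSCAR config name (`unshuffled_deduplicated_am`) and batching details are illustrative, not taken from this repo.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Amharic subset of OSCAR (config name is an assumption;
# newer versions of `datasets` may require trust_remote_code=True).
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for the tokenizer trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from the GPT-2 tokenizer and retrain its BPE vocabulary
# from scratch on Amharic text with a 24k vocabulary.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=24000)
tokenizer.save_pretrained("gpt2-oscar-amharic-tokenizer")
```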

# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# Tokenize an example Amharic sentence
tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")
```
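
As a quick sanity check (not part of the original card), you can inspect the subword pieces produced for a sentence and confirm that decoding reproduces the input text:

```python
# Encode, inspect the pieces, and verify a decode round-trip.
encoding = tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")
print(encoding["input_ids"])                                   # token ids
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # subword strings
print(tokenizer.decode(encoding["input_ids"]))                 # should match the input
```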