---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---
# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same tokenization scheme as the **BERT** tokenizer, but was trained from scratch on an Amharic corpus with a vocabulary size of `30522`.
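WordPiece segments each word greedily, always taking the longest prefix found in the vocabulary and marking non-initial pieces with a `##` prefix. The sketch below illustrates this longest-match-first loop with a tiny toy vocabulary (not the real 30522-token vocabulary shipped with this tokenizer):

```python
# Minimal sketch of WordPiece's greedy longest-match-first segmentation.
# The vocabulary here is a toy example, NOT the tokenizer's real vocab.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"α‹¨α‹“αˆˆαˆ", "##αŠ α‰€α‰"}
print(wordpiece_tokenize("α‹¨α‹“αˆˆαˆαŠ α‰€α‰", toy_vocab))
# ['α‹¨α‹“αˆˆαˆ', '##αŠ α‰€α‰']
```

This is why the first token of a split word appears bare while the continuation pieces carry `##`, as in the output example below.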
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α‹¨α‹“αˆˆαˆαŠ α‰€α‰ ነጻ αŠ•αŒα‹΅ αˆ˜αˆ΅α‹α‹α‰΅ α‹΅αˆ…αŠα‰΅αŠ• αˆˆαˆ›αˆΈαŠα α‰ αˆšα‹°αˆ¨αŒˆα‹ α‰΅αŒαˆ αŠ αŠ•α‹± αŒ α‰ƒαˆš መሣαˆͺα‹« αˆŠαˆ†αŠ• αˆ˜α‰»αˆ‰ α‰₯α‹™ α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅ αŒ‰α‹³α‹­ αŠα‹α’")
```
Output:
```python
['α‹¨α‹“αˆˆαˆ', '##αŠ α‰€α‰', 'ነጻ', 'αŠ•αŒα‹΅', 'αˆ˜αˆ΅α‹α‹α‰΅', 'α‹΅αˆ…αŠα‰΅αŠ•', 'αˆˆαˆ›αˆΈαŠα', 'α‰ αˆšα‹°αˆ¨αŒˆα‹', 'α‰΅αŒαˆ', 'αŠ αŠ•α‹±', 'αŒ α‰ƒαˆš', 'መሣαˆͺα‹«', 'αˆŠαˆ†αŠ•', 'αˆ˜α‰»αˆ‰', 'α‰₯α‹™', 'α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅', 'αŒ‰α‹³α‹­', 'αŠα‹', 'ፒ']
```