rasyosef/bert-amharic-tokenizer

rasyosef committed on Feb 8

Commit eb373ec • 1 Parent(s): 0b8cdeb

Create README.md

Files changed (1)

README.md +25 -0

README.md ADDED

	@@ -0,0 +1,25 @@

+---
+license: mit
+datasets:
+- oscar
+- mc4
+language:
+- am
+library_name: transformers
+---
+# Amharic WordPiece Tokenizer
+This repo contains a **WordPiece** tokenizer trained on the **Amharic** subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same tokenization scheme as the **BERT** tokenizer, but it was trained from scratch on Amharic text with a vocabulary size of `30522`.
+# How to use
+You can load the tokenizer from the Hugging Face Hub as follows.
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
+tokenizer.tokenize("የዓለምአቀፉ ነጻ ንግድ መስፋፋት ድህነትን ለማሸነፍ በሚደረገው ትግል አንዱ ጠቃሚ መሣሪያ ሊሆን መቻሉ ብዙ የሚነገርለት ጉዳይ ነው።")
+```
+Output:
+```python
+['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ', 'መስፋፋት', 'ድህነትን', 'ለማሸነፍ', 'በሚደረገው', 'ትግል', 'አንዱ', 'ጠቃሚ', 'መሣሪያ', 'ሊሆን', 'መቻሉ', 'ብዙ', 'የሚነገርለት', 'ጉዳይ', 'ነው', '።']
+```
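
The output in the README shows the word `የዓለምአቀፉ` split into `የዓለም` and `##አቀፉ`. As a rough illustration of how a WordPiece tokenizer can arrive at such a split, here is a minimal sketch of greedy longest-match-first matching. The function name `wordpiece_tokenize` and the four-entry `toy_vocab` are hypothetical and exist only for this example; the real tokenizer uses its own trained 30522-entry vocabulary and additional normalization and pre-tokenization steps that are not reproduced here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        # one character at a time until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the tokenizer's real vocabulary.
toy_vocab = {"የዓለም", "##አቀፉ", "ነጻ", "ንግድ"}

tokens = []
for word in "የዓለምአቀፉ ነጻ ንግድ".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(tokens)
```

With this toy vocabulary the first word is too long to match as a whole, so the loop falls back to the longest prefix it knows and then matches the remainder as a `##` continuation piece. A word with no matching pieces at all is mapped to `[UNK]`.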