Amharic WordPiece Tokenizer

This repo contains a WordPiece tokenizer trained on the Amharic subset of the OSCAR and mC4 datasets. It uses the same WordPiece algorithm as the BERT tokenizer, but was trained from scratch on Amharic text with a vocabulary size of 30,522.
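
For reference, a WordPiece tokenizer of this kind can be trained from scratch with the Hugging Face tokenizers library. The snippet below is only a minimal sketch under assumed settings: the corpus file path, min_frequency, and special-token list are placeholders, not the exact configuration used for this model.

from tokenizers import BertWordPieceTokenizer

# Minimal sketch: train a BERT-style WordPiece tokenizer from scratch.
# "amharic_corpus.txt" is a placeholder path; the actual training data was
# the Amharic subset of OSCAR and mC4.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["amharic_corpus.txt"],
    vocab_size=30522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt to the given (existing) directory.
tokenizer.save_model("bert-amharic-tokenizer")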

How to use

You can load the tokenizer from the Hugging Face Hub as follows.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α‹¨α‹“αˆˆαˆαŠ α‰€α‰ ነጻ αŠ•αŒα‹΅ αˆ˜αˆ΅α‹α‹α‰΅ α‹΅αˆ…αŠα‰΅αŠ• αˆˆαˆ›αˆΈαŠα α‰ αˆšα‹°αˆ¨αŒˆα‹ α‰΅αŒαˆ αŠ αŠ•α‹± αŒ α‰ƒαˆš መሣαˆͺα‹« αˆŠαˆ†αŠ• αˆ˜α‰»αˆ‰ α‰₯α‹™ α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅ αŒ‰α‹³α‹­ αŠα‹α’")

Output:

['α‹¨α‹“αˆˆαˆ', '##αŠ α‰€α‰', 'ነጻ', 'αŠ•αŒα‹΅', 'αˆ˜αˆ΅α‹α‹α‰΅', 'α‹΅αˆ…αŠα‰΅αŠ•', 'αˆˆαˆ›αˆΈαŠα', 'α‰ αˆšα‹°αˆ¨αŒˆα‹', 'α‰΅αŒαˆ', 'αŠ αŠ•α‹±', 'αŒ α‰ƒαˆš', 'መሣαˆͺα‹«', 'αˆŠαˆ†αŠ•', 'αˆ˜α‰»αˆ‰', 'α‰₯α‹™', 'α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅', 'αŒ‰α‹³α‹­', 'αŠα‹', 'ፒ']