Edit model card

Amharic WordPiece Tokenizer

This repo contains a WordPiece tokenizer trained on the Amharic subset of the oscar and mc4 datasets. It's the same as the BERT tokenizer but trained from scratch on an amharic text dataset, with a vocabulary size of 24576.

How to use

You can load the tokenizer from huggingface hub as follows.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α‹¨α‹“αˆˆαˆαŠ α‰€α‰ ነጻ αŠ•αŒα‹΅ αˆ˜αˆ΅α‹α‹α‰΅ α‹΅αˆ…αŠα‰΅αŠ• αˆˆαˆ›αˆΈαŠα α‰ αˆšα‹°αˆ¨αŒˆα‹ α‰΅αŒαˆ αŠ αŠ•α‹± αŒ α‰ƒαˆš መሣαˆͺα‹« αˆŠαˆ†αŠ• αˆ˜α‰»αˆ‰ α‰₯α‹™ α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅ αŒ‰α‹³α‹­ αŠα‹α’")

Output:

['α‹¨α‹“αˆˆαˆ', '##አ', '##ቀፉ', 'ነጻ', 'αŠ•αŒα‹΅', 'αˆ˜αˆ΅α‹α‹α‰΅', 'α‹΅αˆ…αŠα‰΅αŠ•', 'αˆˆαˆ›αˆΈαŠα', 'α‰ αˆšα‹°αˆ¨αŒˆα‹', 'α‰΅αŒαˆ', 'αŠ αŠ•α‹±', 'αŒ α‰ƒαˆš', 'መሣαˆͺα‹«', 'αˆŠαˆ†αŠ•', 'αˆ˜α‰»αˆ‰', 'α‰₯α‹™', 'α‹¨αˆšαŠαŒˆαˆ­', '##αˆˆα‰΅', 'αŒ‰α‹³α‹­', 'αŠα‹', 'ፒ']
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference API
Unable to determine this model’s pipeline type. Check the docs .

Datasets used to train rasyosef/bert-amharic-tokenizer-24k