---
license: mit
datasets:
  - oscar
  - mc4
language:
  - am
library_name: transformers
---

# Amharic WordPiece Tokenizer

This repo contains a WordPiece tokenizer trained on the Amharic subset of the OSCAR and mC4 datasets. It uses the same tokenization scheme as the BERT tokenizer, but was trained from scratch on Amharic text with a vocabulary size of 30522.
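
For reference, a tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. The snippet below is a minimal sketch rather than the exact script used for this repo; the corpus file name (`amharic_corpus.txt`) and the special-token list are assumptions.

```python
from tokenizers import BertWordPieceTokenizer

# Minimal sketch: train a BERT-style WordPiece tokenizer from scratch.
# "amharic_corpus.txt" is a hypothetical plain-text file, one sentence per line.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["amharic_corpus.txt"],
    vocab_size=30522,  # vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save("amharic-wordpiece.json")  # serialize the trained tokenizer
```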

## How to use

You can load the tokenizer from the Hugging Face Hub as follows.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")

# Roughly: "Whether the expansion of free trade around the world can be one
# useful tool in the struggle to defeat poverty is a much-discussed issue."
tokenizer.tokenize("α‹¨α‹“αˆˆαˆαŠ α‰€α‰ ነጻ αŠ•αŒα‹΅ αˆ˜αˆ΅α‹α‹α‰΅ α‹΅αˆ…αŠα‰΅αŠ• αˆˆαˆ›αˆΈαŠα α‰ αˆšα‹°αˆ¨αŒˆα‹ α‰΅αŒαˆ αŠ αŠ•α‹± αŒ α‰ƒαˆš መሣαˆͺα‹« αˆŠαˆ†αŠ• αˆ˜α‰»αˆ‰ α‰₯α‹™ α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅ αŒ‰α‹³α‹­ αŠα‹α’")
```

Output:

['α‹¨α‹“αˆˆαˆ', '##αŠ α‰€α‰', 'ነጻ', 'αŠ•αŒα‹΅', 'αˆ˜αˆ΅α‹α‹α‰΅', 'α‹΅αˆ…αŠα‰΅αŠ•', 'αˆˆαˆ›αˆΈαŠα', 'α‰ αˆšα‹°αˆ¨αŒˆα‹', 'α‰΅αŒαˆ', 'αŠ αŠ•α‹±', 'αŒ α‰ƒαˆš', 'መሣαˆͺα‹«', 'αˆŠαˆ†αŠ•', 'αˆ˜α‰»αˆ‰', 'α‰₯α‹™', 'α‹¨αˆšαŠαŒˆαˆ­αˆˆα‰΅', 'αŒ‰α‹³α‹­', 'αŠα‹', 'ፒ']