rasyosef committed on
Commit
eb373ec
•
1 Parent(s): 0b8cdeb

Create README.md

Files changed (1)
  1. README.md +25 -0
README.md ADDED
@@ -0,0 +1,25 @@
+ ---
+ license: mit
+ datasets:
+ - oscar
+ - mc4
+ language:
+ - am
+ library_name: transformers
+ ---
+ # Amharic WordPiece Tokenizer
+ This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It is the same as the **BERT** tokenizer, but trained from scratch on an Amharic dataset with a vocabulary size of `30522`.
+
+ # How to use
+ You can load the tokenizer from the Hugging Face Hub as follows.
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
+ tokenizer.tokenize("የዓለምአቀፉ ነጻ ንግድ መስፋፋት ድህነትን ለማሸነፍ በሚደረገው ትግል አንዱ ጠቃሚ መሣሪያ ሊሆን መቻሉ ብዙ የሚነገርለት ጉዳይ ነው።")
+ ```
+
+ Output:
+ ```python
+ ['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ', 'መስፋፋት', 'ድህነትን', 'ለማሸነፍ', 'በሚደረገው', 'ትግል', 'አንዱ', 'ጠቃሚ', 'መሣሪያ', 'ሊሆን', 'መቻሉ', 'ብዙ', 'የሚነገርለት', 'ጉዳይ', 'ነው', '።']
+ ```
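
In the output above, WordPiece marks word-internal pieces with a `##` prefix (e.g. the first word is split into a stem and a `##`-prefixed continuation). As a minimal sketch of what that convention means, the helper below (`detokenize` is a hypothetical name, not part of this repo or of `transformers`) stitches such tokens back into words:

```python
def detokenize(tokens):
    """Join WordPiece tokens: a '##'-prefixed piece attaches to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and glue onto the last word
        else:
            words.append(tok)
    return " ".join(words)

tokens = ['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ']
print(detokenize(tokens))  # የዓለምአቀፉ ነጻ ንግድ
```

In practice you would use the tokenizer's own `convert_tokens_to_string` or `decode` methods for this; the sketch only illustrates the `##` continuation rule.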