tarekeldeeb
/

arabic_bpe_8k


tarekeldeeb committed on Mar 23, 2023

Commit

5d048d3

•

1 Parent(s): 1e9677b

Update README.md

Files changed (1)

README.md +11 -0


@@ -1,3 +1,14 @@
 ---
 license: other
 ---

 ---
 license: other
+language:
+- ar
 ---
+Arabic BPE Tokenization Using Google SentencePiece.
+Natural Language Processing (NLP) is a branch of AI. One of the first steps in any NLP system is encoding text for the language model. The challenge is how to represent words efficiently. Sub-word encoding is well suited to Arabic. For example, the word مدرساتهم is not treated as a single token/word, but is split into three: مدرس, ات, and هم. This is the basic intuition. The splitting is learned automatically from the training text, without any hand-written rules or preprocessing.
+Vocab size: 8000 (32K also available)
+Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding
+License: [Waqf v2](https://github.com/ojuba-org/waqf/tree/master/2.0)
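
The sub-word intuition described in the README can be illustrated with a minimal byte pair encoding (BPE) sketch in plain Python. This is a toy illustration under stated assumptions, not the SentencePiece implementation and not the project's training code: the function names `merge_pair`, `learn_bpe`, and `segment` and the tiny corpus are hypothetical, and a model trained on a real corpus with 8000 merges will produce different splits.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a sequence of single characters, weighted by count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into sub-words by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus; real training uses a large Arabic text collection.
corpus = ["مدرس", "مدرس", "مدرسات", "مدرساتهم", "كتابهم", "معلمات"]
merges = learn_bpe(corpus, num_merges=6)
print(segment("مدرساتهم", merges))
```

No linguistic rules appear anywhere in the sketch: frequent character sequences are merged purely because they occur often, which is the sense in which the process needs no rules or preprocessing.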