tarekeldeeb committed on
Commit
5d048d3
1 Parent(s): 1e9677b

Update README.md

Files changed (1)
  1. README.md +11 -0
README.md CHANGED
@@ -1,3 +1,14 @@
  ---
  license: other
+ language:
+ - ar
  ---
+ Arabic BPE Tokenization Using Google SentencePiece.
+
+ Natural Language Processing (NLP) is a branch of AI. One of the first steps in any NLP system is encoding text for the language model, and the challenge is how to represent words efficiently. Sub-word encoding suits Arabic well because Arabic words are often built from a stem plus affixes. For example, the word مدرساتهم is not treated as a single token but is split into three: مدرس, ات, and هم. That is the basic intuition. The segmentation is learned automatically from data, without hand-written rules or preprocessing.
+
+ Vocab size: 8000 (32K also available)
+
+ Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding
+
+ License: [Waqf v2](https://github.com/ojuba-org/waqf/tree/master/2.0)
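
The sub-word intuition in the README paragraph can be illustrated with a minimal byte-pair-encoding sketch in plain Python. This is not part of the commit and not SentencePiece's implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy corpus, and its word counts are all hypothetical and chosen only for illustration. The idea is to start from single characters and repeatedly merge the most frequent adjacent pair, so that frequent stems and affixes end up as single tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Split `word` into sub-word tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> count (assumed values, not real data).
toy_corpus = {
    "مدرس": 5,
    "مدرسات": 3,
    "مدرساتهم": 2,
    "كتابهم": 4,
    "معلمات": 3,
    "بيتهم": 2,
}

merges = learn_bpe(toy_corpus, num_merges=10)
print(segment("مدرساتهم", merges))
```

The exact split depends on the corpus and on the number of merges, so a toy run will not necessarily reproduce the three-way split described above. A real model with the stated 8000-entry vocabulary is trained on far more text.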