Arabic BPE Tokenization Using Google SentencePiece

Natural Language Processing (NLP) is a branch of AI, and one of the first steps in any NLP pipeline is encoding text for a language model. The challenge is how to represent words efficiently. Sub-word encoding is well suited to Arabic because of its rich morphology. For example, the word مدرساتهم ("their teachers") is not treated as a single token but split into three sub-words: مدرس (the stem "teacher"), ات (the feminine plural suffix), and هم (the possessive "their"). This is the basic intuition. The segmentation is learned automatically from data, without hand-written rules or preprocessing.
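The merges themselves are learned from corpus statistics. As a rough illustration of the idea (not the actual SentencePiece implementation, which this model uses), here is a minimal pure-Python sketch of the classic BPE procedure: start from characters, repeatedly merge the most frequent adjacent pair, and reuse the learned merges to segment new words. The toy corpus and helper names below are illustrative assumptions.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    spaced, joined = ' '.join(pair), ''.join(pair)
    return {word.replace(spaced, joined): freq for word, freq in vocab.items()}

def learn_bpe(corpus, num_merges):
    # Each word starts as a space-separated sequence of characters.
    vocab = Counter(' '.join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    # Apply the learned merges, in order, to split an unseen word.
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus dominated by the stem مدرس ("teacher").
corpus = ["مدرس"] * 10 + ["مدرسات"] * 5
merges = learn_bpe(corpus, 3)
print(segment("مدرسات", merges))  # the stem merges into one token: ['مدرس', 'ا', 'ت']
```

With a large real corpus and a 32,000-entry vocabulary, the same frequency-driven process yields the morpheme-like splits described above.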

Vocab size: 32000

Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding

License: Waqf v2
