---
license: mit
language:
- ar
library_name: tokenizers
pipeline_tag: summarization
tags:
- arabic
- summarization
- tokenizers
- BPE
---

## Byte Level (BPE) Tokenizer for Arabic

The Byte Level Tokenizer for Arabic is a robust tokenizer designed to handle Arabic text with precision and efficiency. It uses a `Byte-Pair Encoding (BPE)` approach to build a vocabulary of `50,000` tokens, catering specifically to the intricacies of the Arabic language.

### Goal

This tokenizer was created as part of developing an Arabic BART transformer model for summarization, built from scratch in `PyTorch`.
In adherence to the configurations outlined in the official [BART](https://arxiv.org/abs/1910.13461) paper, which specifies BPE tokenization, I sought a BPE tokenizer tailored specifically to Arabic.
While Arabic-only tokenizers and multilingual BPE tokenizers exist, a dedicated Arabic BPE tokenizer was not available. This gap inspired the creation of a `BPE` tokenizer focused solely on Arabic, ensuring alignment with BART's recommended configuration and improving the effectiveness of Arabic NLP tasks.

### Checkpoint Information

- **Name**: `IsmaelMousa/arabic-bpe-tokenizer`
- **Vocabulary Size**: `50,000`

### Overview

The Byte Level Tokenizer is optimized for Arabic text, which often includes a range of diacritics, different forms of the same word, and various prefixes and suffixes. This tokenizer addresses these challenges by breaking text down into byte-level tokens, ensuring that it can effectively process and represent the nuances of the Arabic language.
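
To make this concrete, the short sketch below (assuming the checkpoint above is reachable) encodes the same verb with and without diacritics; because byte-level BPE covers every possible byte, neither form falls back to an unknown token. The exact splits depend on the learned merges:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

# The same verb, bare and fully diacritized: كتب vs. كَتَبَ ("he wrote").
for form in ("كتب", "كَتَبَ"):
    encoded = tokenizer.encode(form)
    # Byte-level coverage means no <unk> tokens for either form;
    # the diacritized form simply splits into more byte-level pieces.
    print(form, "->", encoded.tokens)
```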

### Features

- **Byte-Pair Encoding (BPE)**: Efficiently manages a large vocabulary while maintaining accuracy.
- **Comprehensive Coverage**: Handles Arabic script, including diacritics and various word forms.
- **Flexible Integration**: Integrates easily with the `tokenizers` library for seamless tokenization.

### Installation

To use this tokenizer, you need the `tokenizers` library. If you haven't installed it yet, you can do so using pip:

```bash
pip install tokenizers
```

### Example Usage

Here is an example of how to use the Byte Level Tokenizer with the `tokenizers` library. It demonstrates tokenization of the Arabic sentence "لاشيء يعجبني, أريد أن أبكي" ("Nothing pleases me, I want to cry"):

```python
from tokenizers import Tokenizer

# Load the tokenizer checkpoint from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لاشيء يعجبني, أريد أن أبكي"

# Encode the text, then decode the resulting IDs back to a string
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)
```

Output:

```bash
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>']
Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2]
Decoded Text: لا شيء يعجبني, أريد أن أبكي
```

The token strings above look garbled because byte-level BPE displays them in the byte-to-unicode alphabet used by GPT-2-style tokenizers (`Ġ` marks a leading space); decoding the IDs restores readable Arabic text.
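
Because the tokenizer was built for a BART model trained in PyTorch, a natural next step is wrapping it for the `transformers` ecosystem. The sketch below is a minimal illustration, not part of this checkpoint: it assumes `transformers` (and `torch`) are installed, and the `<unk>`/`<pad>` token names are assumptions, since only `<s>` and `</s>` are visible in the output above:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer so it exposes the usual transformers API
# (batching, padding, truncation, tensor output, ...).
core = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=core,
    bos_token="<s>",    # id 0 in the output above
    eos_token="</s>",   # id 2 in the output above
    unk_token="<unk>",  # assumed name, not confirmed by this card
    pad_token="<pad>",  # assumed name, not confirmed by this card
)

batch = wrapped(["لاشيء يعجبني, أريد أن أبكي"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```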

### Tokenizer Details

- **Byte-Level Tokenization**: This method ensures that every byte of the input text is representable, making it suitable for languages with complex scripts.
- **Adaptability**: Can be retrained or used as-is, depending on your specific needs and application scenarios; see the sketch below.
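
For the retraining route, the `tokenizers` training API can rebuild a tokenizer with the same byte-level BPE design on your own corpus. This is a minimal sketch under stated assumptions: `corpus.txt` is a hypothetical local text file, and the special-token list mirrors common BART conventions rather than this checkpoint's exact configuration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Same overall design as this checkpoint: BPE over byte-level pieces.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # matches this checkpoint's vocabulary size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed
)

# corpus.txt: hypothetical file with one Arabic document per line.
tokenizer.train(["corpus.txt"], trainer=trainer)
tokenizer.save("arabic-bpe-tokenizer.json")
```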

### License

This project is licensed under the `MIT` License.