---
license: mit
language:
- ar
library_name: tokenizers
pipeline_tag: summarization
tags:
- arabic
- summarization
- tokenizers
- BPE
---

## Byte Level (BPE) Tokenizer for Arabic

The Byte Level Tokenizer for Arabic is a robust tokenizer designed to handle Arabic text with precision and efficiency. It uses a `Byte-Pair Encoding (BPE)` approach to build a vocabulary of `50,000` tokens, catering specifically to the intricacies of the Arabic language.

### Goal

This tokenizer was created as part of developing an Arabic BART transformer model for summarization, built from scratch in `PyTorch`.
In adherence to the configurations outlined in the official [BART](https://arxiv.org/abs/1910.13461) paper, which specifies BPE tokenization, I sought a BPE tokenizer tailored specifically to Arabic.
While Arabic-only tokenizers and multilingual BPE tokenizers exist, a dedicated Arabic BPE tokenizer was not available. This gap inspired the creation of a `BPE` tokenizer focused solely on Arabic, ensuring alignment with BART's recommended configuration and improving the effectiveness of Arabic NLP tasks.

### Checkpoint Information

- **Name**: `IsmaelMousa/arabic-bpe-tokenizer`
- **Vocabulary Size**: `50,000`

### Overview

The Byte Level Tokenizer is optimized for Arabic text, which often includes a range of diacritics, different forms of the same word, and various prefixes and suffixes. This tokenizer addresses these challenges by breaking text down into byte-level tokens, ensuring that it can effectively process and represent the nuances of the Arabic language.
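
To make this concrete, the short sketch below (assuming the checkpoint above is reachable) encodes the same verb with and without diacritics; because byte-level BPE covers every possible byte, neither form falls back to an unknown token. The exact splits depend on the learned merges:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

# The same verb, bare and fully diacritized: كتب vs. كَتَبَ ("he wrote").
for form in ("كتب", "كَتَبَ"):
    encoded = tokenizer.encode(form)
    # Byte-level coverage means no <unk> tokens for either form;
    # the diacritized form simply splits into more byte-level pieces.
    print(form, "->", encoded.tokens)
```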

### Features

- **Byte-Pair Encoding (BPE)**: Efficiently manages a large vocabulary while maintaining accuracy.
- **Comprehensive Coverage**: Handles Arabic script, including diacritics and various word forms.
- **Flexible Integration**: Integrates easily with the `tokenizers` library for seamless tokenization.

### Installation

To use this tokenizer, you need the `tokenizers` library. If you haven't installed it yet, you can do so using pip:

```bash
pip install tokenizers
```

### Example Usage

Here is an example of how to use the Byte Level Tokenizer with the `tokenizers` library. It demonstrates tokenization of the Arabic sentence "لاشيء يعجبني, أريد أن أبكي" ("Nothing pleases me, I want to cry"):

```python
from tokenizers import Tokenizer

# Load the tokenizer checkpoint from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لاشيء يعجبني, أريد أن أبكي"

# Encode the text, then decode the resulting IDs back to a string
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)
```

Output:

```bash
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>']
Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2]
Decoded Text: لا شيء يعجبني, أريد أن أبكي
```

The token strings above look garbled because byte-level BPE displays them in the byte-to-unicode alphabet used by GPT-2-style tokenizers (`Ġ` marks a leading space); decoding the IDs restores readable Arabic text.
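
Because the tokenizer was built for a BART model trained in PyTorch, a natural next step is wrapping it for the `transformers` ecosystem. The sketch below is a minimal illustration, not part of this checkpoint: it assumes `transformers` (and `torch`) are installed, and the `<unk>`/`<pad>` token names are assumptions, since only `<s>` and `</s>` are visible in the output above:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer so it exposes the usual transformers API
# (batching, padding, truncation, tensor output, ...).
core = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=core,
    bos_token="<s>",    # id 0 in the output above
    eos_token="</s>",   # id 2 in the output above
    unk_token="<unk>",  # assumed name, not confirmed by this card
    pad_token="<pad>",  # assumed name, not confirmed by this card
)

batch = wrapped(["لاشيء يعجبني, أريد أن أبكي"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```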

### Tokenizer Details

- **Byte-Level Tokenization**: This method ensures that every byte of the input text is representable, making it suitable for languages with complex scripts.
- **Adaptability**: Can be retrained or used as-is, depending on your specific needs and application scenarios; see the sketch below.
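
For the retraining route, the `tokenizers` training API can rebuild a tokenizer with the same byte-level BPE design on your own corpus. This is a minimal sketch under stated assumptions: `corpus.txt` is a hypothetical local text file, and the special-token list mirrors common BART conventions rather than this checkpoint's exact configuration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Same overall design as this checkpoint: BPE over byte-level pieces.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # matches this checkpoint's vocabulary size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed
)

# corpus.txt: hypothetical file with one Arabic document per line.
tokenizer.train(["corpus.txt"], trainer=trainer)
tokenizer.save("arabic-bpe-tokenizer.json")
```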

### License

This project is licensed under the `MIT` License.