title: Bengali Bpe Tokenizer
emoji: π
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit
# Bengali BPE Tokenizer
## Dataset
Several raw Bengali corpora are listed in this [GitHub repository](https://github.com/sagorbrur/bangla-corpus). The following sources from that list were used to gather raw Bengali text for training the tokenizer:
- [Tab-delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/) - Selected sentence pairs from the [Tatoeba Project](http://tatoeba.org/home), containing approximately 6,500 English-to-Bengali sentence pairs. Only the Bengali sentences are extracted for training the tokenizer.
- [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase) - Only the input data from the validation set of the [Bengali paraphrases](https://huggingface.co/datasets/ai4bharat/IndicParaphrase/blob/main/data/bn_IndicParaphrase_v1.0.zip) is used for training the tokenizer. That validation set contains 10,000 Bengali sentences.
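
As an illustration of the extraction step for the tab-delimited sentence pairs, here is a minimal sketch. It assumes each line has the form `English<TAB>Bengali<TAB>attribution`, which should be checked against the downloaded file. The function names `has_bengali` and `extract_bengali_sentences` and the sample lines are hypothetical, not taken from this project's code.

```python
from typing import Iterable, List


def has_bengali(text: str) -> bool:
    """Return True if the text has at least one character in the Bengali Unicode block."""
    return any("\u0980" <= ch <= "\u09ff" for ch in text)


def extract_bengali_sentences(lines: Iterable[str], column: int = 1) -> List[str]:
    """Collect the Bengali field from tab-delimited lines.

    Assumption: the Bengali sentence sits in the second column (index 1).
    Lines that are too short or contain no Bengali characters are skipped.
    """
    sentences: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # malformed line, no Bengali column
        sentence = fields[column].strip()
        if sentence and has_bengali(sentence):
            sentences.append(sentence)
    return sentences


# Hypothetical sample lines standing in for the real file contents.
sample_lines = [
    "Go.\tযাও।\t(attribution text)",
    "I eat rice.\tআমি ভাত খাই।\t(attribution text)",
    "a malformed line without tabs",
]
bengali_only = extract_bengali_sentences(sample_lines)
```

For a real file, the same function could be fed an open file handle, for example `extract_bengali_sentences(open(path, encoding="utf-8"))`, where `path` points to the downloaded text file.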
## Tokenizer
The tokenizer artifacts are available at https://huggingface.co/sayanbanerjee32/bengali_tokenizer
## The HuggingFace Spaces Gradio App
The app takes one or more Bengali sentences as input and provides the following outputs:
1. Numeric token IDs that represent the input (using the encode function)
2. The sentence regenerated from those token IDs (using the decode function)
3. A visualization that maps each token to its Bengali text, as an explanation of the tokenization.
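
To make the three outputs concrete, here is a small, self-contained sketch of byte-level BPE. It is an assumption that the tokenizer works on UTF-8 bytes. The functions `train`, `encode`, `decode`, and `explain`, and the toy corpus, are hypothetical illustrations and not the code or artifacts used by this Space.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def count_pairs(ids: List[int]) -> Dict[Pair, int]:
    """Count how often each adjacent pair of token IDs occurs."""
    counts: Dict[Pair, int] = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge_pair(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text: str, num_merges: int) -> Dict[Pair, int]:
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (IDs 0-255)."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Pair, int] = {}
    for step in range(num_merges):
        counts = count_pairs(ids)
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges


def build_vocab(merges: Dict[Pair, int]) -> Dict[int, bytes]:
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict keeps training order
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab


def encode(text: str, merges: Dict[Pair, int]) -> List[int]:
    """Output 1: turn text into token IDs by replaying merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids: List[int], vocab: Dict[int, bytes]) -> str:
    """Output 2: rebuild the text from token IDs."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


def explain(ids: List[int], vocab: Dict[int, bytes]) -> List[Tuple[int, str]]:
    """Output 3: pair each token ID with the text fragment it covers."""
    return [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]


# Toy corpus for illustration only; the real tokenizer was trained on far more text.
corpus = "আমি ভাত খাই। আমি বই কিনি।"
merges = train(corpus, num_merges=20)
vocab = build_vocab(merges)

sentence = "আমি ভাত খাই।"
token_ids = encode(sentence, merges)
round_trip = decode(token_ids, vocab)
mapping = explain(token_ids, vocab)
```

A token that covers only part of a multi-byte character cannot be shown as text on its own, so `explain` renders such fragments with the Unicode replacement character. A production visualization would likely need to handle that case more gracefully.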