---
datasets:
- SKNahin/bengali-transliteration-data
base_model:
- facebook/mbart-large-50-many-to-many-mmt
tags:
- nlp
- seq2seq
---

# Model Card for Banglish to Bengali Transliteration using mBART

This model transliterates Banglish (Romanized Bengali) into Bengali script. It was fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data) dataset.

The notebook used for training is available here: [Kaggle Notebook](https://www.kaggle.com/code/shadabtanjeed/mbart-banglish-to-bengali-transliteration).

## Model Details

### Model Description

- **Developed by:** Shadab Tanjeed
- **Model type:** Sequence-to-sequence (Seq2Seq) Transformer model
- **Language(s) (NLP):** Bengali, Banglish (Romanized Bengali)
- **Finetuned from model:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

### Model Sources

- **Repository:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

## Uses

### Direct Use

The model is intended for direct transliteration of Banglish text into Bengali script.

### Downstream Use

It can be integrated into NLP applications that require Banglish-to-Bengali transliteration, such as chatbots, text normalization, and digital content processing, as sketched below.
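As an illustration, a thin wrapper like the following could expose the model to an application. The checkpoint name is a placeholder for wherever this fine-tuned model is hosted, and the `en_XX`/`bn_IN` language codes are assumptions about the fine-tuning setup (see also the getting-started example below).

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Placeholder: point this at the fine-tuned checkpoint, not the base model.
CHECKPOINT = "path/to/banglish-to-bengali-mbart"

tokenizer = MBart50TokenizerFast.from_pretrained(CHECKPOINT)
model = MBartForConditionalGeneration.from_pretrained(CHECKPOINT)
tokenizer.src_lang = "en_XX"  # assumption: Banglish treated as romanized input

def transliterate(texts: list[str]) -> list[str]:
    """Transliterate a batch of Banglish strings into Bengali script."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        # Assumption: Bengali (bn_IN) was the target language during fine-tuning.
        forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```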

### Out-of-Scope Use

The model is not designed for translation beyond transliteration, and it may not perform well on mixed-language or code-switched text.

## Bias, Risks, and Limitations

- The model may struggle with ambiguous words that admit multiple valid transliterations.
- It may not perform well on informal or highly stylized text.
- Limited dataset coverage can lead to errors on uncommon words.

### Recommendations

Users should validate outputs, especially in critical applications, and consider further fine-tuning where necessary.

## How to Get Started with the Model

The snippet below shows the basic generation loop. Note that `model_name` should point to this fine-tuned checkpoint rather than the base model (the base-model name below is a placeholder), and the `en_XX`/`bn_IN` language codes are assumptions about the fine-tuning setup.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Placeholder: replace with the fine-tuned checkpoint; the base model alone
# will translate rather than transliterate.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Assumption: Banglish input was tokenized with the English language code.
tokenizer.src_lang = "en_XX"

text = "ami tomake bhalobashi"
inputs = tokenizer(text, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    # Assumption: Bengali (bn_IN) was the target language during fine-tuning.
    forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],
)
output = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

print(output)  # Expected Bengali transliteration, e.g. "আমি তোমাকে ভালোবাসি"
```

## Training Details

### Training Data

The training data is [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), which contains pairs of Banglish (Romanized Bengali) text and the corresponding Bengali script.

### Training Procedure

#### Preprocessing

- Tokenization was performed using the mBART tokenizer.
- Text normalization techniques were applied to remove noise (both steps are sketched below).
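A minimal sketch of this step, assuming the usual Hugging Face seq2seq pattern; the column names and the normalization rule are illustrative assumptions, not taken from the training notebook.

```python
import re

from datasets import load_dataset
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",  # assumption: Banglish treated as romanized input
    tgt_lang="bn_IN",  # assumption: Bengali as the target language
)

def normalize(text: str) -> str:
    """Illustrative normalization: collapse whitespace and trim."""
    return re.sub(r"\s+", " ", text).strip()

def preprocess(batch):
    # The column names 'rm' (Banglish) and 'bn' (Bengali) are assumptions
    # about the dataset schema.
    sources = [normalize(t) for t in batch["rm"]]
    targets = [normalize(t) for t in batch["bn"]]
    model_inputs = tokenizer(sources, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = load_dataset("SKNahin/bengali-transliteration-data", split="train")
tokenized_dataset = dataset.map(preprocess, batched=True,
                                remove_columns=dataset.column_names)
```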

#### Training Hyperparameters

- **Batch size:** 8
- **Learning rate:** 3e-5
- **Epochs:** 5
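For context, here is how these hyperparameters might be expressed with the Hugging Face `Seq2SeqTrainer`. Only the three values above come from this card; the remaining arguments, and the `model` and `tokenized_dataset` variables from the sketches above, are illustrative assumptions.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Batch size, learning rate, and epoch count match the values reported above;
# the remaining settings are illustrative defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-banglish-to-bengali",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,                      # the MBartForConditionalGeneration instance
    args=training_args,
    train_dataset=tokenized_dataset,  # output of the preprocessing step above
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```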

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation used the same dataset, [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data); no separate benchmark or quantitative results are reported here.
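Since no quantitative results are reported, a simple way to sanity-check outputs on a held-out split is character error rate (CER). The sketch below is plain Python and assumes the hypothetical `transliterate` helper defined earlier; the example pair is illustrative.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Hypothetical held-out pairs of (Banglish, reference Bengali).
pairs = [("ami tomake bhalobashi", "আমি তোমাকে ভালোবাসি")]
scores = [cer(bn, transliterate([rm])[0]) for rm, bn in pairs]
print(f"mean CER: {sum(scores) / len(scores):.3f}")
```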

## Technical Specifications

### Model Architecture and Objective

The model follows the Transformer-based encoder-decoder (Seq2Seq) architecture of mBART and is fine-tuned with a standard sequence-to-sequence generation objective.

#### Software

- **Framework:** Hugging Face Transformers

## Citation

If you use this model, please cite the dataset and the base model:

```bibtex
@misc{SKNahin2023,
  author    = {SK Nahin},
  title     = {Bengali Transliteration Dataset},
  year      = {2023},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
}

@article{liu2020multilingual,
  title   = {Multilingual Denoising Pre-training for Neural Machine Translation},
  author  = {Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2001.08210},
  year    = {2020}
}
```