language:
- bn
datasets: csebuetnlp/BanglaParaphrase
licenses:
- cc-by-nc-sa-4.0
banglat5_banglaparaphrase
This repository contains the pretrained checkpoint of the BanglaT5 model, finetuned on the BanglaParaphrase dataset. BanglaT5 is a sequence-to-sequence transformer model pretrained with the "Span Corruption" objective. Models finetuned from this checkpoint achieve competitive results on the dataset (see Benchmarks).
For finetuning and inference, refer to the scripts in the official GitHub repository of BanglaNLG.
Note: This model was pretrained using a specific normalization pipeline, available at https://github.com/csebuetnlp/normalizer. All finetuning scripts in the official GitHub repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized with this pipeline before tokenizing to get the best results. A basic example is given below:
Using this model in transformers
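from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize  # pip install git+https://github.com/csebuetnlp/normalizer

# Load the finetuned checkpoint; use_fast=False selects the slow tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_banglaparaphrase")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_banglaparaphrase", use_fast=False)

# Bangla sentence to paraphrase (left empty here; supply your own input)
input_sentence = ""
# Normalize before tokenizing, since the model was trained on normalized text
input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids
generated_tokens = model.generate(input_ids)
# skip_special_tokens=True drops padding and end-of-sequence markers from the output
decoded_tokens = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded_tokens)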
Benchmarks
- Supervised fine-tuning
Test Set | Model | sacreBLEU | ROUGE-L | PINC | BERTScore | BERT-iBLEU |
---|---|---|---|---|---|---|
BanglaParaphrase | BanglaT5 | 32.8 | 63.58 | 74.40 | 94.80 | 92.18 |
BanglaParaphrase | IndicBART | 5.60 | 35.61 | 80.26 | 91.50 | 91.16 |
BanglaParaphrase | IndicBARTSS | 4.90 | 33.66 | 82.10 | 91.10 | 90.95 |
IndicParaphrase | BanglaT5 | 11.0 | 19.99 | 74.50 | 94.80 | 87.738 |
IndicParaphrase | IndicBART | 12.0 | 21.58 | 76.83 | 93.30 | 90.65 |
IndicParaphrase | IndicBARTSS | 10.7 | 20.59 | 77.60 | 93.10 | 90.54 |
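The PINC column rewards paraphrases that differ lexically from their source. This card does not state how the scores were computed, so the following is only a minimal sketch of the commonly used definition (the average, over n-gram orders, of the fraction of candidate n-grams that do not occur in the source). The names `ngrams`, `pinc`, and `max_n`, the whitespace tokenization, the maximum order of 4, and the 0-100 scaling are all assumptions made for illustration, not the evaluation code behind the table.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """Illustrative PINC score on a 0-100 scale (higher = more lexical change).

    Assumptions: whitespace tokenization, and n-gram orders for which the
    candidate is too short are skipped rather than counted as zero.
    """
    src_tokens = source.split()
    cand_tokens = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand_tokens, n)
        if not cand_ngrams:
            continue  # candidate has fewer than n tokens
        overlap = len(cand_ngrams & ngrams(src_tokens, n))
        # Fraction of candidate n-grams that are new relative to the source
        scores.append(1.0 - overlap / len(cand_ngrams))
    return 100.0 * sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, a candidate identical to its source scores 0 and a candidate sharing no words with it scores 100, which is why PINC is usually read alongside a semantic-similarity metric such as BERTScore rather than on its own.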
The dataset is available on the Hugging Face Hub as csebuetnlp/BanglaParaphrase.
Citation
If you use this model, please cite the following paper:
@article{akil2022banglaparaphrase,
title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
journal={arXiv preprint arXiv:2210.05109},
year={2022}
}