---
language:
- bn
pipeline_tag: summarization
---

# This model aims to summarize Bangla text.

## Model Description

[flax-community/gpt2-bengali](https://huggingface.co/flax-community/gpt2-bengali) was fine-tuned on
[BANSData: A Dataset for Bengali Abstractive News Summarization](https://www.kaggle.com/datasets/prithwirajsust/bengali-news-summarization-dataset) and
[Bangla Summarization Dataset(Prothom Alo)](https://www.kaggle.com/datasets/towhidahmedfoysal/bangla-summarization-datasetprothom-alo).

- **Developed by:** Faridul Reza Sagor & Abdul Wadud Shakib
- **Model type:** GPT2LMHeadModel
- **Language(s) (NLP):** Bengali
- **Finetuned from model:** [flax-community/gpt2-bengali](https://huggingface.co/flax-community/gpt2-bengali)


## Uses
Caution: Because this model was trained mainly on newspaper data, it does not
summarize Bangla stories, dialogues, or excerpts well.

```python
from transformers import GPT2LMHeadModel, AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")

model.to("cuda")

BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
BEGIN_TOKEN_ALT = "<।sum_begin।>"
END_TOKEN_ALT = " <।sum_end।>"
SUMMARY_TOKEN = "<।summary।>"

def processTxt(txt):
    txt = re.sub(r"।", "। ", txt)
    txt = re.sub(r",", ", ", txt)
    txt = re.sub(r"!", "। ", txt)
    txt = re.sub(r"\?", "। ", txt)
    # Drop straight and curly quotation marks.
    txt = re.sub(r"[\"'’‘]", "", txt)
    txt = re.sub(r";", "। ", txt)

    txt = re.sub(r"\s+", " ", txt)

    return txt


def index_of(val, in_text, after=0):
    try:
        return in_text.index(val, after)
    except ValueError:
        return -1

def summarize(txt):
    txt = processTxt(txt.strip())
    txt = BEGIN_TOKEN + txt + SUMMARY_TOKEN

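    inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt").to(model.device)
    # Size the output budget in tokens; len(txt) would count characters instead
    # and could push generation past the model's context window.
    prompt_len = inputs["input_ids"].shape[1]
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=prompt_len + 220,
        pad_token_id=tokenizer.eos_token_id,
    )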

    txt = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    print("Whole text completion: \n", txt)

    # Without the summary marker there is nothing to extract.
    marker = index_of(SUMMARY_TOKEN, txt)
    if marker == -1:
        return "No Summary!"
    start = marker + len(SUMMARY_TOKEN)

    end = index_of(END_TOKEN, txt, start)

    if end == -1:
        end = index_of(END_TOKEN_ALT, txt, start)

    if end == -1:
        end = index_of(BEGIN_TOKEN, txt, start)

    if end == -1:
        return txt[start:].strip()

    txt = txt[start:end].strip()

    end = index_of(SUMMARY_TOKEN,txt)

    if end == -1:
        return txt
    else:
        return txt[:end].strip()


print(summarize("your_bengali_text"))
```
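
The snippet above relies on a marker-based prompt layout: the article is wrapped between `<।summary_begin।>` and `<।summary।>`, and the model is expected to continue with the summary followed by an end marker. The sketch below illustrates that layout without loading the model. The helper names `build_prompt` and `extract_summary` are hypothetical, the completion string is hand-written rather than real model output, and only the primary end marker is handled (the fallbacks in `summarize` are omitted).

```python
# Marker strings copied from the usage snippet above.
BEGIN_TOKEN = "<।summary_begin।>"
SUMMARY_TOKEN = "<।summary।>"
END_TOKEN = " <।summary_end।>"


def build_prompt(article: str) -> str:
    """Wrap an already-normalized article in the markers used for prompting."""
    return BEGIN_TOKEN + article + SUMMARY_TOKEN


def extract_summary(completion: str) -> str:
    """Return the text between the summary marker and the end marker.

    Simplified for illustration: only the primary end marker is considered.
    """
    _, found, tail = completion.partition(SUMMARY_TOKEN)
    if not found:
        return "No Summary!"
    summary, _, _ = tail.partition(END_TOKEN)
    return summary.strip()


# Hypothetical completion, written by hand to show the expected shape.
prompt = build_prompt("ঢাকায় আজ ভারী বৃষ্টি হয়েছে। অনেক সড়কে পানি জমেছে। ")
fake_completion = prompt + " ঢাকায় ভারী বৃষ্টি" + END_TOKEN
print(extract_summary(fake_completion))
```

Keeping the extraction separate from generation makes it possible to check the string handling on its own before running the model on a GPU.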


## Contact
  faridul.reza.sagor@gmail.com