---
language:
- bn
pipeline_tag: summarization
---

# This model aims to summarize Bangla text.

### Model Description

[flax-community/gpt2-bengali](https://huggingface.co/flax-community/gpt2-bengali) was fine-tuned on [BANSData: A Dataset for Bengali Abstractive News Summarization](https://www.kaggle.com/datasets/prithwirajsust/bengali-news-summarization-dataset) and [Bangla Summarization Dataset(Prothom Alo)](https://www.kaggle.com/datasets/towhidahmedfoysal/bangla-summarization-datasetprothom-alo).

- **Developed by:** Faridul Reza Sagor & Abdul Wadud Shakib
- **Model type:** GPT2LMHeadModel
- **Language(s) (NLP):** Bengali
- **Finetuned from model:** [flax-community/gpt2-bengali](https://huggingface.co/flax-community/gpt2-bengali)

## Uses

Caution: Because this model was trained mainly on newspaper data, it does not summarize Bangla stories, dialogue, or excerpts well.

```python
import re

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")
model.to(device)

# Markers that delimit the article and its summary in the prompt and completion.
BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
BEGIN_TOKEN_ALT = "<।sum_begin।>"
END_TOKEN_ALT = " <।sum_end।>"
SUMMARY_TOKEN = "<।summary।>"


def processTxt(txt):
    """Normalize punctuation and whitespace before building the prompt."""
    txt = re.sub(r"।", "। ", txt)   # add a space after each danda
    txt = re.sub(r",", ", ", txt)
    txt = re.sub(r"!", "। ", txt)   # treat !, ? and ; as sentence ends
    txt = re.sub(r"\?", "। ", txt)
    txt = re.sub(r";", "। ", txt)
    txt = re.sub(r"\"", "", txt)    # drop straight and curly quotes
    txt = re.sub(r"'", "", txt)
    txt = re.sub(r"’", "", txt)
    txt = re.sub(r"‘", "", txt)
    txt = re.sub(r"\s+", " ", txt)  # collapse runs of whitespace
    return txt


def index_of(val, in_text, after=0):
    """Like str.index, but return -1 instead of raising when val is absent."""
    try:
        return in_text.index(val, after)
    except ValueError:
        return -1


def summarize(txt):
    txt = processTxt(txt.strip())
    txt = BEGIN_TOKEN + txt + SUMMARY_TOKEN

    inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt").to(device)

    # Allow up to 220 new tokens beyond the prompt. The limit is counted in
    # tokens, so use the tokenized prompt length rather than len(txt).
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=inputs["input_ids"].shape[1] + 220,
        pad_token_id=tokenizer.eos_token_id,
    )

    txt = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    print("Whole text completion: \n", txt)

    # index_of returns -1 when the summary marker is missing.
    start = index_of(SUMMARY_TOKEN, txt) + len(SUMMARY_TOKEN)
    if start == len(SUMMARY_TOKEN) - 1:
        return "No Summary!"

    # Look for the first marker that can end the summary.
    end = index_of(END_TOKEN, txt, start)
    if end == -1:
        end = index_of(END_TOKEN_ALT, txt, start)
    if end == -1:
        end = index_of(BEGIN_TOKEN, txt, start)
    if end == -1:
        return txt[start:].strip()

    # Cut at a repeated summary marker, if the model produced one.
    txt = txt[start:end].strip()
    end = index_of(SUMMARY_TOKEN, txt)
    if end == -1:
        return txt
    return txt[:end].strip()


summarize('your_bengali_text')
```

## Contact

faridul.reza.sagor@gmail.com
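
## Example: parsing the generated completion

The usage code wraps the input as `<।summary_begin।>` + text + `<।summary।>` and then cuts the summary out of the decoded completion. The sketch below isolates that string handling so it can be tried without downloading the model. The helper names `build_prompt` and `extract_summary` and the English placeholder strings are assumptions made for illustration; they are not part of the published code, and the completion is a hand-written stand-in rather than real model output.

```python
BEGIN_TOKEN = "<।summary_begin।>"
SUMMARY_TOKEN = "<।summary।>"
# Markers that may follow the summary, tried in the same order as in summarize().
END_MARKERS = (" <।summary_end।>", " <।sum_end।>", BEGIN_TOKEN)


def build_prompt(text: str) -> str:
    """Hypothetical helper: wrap normalized text in the prompt layout."""
    return BEGIN_TOKEN + text + SUMMARY_TOKEN


def extract_summary(completion: str) -> str:
    """Hypothetical helper: return the text after the summary marker."""
    start = completion.find(SUMMARY_TOKEN)
    if start == -1:
        return "No Summary!"
    start += len(SUMMARY_TOKEN)

    # Use the first end marker that occurs after the summary begins.
    end = -1
    for marker in END_MARKERS:
        end = completion.find(marker, start)
        if end != -1:
            break
    if end == -1:
        return completion[start:].strip()

    summary = completion[start:end].strip()
    # Cut at a repeated summary marker, if one is present.
    repeat = summary.find(SUMMARY_TOKEN)
    return summary if repeat == -1 else summary[:repeat].strip()


# Hand-written stand-in for a decoded completion.
fake_completion = build_prompt("article text") + "short summary <।summary_end।> extra"
print(extract_summary(fake_completion))
```

Keeping the parsing separate from generation makes it possible to check the marker handling on fixed strings before running the model itself.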