# Indonesian BERT2GPT Summarization Model

Finetuned EncoderDecoder model using BERT-base as the encoder and GPT2-small as the decoder for Indonesian text summarization.
## Finetuning Corpus
The `bert2gpt-indonesian-summarization` model is based on `cahya/bert-base-indonesian-1.5G` and `cahya/gpt2-small-indonesian-522M` by cahya, finetuned using the `id_liputan6` dataset.
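For reference, the finetuning data can be loaded with the `datasets` library. A minimal sketch, assuming the raw Liputan6 files have been downloaded separately as the dataset card requires (the `data_dir` path is a placeholder, and the field names follow the published dataset schema):

```python
from datasets import load_dataset

# id_liputan6 requires a manual download of the Liputan6 source files;
# "canonical" is the standard configuration of the dataset
dataset = load_dataset("id_liputan6", "canonical",
                       data_dir="path/to/liputan6_data")

# Each example pairs an article with its summary
print(dataset["train"][0]["clean_article"])
print(dataset["train"][0]["clean_summary"])
```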
## Load Finetuned Model
```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("cahya/bert2gpt-indonesian-summarization")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained("cahya/bert2gpt-indonesian-summarization")
```
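`BertTokenizer` defines no `bos_token` or `eos_token` of its own, so the two assignments above reuse `[CLS]` and `[SEP]` as the start- and end-of-sequence markers that `generate()` expects during decoding.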
## Code Sample
```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("cahya/bert2gpt-indonesian-summarization")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained("cahya/bert2gpt-indonesian-summarization")

# Insert the Indonesian article to summarize here
ARTICLE_TO_SUMMARIZE = ""

# Generate summary with beam search combined with sampling
input_ids = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors='pt')
summary_ids = model.generate(input_ids,
                             min_length=20,
                             max_length=80,
                             num_beams=10,
                             repetition_penalty=2.5,
                             length_penalty=1.0,
                             early_stopping=True,
                             no_repeat_ngram_size=2,
                             use_cache=True,
                             do_sample=True,
                             temperature=0.8,
                             top_k=50,
                             top_p=0.95)

summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary_text)
```
Output:
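To summarize several articles in one call, here is a minimal batched sketch; the padding, truncation, and `max_length=512` settings are assumptions and not part of the original card:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("cahya/bert2gpt-indonesian-summarization")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained("cahya/bert2gpt-indonesian-summarization")

articles = ["...", "..."]  # placeholder Indonesian articles to summarize

# Tokenize as a padded batch; 512 caps the input at BERT's maximum length
batch = tokenizer(articles, return_tensors="pt",
                  padding=True, truncation=True, max_length=512)

# Pass the attention mask so padding tokens are ignored by the encoder
summary_ids = model.generate(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             min_length=20,
                             max_length=80,
                             num_beams=10,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

for ids in summary_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```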