cahya committed
Commit e346883
1 Parent(s): 10b22bf

updated the readme

Files changed (1)
  1. README.md +55 -6
README.md CHANGED
@@ -1,14 +1,63 @@
---
language: id
- license: apache-2.0
- datasets:
- - id_liputan6
tags:
- summarization
---

- Bert2Bert Summarization with EncoderDecoder Framework.
- This model is a warm-started *BERT2BERT* model fine-tuned on the *id_liputan6* summarization dataset.

- Detail about this model will be added soon.

 
---
language: id
tags:
+ - pipeline:summarization
- summarization
+ - bert2bert
+ datasets:
+ - id_liputan6
+ license: apache-2.0
---

+ # Indonesian BERT2BERT Summarization Model
+
+ A finetuned BERT-base summarization model for Indonesian.
+
+ ## Finetuning Corpus
+
+ The `bert2bert-indonesian-summarization` model is based on `cahya/bert-base-indonesian-1.5G` by [cahya](https://huggingface.co/cahya), finetuned on the [id_liputan6](https://huggingface.co/datasets/id_liputan6) dataset.
+
+ ## Load Finetuned Model
+
+ ```python
+ from transformers import BertTokenizer, EncoderDecoderModel
+
+ tokenizer = BertTokenizer.from_pretrained("cahya/bert2bert-indonesian-summarization")
+ tokenizer.bos_token = tokenizer.cls_token
+ tokenizer.eos_token = tokenizer.sep_token
+ model = EncoderDecoderModel.from_pretrained("cahya/bert2bert-indonesian-summarization")
+ ```
+
+ ## Code Sample
+
+ ```python
+ from transformers import BertTokenizer, EncoderDecoderModel
+
+ tokenizer = BertTokenizer.from_pretrained("cahya/bert2bert-indonesian-summarization")
+ tokenizer.bos_token = tokenizer.cls_token
+ tokenizer.eos_token = tokenizer.sep_token
+ model = EncoderDecoderModel.from_pretrained("cahya/bert2bert-indonesian-summarization")
+
+ # Indonesian article to summarize (left empty here)
+ ARTICLE_TO_SUMMARIZE = ""
+
+ # generate summary
+ input_ids = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors='pt')
+ summary_ids = model.generate(input_ids,
+                              max_length=100,
+                              num_beams=2,
+                              repetition_penalty=2.5,
+                              length_penalty=1.0,
+                              early_stopping=True,
+                              no_repeat_ngram_size=2,
+                              use_cache=True)
+ summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
+ print(summary_text)
+ ```
+
+ Output:
+
+ ```
+
+ ```
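
For context on the "Finetuning Corpus" section above: the previous model card described this as a warm-started BERT2BERT model, and such encoder-decoder models are commonly warm-started from a single BERT checkpoint. Below is a minimal sketch of that warm start using the `transformers` `EncoderDecoderModel` API, assuming both encoder and decoder are initialized from `cahya/bert-base-indonesian-1.5G`; the actual fine-tuning script and hyperparameters are not part of this commit.

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Initialize encoder and decoder from the same Indonesian BERT-base checkpoint;
# the decoder's cross-attention weights start out randomly initialized and are
# learned during fine-tuning on the summarization data.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "cahya/bert-base-indonesian-1.5G",
    "cahya/bert-base-indonesian-1.5G",
)
tokenizer = BertTokenizer.from_pretrained("cahya/bert-base-indonesian-1.5G")

# Special-token ids the seq2seq generation loop expects, mirroring the
# bos/eos token setup used in the inference examples above.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```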
63