krotima1 committed on
Commit
9a8788e
2 Parent(s): d9443fd 15d3c4e

Merge branch 'main' of https://huggingface.co/krotima1/mbart-at2h-cs into main

Files changed (1)
  1. README.md +63 -0
README.md ADDED
@@ -0,0 +1,63 @@
+ ---
+ language:
+ - cs
+ - cs
+ tags:
+ - abstractive summarization
+ - mbart-cc25
+ - Czech
+ license: apache-2.0
+ datasets:
+ - private Czech News Center dataset news-based
+ - SumeCzech dataset news-based
+ metrics:
+ - rouge
+ - rougeraw
+ ---
+
+ # mBART fine-tuned model for Czech abstractive summarization (AT2H-CS)
+ This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on a Czech news dataset to produce Czech abstractive summaries.
+ ## Task
+ The model addresses the ``Abstract + Text to Headline`` (AT2H) task, which consists of generating a one- or two-sentence summary, treated as a headline, from a Czech news text.
+
+ ## Dataset
+ The model was trained on a large Czech news dataset built by concatenating two datasets: the private CNC dataset provided by Czech News Center and the [SumeCzech](https://ufal.mff.cuni.cz/sumeczech) dataset. The combined dataset contains around 1.75M Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Truncation and padding were set to 512 tokens for the encoder and 64 tokens for the decoder.
+
+ ## Training
+ The model was trained on 1x NVIDIA Tesla A100 40GB for 40 hours, 1x NVIDIA Tesla V100 32GB for 20 hours, and 4x NVIDIA Tesla A100 40GB for 20 hours. During training, the model saw 7936K documents, corresponding to roughly 5 epochs.
+
+ ## Use
+ The following example assumes that you are using the provided Summarizer.ipynb notebook.
+ ```python
+ from collections import OrderedDict
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ def summ_config():
+     cfg = OrderedDict([
+         # summarization model - checkpoint name on the Hugging Face Hub
+         ("model_name", "krotima1/mbart-at2h-cs"),
+         ("inference_cfg", OrderedDict([
+             ("num_beams", 4),
+             ("top_k", 40),
+             ("top_p", 0.92),
+             ("do_sample", True),
+             ("temperature", 0.89),
+             ("repetition_penalty", 1.2),
+             ("no_repeat_ngram_size", None),
+             ("early_stopping", True),
+             ("max_length", 64),
+             ("min_length", 10),
+         ])),
+         # texts to summarize
+         ("text", [
+             "Input your Czech text",
+         ]),
+     ])
+     return cfg
+ cfg = summ_config()
+ # load the model and tokenizer
+ model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
+ tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
+ # init summarizer (the Summarizer class is expected to come from Summarizer.ipynb)
+ summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
+ summarize(cfg["text"])
+ ```
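If the Summarizer.ipynb notebook is not at hand, the inference settings from the README can probably be passed straight to the standard `generate` method of the `transformers` library. The sketch below is an assumption about how that could look, not the notebook's actual implementation: `build_generate_kwargs` and `summarize_texts` are hypothetical helper names, and the 512-token input limit is taken from the encoder truncation length stated in the Dataset section.

```python
from collections import OrderedDict


def build_generate_kwargs(inference_cfg):
    """Drop entries set to None so that only explicit settings are forwarded."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


def summarize_texts(texts, model, tokenizer, generate_kwargs, max_input_tokens=512):
    """Hypothetical helper: tokenize a batch of texts, generate, and decode.

    `model` and `tokenizer` are assumed to be the objects returned by
    AutoModelForSeq2SeqLM.from_pretrained and AutoTokenizer.from_pretrained.
    """
    inputs = tokenizer(
        texts,
        truncation=True,              # cut inputs longer than the encoder limit
        max_length=max_input_tokens,  # 512 tokens, as stated for the encoder
        padding=True,                 # pad the batch to a common length
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, **generate_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Same values as the "inference_cfg" entry in the README example.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.89),
    ("repetition_penalty", 1.2),
    ("no_repeat_ngram_size", None),
    ("early_stopping", True),
    ("max_length", 64),
    ("min_length", 10),
])
generate_kwargs = build_generate_kwargs(inference_cfg)
```

With `model` and `tokenizer` loaded as in the README example, `summarize_texts(["Input your Czech text"], model, tokenizer, generate_kwargs)` should return a list with one summary string per input. Filtering out `None` values is a cautious choice here, since it leaves `no_repeat_ngram_size` at the library default instead of passing an explicit `None`.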