lnetze committed on
Commit
fc7aef3
1 Parent(s): aac0c2c

Update README.md

Files changed (1)
  1. README.md +166 -0
README.md CHANGED
@@ -1,3 +1,169 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
+ language:
3
+ - de
4
+ # thumbnail: "url to a thumbnail used in social sharing"
5
+ tags:
6
+ - summarization
7
  license: mit
8
+ metrics:
9
+ - rouge
10
  ---
11
+
12
+ # German news title gen
13
+
14
+ This is a model for the task of news headline generation in German.
15
+
16
+ While this task is closely related to summarization, headlines differ from summaries in length, structure, and language style. As a result, state-of-the-art summarization models are not ideally suited to headline generation and require further finetuning on this task.
17
+
18
+ For this model, [mT5-base](https://huggingface.co/google/mt5-base) by Google is used as a foundation model.
19
+
20
+ **The model is still a work in progress.**
21
+
22
+ ## Dataset & preprocessing
23
+ The model was finetuned on a corpus of news articles from [BR24](https://www.br.de/) published between 2015 and 2021. The texts are in German and cover a range of news topics such as politics, sports, and culture, with a focus on topics relevant to people living in Bavaria (Germany).
24
+
25
+ In a preprocessing step, article-headline pairs matching any of the following criteria were filtered out:
26
+ - very short articles (fewer words in the text than 3x the number of words in the headline)
27
+ - articles whose headlines consist only of words that do not occur in the text (compared after lemmatization and excluding stopwords)
28
+ - articles whose headline is just the name of a known text format (e.g. "Das war der Tag", a format summarizing the most important topics of the day)
29
+
30
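+ Furthermore, the prefix `summarize: ` was prepended to all articles, following the T5 task-prefix convention. (The public mT5 checkpoints were pretrained only on the unsupervised mC4 corpus, so the prefix serves as a consistent task marker rather than activating a pretrained summarization capability.)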
31
+
32
+ After filtering, the corpus contained 89098 article-headline pairs, of which 87306 were used for training, 902 for validation, and 890 for testing.
33
+
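The filtering criteria and the input prefix described above can be sketched roughly as follows. This is a minimal illustration, not the original preprocessing code: the stopword list and the set of known format names are hypothetical stand-ins, and plain lowercased tokens are used in place of real lemmatization.

```python
import re

# Hypothetical stand-ins: the stopword list, lemmatizer, and format names
# used for the actual corpus are not given in this README.
STOPWORDS = {"der", "die", "das", "und", "in", "von", "war"}
KNOWN_FORMATS = {"das war der tag"}
PREFIX = "summarize: "


def content_words(text):
    """Lowercased word tokens without stopwords (no lemmatization in this sketch)."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def keep_pair(article, headline):
    """Return True if the article-headline pair passes all three filters."""
    # 1. Drop very short articles: fewer words than 3x the headline length.
    if len(article.split()) < 3 * len(headline.split()):
        return False
    # 2. Drop pairs whose headline shares no content word with the article.
    if not content_words(headline) & content_words(article):
        return False
    # 3. Drop headlines that are just the name of a known text format.
    if headline.strip().lower() in KNOWN_FORMATS:
        return False
    return True


def to_model_input(article):
    """Prepend the task prefix used during finetuning."""
    return PREFIX + article
```

Applying `keep_pair` to every pair and `to_model_input` to each surviving article would give inputs in the form the model expects.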
34
+ ## Training
35
+ After multiple test runs, the present model was finetuned using the following parameters:
36
+ - foundation-model: mT5-base
37
+ - input_prefix: "summarize: "
38
+ - num_train_epochs: 10
39
+ - learning_rate: 5e-5
40
+ - warmup_ratio: 0.3
41
+ - lr_scheduler_type: constant_with_warmup
42
+ - per_device_train_batch_size: 3
43
+ - gradient_accumulation_steps: 2
44
+ - fp16: False
45
+
46
+ Every 5000 steps, a checkpoint was stored and evaluated on the validation set. After training, the checkpoint with the lowest cross-entropy loss on the validation set was saved as the final model.
47
+
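To get a feel for what these settings imply, the following sketch works out the effective batch size, the approximate number of optimizer steps, and the shape of the `constant_with_warmup` schedule. The number of devices is not stated in this README, so a single device is assumed here; the step counts are therefore estimates and not logged values.

```python
# Values taken from the parameter list and dataset description above.
TRAIN_EXAMPLES = 87306
PER_DEVICE_BATCH = 3
GRAD_ACCUM = 2
EPOCHS = 10
LEARNING_RATE = 5e-5
WARMUP_RATIO = 0.3
CHECKPOINT_EVERY = 5000

NUM_DEVICES = 1  # assumption: the hardware setup is not stated

# Examples consumed per optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
# Optimizer updates per epoch (ceiling division).
steps_per_epoch = -(-TRAIN_EXAMPLES // effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)
num_checkpoints = total_steps // CHECKPOINT_EVERY


def learning_rate_at(step):
    """Linear ramp from 0 to the peak rate during warmup, constant afterwards."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    return LEARNING_RATE
```

Under the single-device assumption this gives an effective batch size of 6, 14551 updates per epoch, 145510 updates in total, roughly 43653 warmup steps, and 29 checkpoints at the 5000-step interval.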
48
+ ## Usage
49
+
50
+ Because the model was finetuned from mT5, it is used in the same way as the T5 model ([see docs](https://huggingface.co/docs/transformers/model_doc/t5)). Alternatively, the model can be used for inference through the Hugging Face [summarization pipeline](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/pipelines#transformers.SummarizationPipeline).
51
+
52
+ In both cases, the prefix `summarize: ` must be prepended to the input text.
53
+
54
+ ### Example: Direct model evaluation
55
+
56
+ ```python
57
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
58
+
59
+ model_id = ""  # set this to the model's Hub ID or a local checkpoint path
60
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
61
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
62
+
63
+ text = "Als Reaktion auf die Brandserie wurde am Mittwoch bei der Kriminalpolizei Würzburg eine Ermittlungskommission eingerichtet. Ich habe den Eindruck, der Brandstifter wird dreister, kommentiert Rosalinde Schraud, die Bürgermeisterin von Estenfeld, die Brandserie. Gerade die letzten beiden Brandstiftungen seien ungewöhnlich gewesen, da sie mitten am Tag und an frequentierten Straßen stattgefunden haben.Kommt der Brandstifter aus Estenfeld?Norbert Walz ist das letzte Opfer des Brandstifters von Estenfeld. Ein Unbekannter hat am Dienstagnachmittag sein Gartenhaus angezündet.Was da in seinem Kopf herumgeht, was da passiert – das ist ja unglaublich! Das kann schon jemand aus dem Ort sein, weil sich derjenige auskennt.Norbert Walz aus Estenfeld.Dass es sich beim Brandstifter wohl um einen Bürger ihrer Gemeinde handele, will die erste Bürgermeisterin von Estenfeld, Rosalinde Schraud, nicht bestätigen: In der Bevölkerung gibt es natürlich Spekulationen, an denen ich mich aber nicht beteiligen will. Laut Schraud reagiert die Bürgerschaft mit vermehrter Aufmerksamkeit auf die Brände: Man guckt mehr in die Nachbarschaft. Aufhören wird die Brandserie wohl nicht, solange der Täter nicht gefasst wird.Es wäre nicht ungewöhnlich, dass der Täter aus der Umgebung von Estenfeld stammt. Wir bitten deshalb Zeugen, die sachdienliche Hinweise sowohl zu den Bränden geben können, sich mit unserer Kriminalpolizei in Verbindung zu setzen.Philipp Hümmer, Sprecher des Polizeipräsidiums UnterfrankenFür Hinweise, die zur Ergreifung des Täters führen, hat das Bayerische Landeskriminalamt eine Belohnung von 2.000 Euro ausgesetzt."
64
+
65
+ input_text = "summarize: " + text
66
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
67
+
68
+ outputs = model.generate(input_ids)
69
+ generated_headline = tokenizer.decode(outputs[0], skip_special_tokens=True)
70
+ print(generated_headline)
71
+ ```
72
+
73
+ ### Example: Model evaluation using the Hugging Face pipeline
74
+ ```python
75
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
76
+
77
+ model_id = ""  # set this to the model's Hub ID or a local checkpoint path
78
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
79
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
80
+ headline_generator = pipeline(
81
+ "summarization", model=model, tokenizer=tokenizer
82
+ )
83
+
84
+ text = "Als Reaktion auf die Brandserie wurde am Mittwoch bei der Kriminalpolizei Würzburg eine Ermittlungskommission eingerichtet. Ich habe den Eindruck, der Brandstifter wird dreister, kommentiert Rosalinde Schraud, die Bürgermeisterin von Estenfeld, die Brandserie. Gerade die letzten beiden Brandstiftungen seien ungewöhnlich gewesen, da sie mitten am Tag und an frequentierten Straßen stattgefunden haben.Kommt der Brandstifter aus Estenfeld?Norbert Walz ist das letzte Opfer des Brandstifters von Estenfeld. Ein Unbekannter hat am Dienstagnachmittag sein Gartenhaus angezündet.Was da in seinem Kopf herumgeht, was da passiert – das ist ja unglaublich! Das kann schon jemand aus dem Ort sein, weil sich derjenige auskennt.Norbert Walz aus Estenfeld.Dass es sich beim Brandstifter wohl um einen Bürger ihrer Gemeinde handele, will die erste Bürgermeisterin von Estenfeld, Rosalinde Schraud, nicht bestätigen: In der Bevölkerung gibt es natürlich Spekulationen, an denen ich mich aber nicht beteiligen will. Laut Schraud reagiert die Bürgerschaft mit vermehrter Aufmerksamkeit auf die Brände: Man guckt mehr in die Nachbarschaft. Aufhören wird die Brandserie wohl nicht, solange der Täter nicht gefasst wird.Es wäre nicht ungewöhnlich, dass der Täter aus der Umgebung von Estenfeld stammt. Wir bitten deshalb Zeugen, die sachdienliche Hinweise sowohl zu den Bränden geben können, sich mit unserer Kriminalpolizei in Verbindung zu setzen.Philipp Hümmer, Sprecher des Polizeipräsidiums UnterfrankenFür Hinweise, die zur Ergreifung des Täters führen, hat das Bayerische Landeskriminalamt eine Belohnung von 2.000 Euro ausgesetzt."
85
+ input_text = "summarize: " + text
86
+
87
+ generated_headline = headline_generator(input_text)[0]["summary_text"]
88
+ print(generated_headline)
89
+
90
+ ```
91
+
92
+
93
+ ## Limitations
94
+ Like most state-of-the-art summarization models, this model has issues with the factuality of the generated texts [^factuality]. **It is therefore strongly advised to have a human fact-check the generated headlines.**
95
+
96
+ An analysis of possible biases reproduced by the present model, regardless of whether they originate from our finetuning or from the underlying mT5 model, is beyond the scope of this work. We assume that biases exist within the model; analyzing them is a task for future work.
97
+
98
+ As the model was trained on news articles from the time range 2015-2021, further biases and factual errors could emerge due to topic shifts in news articles and changes in the (e.g. political) situation.
99
+
100
+ ## Evaluation
101
+
102
+ The model was evaluated on a held-out test set consisting of 890 article-headline pairs.
103
+
104
+ ### Quantitative
105
+
106
+ | model | Rouge1 | Rouge2 | RougeL | RougeLsum |
107
+ |-|-|-|-|-|
108
+ | [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2)| 0.107 | 0.0297 | 0.098 | 0.098 |
109
+ | our-model | 0.3131 | 0.0873 | 0.1997 | 0.1997 |
110
+
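For readers unfamiliar with ROUGE, the following is a simplified sketch of how an n-gram overlap score such as Rouge1 or Rouge2 can be computed for one headline pair. It assumes the F-measure variant and plain whitespace tokenization; the evaluation tooling actually used for the table above is not specified here and may tokenize or aggregate differently.

```python
from collections import Counter


def rouge_n_f1(reference, candidate, n=1):
    """F1 over n-gram overlap between a reference and a generated headline."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Calling `rouge_n_f1(reference, candidate, n=2)` gives the bigram variant. A corpus-level score is then the mean over all test pairs.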
111
+ To evaluate the factuality of the generated headlines with respect to the input text, we use three state-of-the-art metrics for summary evaluation (the parameters were chosen according to the recommendations from the respective papers or GitHub repositories):
112
+
113
+ - **SummaC-ZS** [^summac]
114
+ Yields a score between -1 and 1, representing the difference between entailment probability and contradiction probability (-1: the headline is not entailed by the text and is completely contradicted by it, 1: the headline is fully entailed by the text and not contradicted by it).
115
+
116
+ Parameters:
117
+ - `model_name`: [vitc](https://huggingface.co/tals/albert-xlarge-vitaminc-mnli)
118
+ - **QAFactEval** [^qafacteval]
119
+ Uses the LERC (QuIP) score, which is reported to perform best in the corresponding paper. The score ranges from 0 to 5 and represents the overlap between the answers obtained from the headline and from the text to questions generated from the headline (0: no overlap, 5: perfect overlap).
120
+
121
+ Parameters:
122
+ - `use_lerc_quip`: True
123
+
124
+ - **DAE (dependency arc entailment)** [^dae]
125
+ Yields a binary value of either 0 or 1, representing whether all dependency arcs in the headline are entailed by the text (0: at least one dependency arc is not entailed, 1: all dependency arcs are entailed).
126
+
127
+ Parameters:
128
+ - model checkpoint: DAE_xsum_human_best_ckpt
129
+ - `model_type`: model_type
130
+ - `max_seq_length`: 512
131
+
132
+
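The score ranges described above can be illustrated with a small sketch. It only shows how an entailment-minus-contradiction score and a binary all-arcs-entailed score behave and how per-example scores are averaged; it does not reproduce the actual metric implementations, and all probabilities and arc labels below are made-up values.

```python
from statistics import mean


def summac_style_score(p_entailment, p_contradiction):
    """Difference of NLI probabilities, ranging from -1 to 1."""
    return p_entailment - p_contradiction


def dae_style_score(arcs_entailed):
    """1 if every dependency arc is entailed, otherwise 0."""
    return int(all(arcs_entailed))


# Made-up values for three hypothetical headlines.
nli_probabilities = [(0.9, 0.05), (0.2, 0.7), (0.5, 0.5)]
arc_labels = [[True, True], [True, False], [True]]

mean_summac = mean(summac_style_score(e, c) for e, c in nli_probabilities)
mean_dae = mean(dae_style_score(labels) for labels in arc_labels)
```

Because the binary score is averaged over examples, its mean can be read as the share of headlines in which every dependency arc is entailed.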
133
+ Each metric is calculated for all article-headline pairs in the test set and the respective mean score over the test set is reported.
134
+
135
+ | model | SummaC-ZS | QAFactEval | DAE |
136
+ |-|-|-|-|
137
+ | [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2) | 0.6969 | 3.3023 | 0.8292 |
138
+ | our-model | 0.4419 | 1.9265 | 0.7438 |
139
+
140
+ Our model scores lower than the T-Systems model on all three factuality metrics. Based on human evaluation, a likely explanation is that matching the structure and style specific to headlines requires the headline generation model to be more abstractive than a summarization model, which leads to a higher frequency of hallucinations in the generated output.
141
+
142
+ ### Qualitative
143
+ A qualitative evaluation conducted by members of the BR AI + Automation Lab showed that the model succeeds in producing headlines that match the language and style of news headlines, but also confirmed the issues with factual consistency that are common to state-of-the-art summarization models.
144
+
145
+ ## Future work
146
+
147
+ Future work on this model will focus on generating headlines that are more factually consistent with the source text. Ideas to achieve this goal include:
148
+ - Use of coreference resolution as an additional preprocessing step to make the relations within the text more explicit to the model.
149
+ - Use of contrastive learning [^contrastive_learning]
150
+ - Use of different models for different news topics: since different topics seem to be prone to different types of errors, more specialized models may be able to improve performance.
151
+ - Use of factuality metric models for reranking beam search candidates in the generation step.
152
+ - Analysis of the biases present in the model.
153
+
154
+
155
+
156
+ [^factuality]: Maynez, Joshua, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. “On Faithfulness and Factuality in Abstractive Summarization.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1906–19. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.173.
157
+
158
+ [^summac]: Laban, Philippe, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. “SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization.” Transactions of the Association for Computational Linguistics 10 (February 9, 2022): 163–77. https://doi.org/10.1162/tacl_a_00453.
159
+ Code: https://github.com/tingofurro/summac
160
+
161
+ [^qafacteval]: Fabbri, Alexander R., Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. “QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization.” arXiv, April 29, 2022. https://doi.org/10.48550/arXiv.2112.08542.
162
+ Code: https://github.com/salesforce/QAFactEval
163
+
164
+ [^dae]: Goyal, Tanya, and Greg Durrett. “Annotating and Modeling Fine-Grained Factuality in Summarization.” arXiv, April 9, 2021. http://arxiv.org/abs/2104.04302.
165
+ Code: https://github.com/tagoyal/factuality-datasets
166
+
167
+ [^contrastive_learning]: Cao, Shuyang, and Lu Wang. “CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6633–49. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.emnlp-main.532.
168
+
169
+