phucdev committed on
Commit
4d27d08
·
1 Parent(s): 92e63d2

Add blanc score module

Files changed (4)
  1. README.md +178 -7
  2. app.py +6 -0
  3. blanc_score.py +190 -0
  4. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,183 @@
- ---
- title: Blanc
- emoji: 🔥
- colorFrom: red
- colorTo: red
  sdk: gradio
- sdk_version: 4.44.1
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: BLANC
+ datasets:
+ -
+ tags:
+ - evaluate
+ - metric
+ description: "BLANC is a reference-free metric that evaluates the quality of document summaries by measuring how much they improve a pre-trained language model's performance on the document's text. It estimates summary quality without needing human-written references, using two variations: BLANC-help and BLANC-tune."
  sdk: gradio
+ sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---
 
+ # Metric Card for BLANC
+
+ ## Metric Description
+ BLANC is a reference-free evaluation metric designed to estimate the quality of document summaries. It assesses how much a summary helps a pre-trained language model (such as BERT) perform language understanding tasks on the original document's text.
+
+ There are two variations of BLANC:
+
+ 1. BLANC-help: The summary is concatenated with document sentences during the inference task. The BLANC-help score is defined as the difference in accuracy between unmasking tokens with the summary and with filler text (same length as the summary but consisting of period symbols). It measures how much the summary boosts the model's performance on masked tokens.
+ 2. BLANC-tune: The model is fine-tuned on the summary text before it processes the entire document. The BLANC-tune score is calculated by comparing the performance of the fine-tuned model with that of the original model, both tasked with unmasking tokens in the document text. This method reflects how much the model's ability to understand the document improves after learning from the summary.
+
+ Unlike traditional metrics such as ROUGE, BLANC does not require human-written reference summaries, making it fully human-free.
+
+ ## How to Use
+ BLANC takes two mandatory arguments: `documents` (a list of documents) and `summaries` (a list of predicted summaries).
+ You can specify which BLANC variation to use via the `blanc_score` parameter: `help` or `tune`.
+ ```python
+ from evaluate import load
+
+ blanc = load("phucdev/blanc")
+ documents = ["Jack drove his minivan to the bazaar to purchase milk and honey for his large family."]
+ summaries = ["Jack bought milk and honey."]
+ results = blanc.compute(documents=documents, summaries=summaries, blanc_score="help")
+ ```
+
+ ### Inputs
+
+ Args:
+ - `documents` (_list of str_): Documents.
+ - `summaries` (_list of str_): Predicted summaries.
+ - `model_name` (_str, optional_): BERT model type to use for evaluation. Default is `bert-base-uncased`.
+ - `measure` (_str, optional_): Measure type, either `improve` or `relative`, as defined in the BLANC paper. Default is `relative`.
+ - `blanc_score` (_str, optional_): BLANC score type, either `help` or `tune`. Default is `help`.
+ - `gap` (_int, optional_): Distance between words to mask during inference. Default is `2`.
+ - `gap_mask` (_int, optional_): Number of tokens to mask at each designated position during inference. Default is `1`.
+ - `gap_tune` (_int, optional_): Distance between words to mask during fine-tuning. Default is `2`.
+ - `gap_mask_tune` (_int, optional_): Number of tokens to mask at each designated position during fine-tuning. Default is `1`.
+ - `min_token_length_normal` (_int, optional_): Minimum number of characters in normal tokens to mask (whole words) during inference. Default is `4`.
+ - `min_token_length_lead` (_int, optional_): Minimum number of characters in lead tokens (first part of words) to mask during inference. Default is `2`.
+ - `min_token_length_followup` (_int, optional_): Minimum number of characters in follow-up tokens (continuations of words) to mask during inference. Default is `100`.
+ - `min_token_length_normal_tune` (_int, optional_): Minimum number of characters in normal tokens to mask during fine-tuning. Default is `-1`.
+ - `min_token_length_lead_tune` (_int, optional_): Minimum number of characters in lead tokens to mask during fine-tuning. Default is `-1`.
+ - `min_token_length_followup_tune` (_int, optional_): Minimum number of characters in follow-up tokens to mask during fine-tuning. Default is `-1`.
+ - `device` (_str, optional_): Device to run the model on, either `cpu` or `cuda`. Default is `cpu`.
+ - `random_seed` (_int, optional_): Random seed for Python and PyTorch. Default is `0`.
+ - `inference_batch_size` (_int, optional_): Batch size for inference. Default is `1`.
+ - `inference_mask_evenly` (_bool, optional_): Whether to mask every `gap` tokens during inference (`True`) or mask randomly with a probability of 0.15 (`False`). Default is `True`.
+
+ BLANC-help specific arguments:
+ - `filler_token` (_str, optional_): Token to use as filler in lieu of the summary. Default is `.`.
+ - `help_sep` (_str, optional_): Token used to separate the summary (or filler) from the sentence, or `""` for no separator. Default is `""`.
+
+ BLANC-tune specific arguments:
+ - `finetune_batch_size` (_int, optional_): Batch size to use when fine-tuning on the summary. Default is `1`.
+ - `finetune_epochs` (_int, optional_): Number of epochs for fine-tuning on the summary. Default is `10`.
+ - `finetune_mask_evenly` (_bool, optional_): Whether to mask every `gap` tokens during fine-tuning (`True`) or mask randomly with a probability of 0.15 (`False`). Default is `True`.
+ - `finetune_chunk_size` (_int, optional_): Number of summary tokens to use at a time during fine-tuning. Default is `64`.
+ - `finetune_chunk_stride` (_int, optional_): Number of tokens between summary chunks for fine-tuning. Default is `32`.
+ - `learning_rate` (_float, optional_): Learning rate for fine-tuning on the summary. Default is `5e-05`.
+ - `warmup_steps` (_int, optional_): Number of warmup steps for fine-tuning. Default is `0`.
+
+ ### Output Values
+
+ The metric outputs a dictionary with a single key, `blanc_help` or `blanc_tune`, depending on the chosen score:
+
+ - blanc_{tune,help}: The BLANC scores as floating-point values, one per document-summary pair.
+
+ The BLANC score typically ranges from 0 (summary is not helpful) to 0.3 (summary provides a 30% improvement in performance), although it can theoretically range between -1 and 1. Higher scores indicate better quality summaries.
+
+ #### Values from Popular Papers
+ Goyal et al. (2022) compare the performance of different summarization systems using reference-free automatic metrics in [News Summarization and Evaluation in the Era of GPT-3](https://arxiv.org/abs/2209.12356).
+ For the DailyMail dataset they report the following BLANC scores:
+
+ - PEGASUS: 0.1137
+ - BRIO: 0.1217
+ - T0: 0.0889
+ - GPT3-D2: 0.0983
+
+ ### Examples
+ BLANC-help:
+ ```python
+ from evaluate import load
+
+ blanc = load("phucdev/blanc")
+ documents = ["Jack drove his minivan to the bazaar to purchase milk and honey for his large family."]
+ summaries = ["Jack bought milk and honey."]
+ results = blanc.compute(documents=documents, summaries=summaries, blanc_score="help")
+ ```
+ BLANC-tune:
+ ```python
+ from evaluate import load
+
+ blanc = load("phucdev/blanc")
+ documents = ["Jack drove his minivan to the bazaar to purchase milk and honey for his large family."]
+ summaries = ["Jack bought milk and honey."]
+ results = blanc.compute(
+     documents=documents,
+     summaries=summaries,
+     blanc_score="tune",
+     finetune_mask_evenly=False
+ )
+ ```
+ By default, BLANC is run on the CPU. Using CUDA with batching is much faster:
+
+ BLANC-help:
+ ```python
+ from evaluate import load
+
+ blanc = load("phucdev/blanc")
+ documents = ["Jack drove his minivan to the bazaar to purchase milk and honey for his large family."]
+ summaries = ["Jack bought milk and honey."]
+ results = blanc.compute(
+     documents=documents,
+     summaries=summaries,
+     blanc_score="help",
+     device="cuda",
+     inference_batch_size=128
+ )
+ ```
+ BLANC-tune:
+ ```python
+ from evaluate import load
+
+ blanc = load("phucdev/blanc")
+ documents = ["Jack drove his minivan to the bazaar to purchase milk and honey for his large family."]
+ summaries = ["Jack bought milk and honey."]
+ results = blanc.compute(
+     documents=documents,
+     summaries=summaries,
+     blanc_score="tune",
+     device="cuda",
+     inference_batch_size=24,
+     finetune_mask_evenly=False,
+     finetune_batch_size=24
+ )
+ ```
+
+ ## Limitations and Bias
+ - Summary Length: BLANC tends to favor longer summaries as they generally provide more context and help the model better understand the document.
+ - No Reference Summaries: BLANC operates without human-written reference summaries, which may be advantageous in certain cases but could lack the nuance that human judgment provides.
+ - Limited by Language Model: The quality of BLANC scores is influenced by the choice of the underlying pre-trained language model (e.g., BERT), which may introduce biases inherent to the model itself.
+
+ ## Citation
+ ```tex
+ @inproceedings{vasilyev-etal-2020-fill,
+     title = "Fill in the {BLANC}: Human-free quality estimation of document summaries",
+     author = "Vasilyev, Oleg and
+       Dharnidharka, Vedant and
+       Bohannon, John",
+     editor = "Eger, Steffen and
+       Gao, Yang and
+       Peyrard, Maxime and
+       Zhao, Wei and
+       Hovy, Eduard",
+     booktitle = "Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2020.eval4nlp-1.2",
+     doi = "10.18653/v1/2020.eval4nlp-1.2",
+     pages = "11--20",
+     abstract = "We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document{'}s text. We present evidence that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements. And unlike ROUGE, the BLANC method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.",
+ }
+ ```
+
+ ## Further References
+ - [BLANC paper: Fill in the BLANC: Human-free quality estimation of document summaries](https://aclanthology.org/2020.eval4nlp-1.2/)
+ - [BLANC GitHub Repository](https://github.com/PrimerAI/blanc)
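
The metric card above defines BLANC-help as the difference in unmasking accuracy with the summary versus with filler text. The sketch below works that definition through on hand-made outcomes. It is only an illustration: `accuracy_gain` and the two outcome lists are hypothetical names and data, not part of the `blanc` package, and the `measure` argument (`improve` or `relative`) may normalize the counts differently in the real implementation.

```python
def accuracy_gain(correct_with_summary, correct_with_filler):
    """Difference in unmasking accuracy: with the summary minus with filler."""
    if len(correct_with_summary) != len(correct_with_filler):
        raise ValueError("Both outcome lists must cover the same masked tokens.")
    total = len(correct_with_summary)
    if total == 0:
        return 0.0
    # sum() counts the True entries, i.e. the correctly recovered tokens.
    return (sum(correct_with_summary) - sum(correct_with_filler)) / total


# Hypothetical outcomes for four masked tokens (True = token recovered).
with_summary = [True, True, False, True]
with_filler = [True, False, False, False]
gain = accuracy_gain(with_summary, with_filler)
print(gain)  # 0.5
```

Here the summary helps recover two more tokens out of four, so the gain is 0.5. A negative value would mean the model did better with filler than with the summary.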
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("phucdev/blanc_score")
+ launch_gradio_widget(module)
blanc_score.py ADDED
@@ -0,0 +1,190 @@
+ import datasets
+ import evaluate
+
+ from blanc import BlancHelp, BlancTune
+
+
+ _DESCRIPTION = """
+ BLANC is an automatic method for estimating the quality of document summaries without the need for human-written reference summaries. It works by measuring how much a summary helps a pre-trained language model (like BERT) perform a language understanding task, such as filling in masked words in a document. The two main variations are:
+
+ 1. BLANC-help: The summary is concatenated with document sentences during the inference task. The BLANC-help score is defined as the difference in accuracy between unmasking tokens with the summary and with filler text (same length as the summary but consisting of period symbols). It measures how much the summary boosts the model's performance on masked tokens.
+ 2. BLANC-tune: The model is fine-tuned on the summary text before it processes the entire document. The BLANC-tune score is calculated by comparing the performance of the fine-tuned model with that of the original model, both tasked with unmasking tokens in the document text. This method reflects how much the model's ability to understand the document improves after learning from the summary.
+
+ These BLANC measures show good correlation with human evaluations, similar to ROUGE scores, but do not require reference summaries.
+ See the BLANC paper for more details: https://aclanthology.org/2020.eval4nlp-1.2/
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Args:
+     documents (list of str): Documents.
+     summaries (list of str): Predicted summaries.
+     model_name (str, optional): BERT model type to use for evaluation. Default is "bert-base-uncased".
+     measure (str, optional): Measure type, either "improve" or "relative", as defined in the BLANC paper. Default is "relative".
+     blanc_score (str, optional): BLANC score type, either "help" or "tune". Default is "help".
+     gap (int, optional): Distance between words to mask during inference. Default is 2.
+     gap_mask (int, optional): Number of tokens to mask at each designated position during inference. Default is 1.
+     gap_tune (int, optional): Distance between words to mask during fine-tuning. Default is 2.
+     gap_mask_tune (int, optional): Number of tokens to mask at each designated position during fine-tuning. Default is 1.
+     min_token_length_normal (int, optional): Minimum number of characters in normal tokens to mask (whole words) during inference. Default is 4.
+     min_token_length_lead (int, optional): Minimum number of characters in lead tokens (first part of words) to mask during inference. Default is 2.
+     min_token_length_followup (int, optional): Minimum number of characters in follow-up tokens (continuations of words) to mask during inference. Default is 100.
+     min_token_length_normal_tune (int, optional): Minimum number of characters in normal tokens to mask during fine-tuning. Default is -1.
+     min_token_length_lead_tune (int, optional): Minimum number of characters in lead tokens to mask during fine-tuning. Default is -1.
+     min_token_length_followup_tune (int, optional): Minimum number of characters in follow-up tokens to mask during fine-tuning. Default is -1.
+     device (str, optional): Device to run the model on, either "cpu" or "cuda". Default is "cpu".
+     random_seed (int, optional): Random seed for Python and PyTorch. Default is 0.
+     inference_batch_size (int, optional): Batch size for inference. Default is 1.
+     inference_mask_evenly (bool, optional): Whether to mask every `gap` tokens during inference (True) or mask randomly with a probability of 0.15 (False). Default is True.
+
+     BLANC-help specific arguments:
+     filler_token (str, optional): Token to use as filler in lieu of the summary. Default is ".".
+     help_sep (str, optional): Token used to separate the summary (or filler) from the sentence, or "" for no separator. Default is "".
+
+     BLANC-tune specific arguments:
+     finetune_batch_size (int, optional): Batch size to use when fine-tuning on the summary. Default is 1.
+     finetune_epochs (int, optional): Number of epochs for fine-tuning on the summary. Default is 10.
+     finetune_mask_evenly (bool, optional): Whether to mask every `gap` tokens during fine-tuning (True) or mask randomly with a probability of 0.15 (False). Default is True.
+     finetune_chunk_size (int, optional): Number of summary tokens to use at a time during fine-tuning. Default is 64.
+     finetune_chunk_stride (int, optional): Number of tokens between summary chunks for fine-tuning. Default is 32.
+     learning_rate (float, optional): Learning rate for fine-tuning on the summary. Default is 5e-05.
+     warmup_steps (int, optional): Number of warmup steps for fine-tuning. Default is 0.
+
+ Returns:
+     blanc_help or blanc_tune (list of float): The BLANC scores computed with the selected method (BLANC-help or BLANC-tune), one per document-summary pair.
+ """
+
+ _CITATION = """
+ @inproceedings{vasilyev-etal-2020-fill,
+     title = "Fill in the {BLANC}: Human-free quality estimation of document summaries",
+     author = "Vasilyev, Oleg and
+       Dharnidharka, Vedant and
+       Bohannon, John",
+     editor = "Eger, Steffen and
+       Gao, Yang and
+       Peyrard, Maxime and
+       Zhao, Wei and
+       Hovy, Eduard",
+     booktitle = "Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2020.eval4nlp-1.2",
+     doi = "10.18653/v1/2020.eval4nlp-1.2",
+     pages = "11--20",
+     abstract = "We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document{'}s text. We present evidence that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements. And unlike ROUGE, the BLANC method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.",
+ }
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class BLANC(evaluate.Metric):
+     def _info(self):
+         # Metadata and input schema shown on the metric card and used by evaluate.
+         return evaluate.MetricInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://github.com/PrimerAI/blanc",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=[
+                 datasets.Features(
+                     {
+                         "documents": datasets.Value("string", id="sequence"),
+                         "summaries": datasets.Value("string", id="sequence"),
+                     }
+                 ),
+             ],
+             codebase_urls=["https://github.com/PrimerAI/blanc"],
+             reference_urls=[
+                 "https://github.com/PrimerAI/blanc",
+                 "https://aclanthology.org/2020.eval4nlp-1.2/",
+             ],
+         )
+
+     def _compute(
+         self,
+         documents,
+         summaries,
+         model_name="bert-base-uncased",
+         blanc_score="help",
+         measure="relative",
+         gap=2,
+         gap_mask=1,
+         gap_tune=2,
+         gap_mask_tune=1,
+         min_token_length_normal=4,
+         min_token_length_lead=2,
+         min_token_length_followup=100,
+         min_token_length_normal_tune=-1,
+         min_token_length_lead_tune=-1,
+         min_token_length_followup_tune=-1,
+         device="cpu",
+         random_seed=0,
+         inference_batch_size=1,
+         inference_mask_evenly=True,
+         filler_token=".",
+         help_sep="",
+         finetune_batch_size=1,
+         finetune_epochs=10,
+         finetune_mask_evenly=True,
+         finetune_chunk_size=64,
+         finetune_chunk_stride=32,
+         learning_rate=5e-05,
+         warmup_steps=0,
+     ):
+         # Choose between the BLANC-help and BLANC-tune methods based on the blanc_score argument.
+         if blanc_score == "help":
+             blanc_instance = BlancHelp(
+                 model_name=model_name,
+                 measure=measure,
+                 gap=gap,
+                 gap_mask=gap_mask,
+                 gap_tune=gap_tune,
+                 gap_mask_tune=gap_mask_tune,
+                 min_token_length_normal=min_token_length_normal,
+                 min_token_length_lead=min_token_length_lead,
+                 min_token_length_followup=min_token_length_followup,
+                 min_token_length_normal_tune=min_token_length_normal_tune,
+                 min_token_length_lead_tune=min_token_length_lead_tune,
+                 min_token_length_followup_tune=min_token_length_followup_tune,
+                 device=device,
+                 inference_batch_size=inference_batch_size,
+                 inference_mask_evenly=inference_mask_evenly,
+                 filler_token=filler_token,
+                 help_sep=help_sep,
+             )
+         elif blanc_score == "tune":
+             blanc_instance = BlancTune(
+                 model_name=model_name,
+                 measure=measure,
+                 gap=gap,
+                 gap_mask=gap_mask,
+                 gap_tune=gap_tune,
+                 gap_mask_tune=gap_mask_tune,
+                 min_token_length_normal=min_token_length_normal,
+                 min_token_length_lead=min_token_length_lead,
+                 min_token_length_followup=min_token_length_followup,
+                 min_token_length_normal_tune=min_token_length_normal_tune,
+                 min_token_length_lead_tune=min_token_length_lead_tune,
+                 min_token_length_followup_tune=min_token_length_followup_tune,
+                 device=device,
+                 random_seed=random_seed,
+                 inference_batch_size=inference_batch_size,
+                 inference_mask_evenly=inference_mask_evenly,
+                 finetune_batch_size=finetune_batch_size,
+                 finetune_epochs=finetune_epochs,
+                 finetune_mask_evenly=finetune_mask_evenly,
+                 finetune_chunk_size=finetune_chunk_size,
+                 finetune_chunk_stride=finetune_chunk_stride,
+                 learning_rate=learning_rate,
+                 warmup_steps=warmup_steps,
+             )
+         else:
+             raise ValueError(f"Invalid blanc_score argument: {blanc_score}. Choose 'help' or 'tune'.")
+
+         # Score each document-summary pair with the selected BLANC variant.
+         score = blanc_instance.eval_pairs(documents, summaries)
+
+         output_dict = {
+             f"blanc_{blanc_score}": score
+         }
+
+         return output_dict
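
The `gap` and `inference_mask_evenly` arguments documented above describe masking every `gap`-th token. The sketch below shows one plausible reading of that scheme with `gap_mask=1`. It is an assumption made for illustration, not the `blanc` package's code: `mask_evenly` is a hypothetical helper, and the `min_token_length_*` filters that decide which tokens are eligible for masking are ignored here.

```python
def mask_evenly(tokens, gap=2, mask_token="[MASK]"):
    """Return one masked copy of `tokens` per offset in range(gap).

    In the copy for a given offset, every token whose index is congruent
    to that offset modulo `gap` is replaced by `mask_token`.
    """
    masked_versions = []
    for offset in range(gap):
        masked = [
            mask_token if index % gap == offset else token
            for index, token in enumerate(tokens)
        ]
        masked_versions.append(masked)
    return masked_versions


# With gap=2, each token is masked in exactly one of the two copies.
for version in mask_evenly(["jack", "bought", "milk", "and", "honey"], gap=2):
    print(version)
```

Under this reading, a larger `gap` hides fewer tokens per pass and leaves the model more surrounding context, but needs more passes to cover every token.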
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ blanc~=0.3.4
+ evaluate~=0.4.3