seonil committed on
Commit
2885a60
1 Parent(s): eb64f0c

push to hfspace

README.md CHANGED
@@ -1,13 +1,73 @@
1
- ---
2
- title: Harim Plus
3
- emoji: 🌖
4
- colorFrom: gray
5
- colorTo: gray
6
- sdk: gradio
7
- sdk_version: 3.9.1
8
- app_file: app.py
9
- pinned: false
10
- license: bsd-3-clause
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # HaRiM+
2
+ **HaRiM+: Evaluating Summary Quality with Hallucination Risk**, accepted at AACL 2022 ([paper](https://arxiv.org/abs/2211.12118)). <br />
3
+ <br />
4
+ HaRiM+ is a reference-free metric for summarization that harnesses the power of a summarization model to estimate the quality of a summary-article pair. <br />
5
+ Note that this metric is reference-free and does not require training: it needs neither a reference text to compare against the generation nor any model training for scoring.
6
+
7
+ ## Quick Start
8
+ ### install
9
+ ```bash
10
+ # assumes torch, transformers, pandas, tqdm, fire and datasets are installed
11
+ pip install evaluate
12
+ # pip install -r requirements.txt
13
+ ```
14
+ ### example
15
+ ```python
16
+ import evaluate
17
+ from pprint import pprint
18
+
19
+ # example from the paper
20
+ art = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose 5-1 drubbing of Spain in the group stage in Brazil last year marked the end of the Iberian nation's six-year domination of the world game, scored two early goals at the Amsterdam Arena and held on against some determined Spain pressure in the second half for a 2-0 success. They became the first team to inflict two defeats on Del Bosque since he took over in 2008 but the gruff 64-year-old had used the match to try out several new faces and he fielded a largely experimental, second-string team. Stefan de Vrij (right) headed Holland in front against Spain at the Amsterdam Arena on Tuesday Gerard Pique (left) could do nothing to stop Davy Klaassen doubling the Dutch advantage Malaga forward Juanmi and Sevilla midfielder Vitolo became the 55th and 56th players to debut under Del Bosque, while the likes of goalkeeper David de Gea, defenders Raul Albiol, Juan Bernat and Dani Carvajal and midfielder Mario Suarez all started the game. 'The national team's state of health is good,' centre back Gerard Pique told reporters. 'We are in a process where players are coming into the team and gathering experience,' added the Barcelona defender. 'We are second in qualifying (for Euro 2016) and these friendly games are for experimenting. 'I am not that worried about this match because we lost friendlies in previous years and then ended up winning titles.' David de Gea was given a start by Vicente del Bosque but could not keep out De Vrij's header here Dani Carvajal (centre) was another squad player given a chance to impress against Holland Del Bosque will be confident he can find the right mix of players to secure Spain's berth at Euro 2016 in France next year, when they will be chasing an unprecedented third straight title. Slovakia are the surprise leaders in qualifying Group C thanks to a 2-1 win over Spain in Zilina in October and have a maximum 15 points from five of 10 matches. Spain are second on 12 points, three ahead of Ukraine, who they beat 1-0 in Seville on Friday. Del Bosque's side host Slovakia in September in a match that could decide who goes through to the finals as group winners. 'The team is in good shape,' forward Pedro told reporters. 'We have a very clear idea of our playing style and we are able to count on people who are gradually making a place for themselves in the team.'"""
21
+
22
+ summaries = [
23
+ "holland beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
24
+ "holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque will be hoping to find the right mix of players to the world cup . gerard pique could make the right mix of players to the tournament .",
25
+ "del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
26
+ "holland could not beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
27
+ ]
28
+ articles = [art] * len(summaries)
29
+
30
+ scorer = evaluate.load('NCSOFT/harim_plus')
31
+ scores = scorer.compute(predictions=summaries, references=articles)  # optional kwargs: use_aggregator=False, tokenwise_score=False, bsz=32
32
+ pprint(scores['harim+'])
33
+ >>> [1.8230078220367432,
34
+ 1.5361897945404053,
35
+ 1.806436538696289,
36
+ 1.7360382080078125
37
+ ]
38
+
39
+ ```
40
+
41
+ ## Powering HaRiM+ score with other summarization model checkpoints
42
+ HaRiM+ accepts any checkpoint loadable by <code>transformers.AutoModelForSeq2SeqLM</code>, i.e. an encoder-decoder model. <br />
43
+ In principle, HaRiM+ is expected to work for machine translation as well. It does, but no better than BARTScore (Yuan et al.), whereas it excels at summarization.
44
+
45
+ ```python
46
+
47
+ newharim = evaluate.load('NCSOFT/harim_plus', pretrained_name='local path or Hub checkpoint name')  # , tokenizer=custom_tokenizer
48
+ ```
49
+
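+ For example, a minimal sketch reusing `summaries` and `articles` from the Quick Start example above (assuming the checkpoint below is available on the Hub and loadable by <code>AutoModelForSeq2SeqLM</code>):
+ ```python
+ import evaluate
+
+ # 'Yale-LILY/brio-cnndm-uncased' is only an illustrative choice of summarization checkpoint
+ brio_scorer = evaluate.load('NCSOFT/harim_plus', pretrained_name='Yale-LILY/brio-cnndm-uncased')
+ brio_scores = brio_scorer.compute(predictions=summaries, references=articles)
+ ```
+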
50
+ ## Speed and Resource requirements
51
+ HaRiM+ requires a GPU for practical speed, but it only loads the encoder-decoder model of your choice (default: facebook/bart-large-cnn). Empirically, its resource requirements and speed are similar to BERTScore.
52
+
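+ If GPU memory is tight, lower the minibatch size when calling <code>compute</code>; a minimal sketch using the `bsz` and `use_aggregator` keyword arguments shown in the Quick Start example:
+ ```python
+ # smaller minibatches trade speed for memory; use_aggregator=True returns the mean over the test set
+ scores = scorer.compute(predictions=summaries, references=articles, bsz=8, use_aggregator=True)
+ print(scores['harim+'])  # a single float when use_aggregator=True
+ ```
+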
53
+ ## Citation
54
+ Please cite as follows:
55
+ ```
56
+ @inproceedings{son-etal-2022-harim,
57
+ title = "{H}a{R}i{M}$^+$: Evaluating Summary Quality with Hallucination Risk",
58
+ author = "Son, Seonil (Simon) and
59
+ Park, Junsoo and
60
+ Hwang, Jeong-in and
61
+ Lee, Junghwa and
62
+ Noh, Hyungjong and
63
+ Lee, Yeonsoo",
64
+ booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
65
+ month = nov,
66
+ year = "2022",
67
+ address = "Online only",
68
+ publisher = "Association for Computational Linguistics",
69
+ url = "https://aclanthology.org/2022.aacl-main.66",
70
+ pages = "895--924",
71
+ abstract = "One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.",
72
+ }
73
+ ```
__pycache__/harim_scorer.cpython-39.pyc ADDED
Binary file (7.05 kB).
 
app.py ADDED
@@ -0,0 +1,6 @@
1
+ import evaluate
2
+ from evaluate.utils import launch_gradio_widget
3
+
4
+
5
+ module = evaluate.load("NCSOFT/harim_plus")
6
+ launch_gradio_widget(module)
harim_plus.py ADDED
@@ -0,0 +1,121 @@
1
+ import datasets
2
+ import evaluate
3
+
4
+ from harim_scorer import Harimplus_Scorer
5
+
6
+
7
+
8
+ logger = evaluate.logging.get_logger(__name__)
9
+
10
+ CODEBASE_URL=''
11
+ PAPER_URL='https://arxiv.org/abs/2211.12118'
12
+
13
+ _CITATION = """\
14
+ @inproceedings{harimplus,
15
+ title={HaRiM+: Evaluating Summary Quality with Hallucination Risk},
16
+ author={Seonil Son and Junsoo Park and Jeong-in Hwang and Junghwa Lee and Hyungjong Noh and Yeonsoo Lee},
17
+ booktitle={AACL},
18
+ year={2022},
19
+ url={https://aclanthology.org/2022.aacl-main.66}
20
+ }
21
+ """
22
+
23
+ _DESCRIPTION = """\
24
+ HaRiM+ is a reference-free evaluation metric for summarization (scoring a summary only requires its source article) that harnesses the power of a summarization model.
25
+ It works well for ranking summary-article pairs according to summary quality.
26
+ Note that the score range is unbounded.
27
+
28
+ The summarization model inside HaRiM+ reads the paired source article and evaluates how good the summary is.
29
+
30
+ HaRiM+ is proven effective for benchmarking summarization systems (system-level performance) as well as for ranking article-summary pairs (segment-level performance) across comprehensive aspects such as factuality, consistency, coherence, fluency, and relevance. For details, refer to our paper published at AACL 2022.
31
+ """
32
+
33
+ _KWARGS_DESCRIPTION = """
34
+ HaRiM+ score.
35
+ Args:
36
+ For scorer = evaluate.load():
37
+ `pretrained_name` (str or pathlib.Path): summarization model checkpoint or path, loaded by transformers.AutoModelForSeq2SeqLM.from_pretrained(). Defaults to facebook/bart-large-cnn.
38
+ `tokenizer` (optional; use when your tokenizer cannot be loaded by from_pretrained()): a tokenizer compatible with transformers.PreTrainedTokenizer. It must provide tokenizer.pad_token|eos_token|bos_token and the tokenizer.__call__() method for HaRiM+ score computation.
39
+
40
+ For scorer.compute():
41
+ `predictions` (list of str): generated summaries
42
+ `references` (list of str): source articles to be summarized
43
+ `use_aggregator` (bool): if True, the average of the scores is returned
44
+
45
+ Returns:
46
+ 'results' (dict): {
47
+ 'harim+' (List[float] or float): HaRiM+ score to use,
48
+ 'harim' (List[float] or float): HaRiM term for computing the score above,
49
+ 'log_ppl' (List[float] or float): Length-normalized log-likelihood term, equivalent to BARTScore (Yuan et al., NeurIPS 2021),
50
+ 'lambda' (float): balancing coefficient for computing harim+ from harim and log_ppl (modifying it is not recommended).
51
+ }
52
+
53
+ Examples:
54
+ >>> summaries = ["hello there", "hello there"]
55
+ >>> articles = ["hello, this is the article to be summarized", "hello, this is the article to be summarized"]
56
+ >>> scorer = evaluate.load("NCSOFT/harim_plus") #, pretrained_name='PRETRAINEDNAME', tokenizer=TOKENIZER # optional
57
+ >>> results = scorer.compute(predictions=summaries, references=articles) # use_aggregator=True # optional
58
+ >>> print([round(v, 2) for v in results["harim+"]])
59
+ [0.4, 0.4]
60
+ """
61
+
62
+
63
+
64
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
65
+ class Harimplus(evaluate.Metric):
66
+ def __init__(self,
67
+ pretrained_name='facebook/bart-large-cnn',
68
+ tokenizer=None,
69
+ device='cuda',
70
+ **kwargs
71
+ ):
72
+ super().__init__(**kwargs)
73
+ self.myconfig = dict(
74
+ pretrained_name=pretrained_name,
75
+ tokenizer=tokenizer,
76
+ device=device,
77
+ )
78
+
79
+ def _info(self):
80
+ return evaluate.MetricInfo(
81
+ description=_DESCRIPTION,
82
+ citation=_CITATION,
83
+ homepage=CODEBASE_URL,
84
+ inputs_description=_KWARGS_DESCRIPTION,
85
+ features=datasets.Features(
86
+ {
87
+ "predictions": datasets.Value("string", id="sequence"),
88
+ "references": datasets.Value("string", id="sequence"),
89
+ }
90
+ ),
91
+ codebase_urls=[CODEBASE_URL],
92
+ reference_urls=[CODEBASE_URL, PAPER_URL],
93
+ )
94
+
95
+ def _download_and_prepare(self, dl_manager):
96
+ pretrained_name = self.myconfig['pretrained_name']
97
+ is_custom_tokenizer = self.myconfig['tokenizer'] is not None
98
+ logger.warning(
99
+ "Loading HaRiM+ score"
100
+ f"\tpretrained_name = {pretrained_name}"
101
+ )
102
+ if is_custom_tokenizer:
103
+ logger.warning(
104
+ f"tokenizer is overridden by \n\tself.myconfig['tokenizer']"
105
+ )
106
+ logger.warning(
107
+ "You can change checkpoints with the `pretrained_name` kwarg in evaluate.load. We strongly recommend using *-large or larger checkpoints. "
108
+ "Refrain from using checkpoints trained on noisy corpora such as BBC XSum.")
109
+
110
+ # download the model checkpoint specified in self.myconfig and set up the scorer
111
+ self.scorer = Harimplus_Scorer(**self.myconfig)
112
+
113
+ def _compute(self, predictions=None,
114
+ references=None,
115
+ use_aggregator=False,
116
+ bsz=32,
117
+ tokenwise_score=False):
118
+ summaries = predictions
119
+ articles = references
120
+ scores = self.scorer.compute(predictions=summaries, references=articles, use_aggregator=use_aggregator, bsz=bsz, tokenwise_score=tokenwise_score)
121
+ return scores
harim_scorer.py ADDED
@@ -0,0 +1,263 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ from transformers import (AutoModelForSeq2SeqLM,
4
+ AutoTokenizer,
5
+ PreTrainedTokenizer,
6
+ PreTrainedTokenizerFast)
7
+ import evaluate
8
+
9
+ from fire import Fire
10
+ import pandas as pd
11
+ from tqdm import tqdm
12
+ import json
13
+
14
+ from typing import List, Dict, Union
15
+ from collections import defaultdict
16
+ from functools import partial
17
+ from pprint import pprint
18
+
19
+ # from ipdb import set_trace  # debug-only; ipdb is not listed in requirements.txt
20
+
21
+ class Harimplus_Scorer:
22
+ def __init__(self,
23
+ pretrained_name:str='none',
24
+ tokenizer:Union[PreTrainedTokenizer, PreTrainedTokenizerFast]=None,
25
+ mixing_factor:float=7., # same as lambda in the paper
26
+ device:str='cuda',
27
+
28
+ src_maxlen=1024,
29
+ tgt_maxlen=110,
30
+ ):
31
+ self._pretrained_name = pretrained_name
32
+ self._lambda = mixing_factor
33
+
34
+ self._device = torch.device(device if torch.cuda.is_available() else 'cpu')
35
+ self._encdec_model = AutoModelForSeq2SeqLM.from_pretrained(self._pretrained_name)
36
+ if tokenizer is None:
37
+ self._tokenizer = AutoTokenizer.from_pretrained(self._pretrained_name)
38
+ else:
39
+ self._tokenizer = tokenizer
40
+ self._encdec_model.to(self._device)
41
+ self._encdec_model.eval()
42
+
43
+ self._src_maxlen = src_maxlen
44
+ self._tgt_maxlen = tgt_maxlen
45
+
46
+
47
+
48
+ def _prep_input(self, src_tgt_txts, src_or_tgt='src'):
49
+ L = self._src_maxlen if src_or_tgt=='src' else self._tgt_maxlen
50
+ if isinstance(src_tgt_txts, pd.Series):
51
+ src_tgt_txts=src_tgt_txts.tolist()
52
+ if src_or_tgt == 'src':
53
+ src_tgt_txts = [ s.replace("\n", " ") for s in src_tgt_txts ]
54
+ return self._tokenizer(src_tgt_txts, padding=True, truncation=True, max_length=L, return_tensors='pt') # ModelInput dataclass
55
+
56
+
57
+ '''Below are helper functions with no dependency on self, but kept inside the class for ease of use.'''
58
+ def likelihoods(self, logits, force_decode_indices, tgt_mask):
59
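+ # softmax over the vocabulary, then gather the probability assigned to each
+ # teacher-forced (reference) token; tgt_mask zeroes out padded positions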
+ probs = F.softmax(logits, dim=-1)
60
+ probs_force_decode_ = probs.gather(-1, force_decode_indices.unsqueeze(-1)).squeeze(-1)
61
+ probs_force_decode= probs_force_decode_ * tgt_mask
62
+ assert probs_force_decode.shape == force_decode_indices.shape
63
+ return probs_force_decode
64
+
65
+ def log_likelihoods(self, logits, force_decode_indices, tgt_mask):
66
+ ll = F.log_softmax(logits, dim=-1)
67
+ ll_force_decode_ = ll.gather(-1, force_decode_indices.unsqueeze(-1)).squeeze(-1)
68
+ ll_force_decode = ll_force_decode_ * tgt_mask
69
+
70
+ return ll_force_decode
71
+
72
+ def harim(self, s2s_logits, lm_logits, force_decode_indices, tgt_mask ):
73
+ p_s2s, p_lm = self.likelihoods(s2s_logits, force_decode_indices, tgt_mask), \
74
+ self.likelihoods(lm_logits, force_decode_indices, tgt_mask)
75
+
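+ # delta: per-token likelihood gap between source-conditioned (s2s) and source-free (lm) decoding;
+ # margin_linear rescales delta from [-1, 1] into [0, 1];
+ # harim = 1 - (1 - p_s2s) * margin_linear, i.e. -1 * per-token hallucination risk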
76
+ delta = p_s2s - p_lm
77
+ margin_linear = (1-delta) / 2
78
+ harim = -(1-p_s2s) * margin_linear + 1
79
+ return harim # this is -1 * hallucination risk
80
+
81
+ def make_minibatches(self, exs:List[str], bsz:int=32):
82
+ idx=0
83
+ minibatches = []
84
+ while True:
85
+ start = idx
86
+ end = idx+bsz
87
+ if start >= len(exs):
88
+ break
89
+
90
+ minibatches.append( exs[start:end] )
91
+ idx += bsz
92
+ return minibatches
93
+
94
+ def make_empty_minibatches(self, minibatches:List[List[str]]):
95
+ e_minibatches = minibatches.copy()
96
+ for i, mb in enumerate(e_minibatches):
97
+ e_minibatches[i] = ['' for ex in mb]
98
+ return e_minibatches
99
+
100
+
101
+ def compute(self, predictions:List[str],
102
+ references:List[str],
103
+ bsz:int=32,
104
+ use_aggregator:bool=False,
105
+ tokenwise_score:bool=False,
106
+ ):
107
+ '''
108
+ returns harim+ score (List[float]) for predictions (summaries) and references (articles)
109
+ **Note**
110
+ - here, predictions = generated summaries to be evaluated and references = source articles to be summarized (the kwarg is named "references" to follow the evaluate convention)
111
+ - log_ppl is equivalent to BARTScore (Yuan et al., NeurIPS 2021)
112
+
113
+ if tokenwise_score:
114
+ returns minibatch-chunked token-level harim+ scores and log-likelihoods along with tokenized predictions (List[str])
115
+ if use_aggregator:
116
+ the returned scores are aggregated (mean) over the given test set
117
+ '''
118
+
119
+
120
+ # tokenize/prep src/tgts
121
+ make_minibatches_bsz = partial(self.make_minibatches, bsz=bsz)
122
+ b_srcs, b_tgts = map(make_minibatches_bsz, [references, predictions])  # articles are the encoder source (src); summaries are the force-decoded targets (tgt)
123
+ b_emps = self.make_empty_minibatches(b_srcs)
124
+
125
+ scores=defaultdict(list)
126
+ for mini_s, mini_e, mini_t in tqdm(zip(b_srcs, b_emps, b_tgts), total=len(b_tgts), desc=f"computing HaRiM+ {bsz=}, core={self._pretrained_name}"):
127
+ src_in = self._prep_input(mini_s, src_or_tgt='src')
128
+ emp_in = self._prep_input(mini_e, src_or_tgt='src')
129
+ tgt_in = self._prep_input(mini_t, src_or_tgt='tgt')
130
+ if emp_in.input_ids.shape[-1]==0: # emp_in.input_ids.shape == (32,0)
131
+ boseos = f"{self._tokenizer.bos_token}{self._tokenizer.eos_token}"
132
+ mini_e_ = [boseos for _ in range(len(mini_e))]
133
+ emp_in = self._prep_input( mini_e_, src_or_tgt='src' )
134
+
135
+ # if mini_s == b_srcs[0]:
136
+ # normal = src_in
137
+ # if mini_s == b_srcs[-1]:
138
+ # trailing = src_in
139
+ # set_trace()
140
+
141
+ src_in.data['labels'] = tgt_in.input_ids
142
+ emp_in.data['labels'] = tgt_in.input_ids
143
+ # print(f"{emp_in.data['labels']=}")
144
+ # set_trace()
145
+ tgt_mask = tgt_in.attention_mask
146
+
147
+ assert (tgt_in.attention_mask == (tgt_in.input_ids != self._tokenizer.pad_token_id)).all()
148
+ # src_in.data['decoder_input_ids'] = tgt_in.input_ids
149
+ # src_in.data['decoder_attention_mask'] = tgt_in.attention_mask
150
+ src_in = src_in.to(self._device)
151
+ emp_in = emp_in.to(self._device)
152
+ tgt_in = tgt_in.to(self._device)
153
+ tgt_mask = tgt_mask.to(self._device)
154
+
155
+ with torch.no_grad():
156
+ # token_type_ids attribute causes error
157
+ s2s_logits = self._encdec_model.forward(
158
+ input_ids = src_in.input_ids,
159
+ attention_mask = src_in.attention_mask,
160
+ labels = tgt_in.input_ids,
161
+ # decoder_input_ids = tgt_in.input_ids,
162
+ # decoder_attention_mask = tgt_in.attention_mask,
163
+ return_dict=True).logits
164
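+ # second forward pass with a (near-)empty source: the decoder then behaves like a
+ # language model over the summary, giving the source-free likelihoods for the HaRiM term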
+ lm_logits = self._encdec_model.forward(
165
+ input_ids = emp_in.input_ids,
166
+ attention_mask = emp_in.attention_mask,
167
+ labels = tgt_in.input_ids,
168
+ # decoder_input_ids = tgt_in.input_ids,
169
+ # decoder_attention_mask = tgt_in.attention_mask,
170
+ return_dict=True).logits
171
+ sent_lengths = tgt_mask.sum(-1)
172
+ ll_tok = self.log_likelihoods(s2s_logits, src_in.labels, tgt_mask)
173
+ ll = ll_tok.sum(-1) / sent_lengths
174
+
175
+ harim_tok = self.harim(s2s_logits, lm_logits, src_in.labels, tgt_mask)
176
+ harim = harim_tok.sum(-1) / sent_lengths
177
+
178
+ harim_plus_normalized = ll + self._lambda * harim # loglikelihood + lambda * negative_harim (negative harim=-1* risk)
179
+
180
+ scores['harim+'].extend(harim_plus_normalized.tolist())
181
+ scores['harim'].extend(harim.tolist())
182
+ scores['log_ppl'].extend(ll.tolist())
183
+
184
+ if tokenwise_score:
185
+ scores['tok_harim+'].append(harim_tok*self._lambda + ll_tok)
186
+ scores['tok_predictions'].append( [self._tokenizer.convert_ids_to_tokens(idxs) for idxs in src_in.labels] )
187
+
188
+ if use_aggregator: # aggregate after all minibatches are scored
189
+ for k, v in scores.items():
190
+ if not k.startswith('tok_'):
191
+ scores[k] = sum(v)/len(v) # aggregate (mean)
192
+ scores['lambda'] = self._lambda
193
+ return scores
194
+
195
+
196
+
197
+ def test(bsz = 16, pretrained_name='facebook/bart-large-cnn', tokenizer=None):
198
+ if tokenizer is None:
199
+ scorer = Harimplus_Scorer(pretrained_name=pretrained_name)
200
+ else:
201
+ scorer = Harimplus_Scorer(pretrained_name=pretrained_name, tokenizer=tokenizer)
202
+
203
+ art1 = """The respected law professor from Philadelphia now being investigated after allegedly emailing students a link to pornographic footage, was once a contestant on Who Wants to Be a Millionaire, it has emerged. Lisa McElroy, a 50-year-old Drexel professor, appeared on the show in 2010 while it was still hosted my Meredith Vieira. And like her apparent March 31 email mishap, her game show appearance ended with a very public mistake. McElroy, who teaches legal writing, got tripped up on the $12,500 level after flying through the first few questions, notes Philly.com. Wishes she was a millionaire: Drexel law profesor professor Lisa McElroy allegedly sent a link to a pornographic website to her students. In 2010, she appeared on the TV game show Who Wants to Be a Milionaire Mother of two: The mother of two shared an anecdote with then-host Meredith Vieira about having to scramble to find a babysitter for her kids and someone to teach her class after learning she was to appear on the show just two days before taping Lost it: McElroy was tripped up on the $12,500 question. Despite having used two lifelines, she answered wrong and walked away with around $5,000 The questions read: 'As a result of General Motor’s bankruptcy declaration in 2009, what foreign government became one of its largest shareholders?' Even after using two of her lifelines to narrow down the answer, McElroy answered China, which was incorrect. The correct answer was Canada. She walked away with around $5,000. McElroy, who is a children's book and biography author, is apparently also a mother. She opened the appearance by sharing an anecdote with Vieira about having to scramble to find a babysitter after being informed she was chosen to be on Millionaire jsut two days prior to taping. She's accused of sending the inappropriate message this past March 31 under the subject line: 'Great article on writing briefs.' However, when recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act'. Lisa McElroy, 50, who teaches legal writing at Drexel University, reportedly sent the inappropriate message on March 31 baring the subject line: 'Great article on writing briefs' Following a number of complaints, the college issued an apology to students. The message read: 'As you may be aware, some students erroneously received an email this morning directing them to a... post that included some inappropriate material. 'We take this matter seriously and apologize for any upset it may have caused.' The university says federal law requires it investigate all reports of inappropriate behaviors of a sexual nature. McElroy did not immediately respond to an email sent to her university account by the Associated Press. When recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act' It's not the first time the married mother-of-two has appeared in the spotlight. She is also an accomplished author with a number of published biographies and children's books. On her website, www.lisamcelroy.com, she describes herself as a 'Supreme Court junkie.' She adds that her favorites ways of relaxing include 'crawling under the covers with a dog or two and a really good book' or 'hanging out' with her two adolescent daughters. Regarding the recent email scandal, David Lat - a lawyer and legal commenter -suggests she could have been 'hacked' or made a 'copy/paste error'. 
While an internal investigation gets underway, it's been reported that McElroy has been placed on administrative leave. While an internal investigation gets underway, it's been reported that McElroy has been placed on administrative leave from Drexel University (seen above)"""
204
+ art2 = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose 5-1 drubbing of Spain in the group stage in Brazil last year marked the end of the Iberian nation's six-year domination of the world game, scored two early goals at the Amsterdam Arena and held on against some determined Spain pressure in the second half for a 2-0 success. They became the first team to inflict two defeats on Del Bosque since he took over in 2008 but the gruff 64-year-old had used the match to try out several new faces and he fielded a largely experimental, second-string team. Stefan de Vrij (right) headed Holland in front against Spain at the Amsterdam Arena on Tuesday Gerard Pique (left) could do nothing to stop Davy Klaassen doubling the Dutch advantage Malaga forward Juanmi and Sevilla midfielder Vitolo became the 55th and 56th players to debut under Del Bosque, while the likes of goalkeeper David de Gea, defenders Raul Albiol, Juan Bernat and Dani Carvajal and midfielder Mario Suarez all started the game. 'The national team's state of health is good,' centre back Gerard Pique told reporters. 'We are in a process where players are coming into the team and gathering experience,' added the Barcelona defender. 'We are second in qualifying (for Euro 2016) and these friendly games are for experimenting. 'I am not that worried about this match because we lost friendlies in previous years and then ended up winning titles.' David de Gea was given a start by Vicente del Bosque but could not keep out De Vrij's header here Dani Carvajal (centre) was another squad player given a chance to impress against Holland Del Bosque will be confident he can find the right mix of players to secure Spain's berth at Euro 2016 in France next year, when they will be chasing an unprecedented third straight title. Slovakia are the surprise leaders in qualifying Group C thanks to a 2-1 win over Spain in Zilina in October and have a maximum 15 points from five of 10 matches. Spain are second on 12 points, three ahead of Ukraine, who they beat 1-0 in Seville on Friday. Del Bosque's side host Slovakia in September in a match that could decide who goes through to the finals as group winners. 'The team is in good shape,' forward Pedro told reporters. 'We have a very clear idea of our playing style and we are able to count on people who are gradually making a place for themselves in the team.'"""
205
+
206
+ summaries = [
207
+ "lisa mcelroy , 50 , who teaches legal writing at drexel university , reportedly sent the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # reference 2.3270
208
+ "lisa mcelroy, a 50-year-old drexel professor, appeared on the show in 2010 while it was still hosted my meredith vieira. she's accused of sending the inappropriate message this past march 31 under the subject line: 'great article on writing briefs' when recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act' the married mother-of-two has been placed on administrative leave.", # BART-large+cnn 4.9714
209
+ "lisa mcelroy , 50 , who teaches legal writing at drexel university , appeared on the show in 2010 while it was still hosted my meredith vieira . she got tripped up on the $ 12,500 level after flying through the first few questions , philly.com reports . mcelroy answered wrong and walked away with around $ 5,000 .", # BERTSUM=Factual 3.2028
210
+
211
+ "lisa mcelroy , 50 , who teaches legal writing at philadelphia university , reportedly sent the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # wrong subj (philadelphia) 2.2122
212
+ "lisa mcelroy , 50 , who teaches legal writing at drexel university , reportedly did not send the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # negation 2.2022
213
+
214
+ "holland beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference
215
+ "holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque will be hoping to find the right mix of players to the world cup . gerard pique could make the right mix of players to the tournament .",# summary (factuality = 0, rnn)
216
+ "del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference + wrong subj
217
+ "holland could not beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference + negation
218
+
219
+
220
+
221
+ ]
222
+ articles = [ art1 ]*5 + [art2 ]*4
223
+ # set_trace()
224
+ hp_score = scorer.compute(predictions=summaries, references=articles, use_aggregator=False, bsz=bsz)
225
+ # pprint(f"{articles=}")
226
+ # pprint(f"{summaries=}")
227
+ pprint(hp_score)
228
+
229
+
230
+
231
+ '''
232
+ ## drexel example
233
+ # reference 2.3270
234
+ # BART-large+cnn 4.9714
235
+ # BERTSUM=Factual 3.2028
236
+ # ref + wrong subj (philadelphia) 2.2122
237
+ # ref + negation 2.2022
238
+
239
+
240
+ 'harim+': [1.6270232200622559,
241
+ 1.7585878372192383,
242
+ 1.3859858512878418,
243
+ 1.5434350967407227,
244
+ 1.609492301940918],
245
+
246
+
247
+
248
+ ## main table result
249
+
250
+ 1.6247 (reference, factual)
251
+ 0.1173 (rnn, unfactual)
252
+ 1.3229 (ref + wrong subj)
253
+ 1.4132 (ref + negation)
254
+
255
+ 'harim+': [1.8230078220367432,
256
+ 1.5361897945404053,
257
+ 1.806436538696289,
258
+ 1.7360382080078125],
259
+
260
+ '''
261
+
262
+ if __name__ == '__main__':
263
+ Fire(test)
requirements.txt ADDED
@@ -0,0 +1,56 @@
1
+ aiohttp==3.8.3
2
+ aiosignal==1.3.1
3
+ asttokens==2.1.0
4
+ async-timeout==4.0.2
5
+ attrs==22.1.0
6
+ backcall==0.2.0
7
+ certifi==2022.9.24
8
+ charset-normalizer==2.1.1
9
+ datasets==2.6.1
10
+ decorator==5.1.1
11
+ dill==0.3.5.1
12
+ evaluate==0.3.0
13
+ executing==1.2.0
14
+ filelock==3.8.0
15
+ fire==0.4.0
16
+ frozenlist==1.3.3
17
+ fsspec==2022.10.0
18
+ huggingface-hub==0.10.1
19
+ idna==3.4
20
+ ipython==8.6.0
21
+ jedi==0.18.1
22
+ matplotlib-inline==0.1.6
23
+ multidict==6.0.2
24
+ multiprocess==0.70.13
25
+ numpy==1.23.4
26
+ packaging==21.3
27
+ pandas==1.5.1
28
+ parso==0.8.3
29
+ pexpect==4.8.0
30
+ pickleshare==0.7.5
31
+ prompt-toolkit==3.0.32
32
+ ptyprocess==0.7.0
33
+ pure-eval==0.2.2
34
+ pyarrow==10.0.0
35
+ Pygments==2.13.0
36
+ pyparsing==3.0.9
37
+ python-dateutil==2.8.2
38
+ pytz==2022.6
39
+ PyYAML==6.0
40
+ regex==2022.10.31
41
+ requests==2.28.1
42
+ responses==0.18.0
43
+ six==1.16.0
44
+ stack-data==0.6.0
45
+ termcolor==2.1.0
46
+ tokenizers==0.13.2
47
+ toml==0.10.2
48
+ torch==1.12.1+cu113
49
+ tqdm==4.64.1
50
+ traitlets==5.5.0
51
+ transformers==4.24.0
52
+ typing_extensions==4.4.0
53
+ urllib3==1.26.12
54
+ wcwidth==0.2.5
55
+ xxhash==3.1.0
56
+ yarl==1.8.1