DeDeckerThomas committed
Commit 47bf85b · Parent(s): 1db7ffa
Update README.md

README.md CHANGED
@@ -121,14 +121,65 @@ For more in detail information, you can take a look at the training notebook (li
| Early Stopping Patience | 1 |

### Preprocessing
The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only things left to do are tokenization and joining all keyphrases into one string with a separator of choice (`;` here).
```python
# `tokenizer`, `dataset_document_column` and `keyphrase_sep_token` are
# defined earlier in the training notebook.
def pre_process_keyphrases(text_ids, kp_list):
    # Order the keyphrases by their first occurrence in the decoded text;
    # keyphrases that do not occur literally in the text are "absent".
    kp_order_list = []
    kp_set = set(kp_list)
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    text = text.lower()
    for kp in kp_set:
        kp = kp.strip()
        kp_index = text.find(kp.lower())
        kp_order_list.append((kp_index, kp))
    kp_order_list.sort()
    present_kp, absent_kp = [], []
    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp


def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = pre_process_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        # Present keyphrases come first, in order of appearance, followed by
        # the absent ones.
        keyphrases = present_kp
        keyphrases += absent_kp
        target_text = f" {keyphrase_sep_token} ".join(keyphrases)
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                target_text, max_length=40, padding="max_length", truncation=True
            )
            # Replace padding token ids in the labels by -100 so they are
            # ignored by the loss.
            targets["input_ids"] = [
                (t if t != tokenizer.pad_token_id else -100)
                for t in targets["input_ids"]
            ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples
```
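To actually run this preprocessing, it is typically mapped over the dataset in batches. Below is a minimal sketch, not taken from this README: the dataset name, checkpoint, and separator are illustrative assumptions and may differ from the ones used for this model.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative assumptions, not necessarily this model's actual setup.
dataset = load_dataset("midas/inspec", "generation")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
keyphrase_sep_token = ";"
dataset_document_column = "document"

# Apply the preprocessing to every split in batches.
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```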
### Postprocessing
For the post-processing, you will need to split the string based on the keyphrase separator.

```python
def extract_keyphrases(examples):
    # Split each generated string on the separator token to recover the
    # individual keyphrases.
    return [example.split(keyphrase_sep_token) for example in examples]
```
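A hypothetical usage sketch (the `model` variable and generation settings are assumptions, not part of this README): decode the generated ids, then split them. Because the targets join keyphrases with spaces around the separator, stripping each item afterwards is usually needed.

```python
# Hypothetical usage; `model` is assumed to be a loaded seq2seq model.
inputs = tokenizer("Keyphrase generation is ...", return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
keyphrases = extract_keyphrases(decoded)
# The separator is padded with spaces in the targets, so strip each item.
keyphrases = [[kp.strip() for kp in kps] for kps in keyphrases]
```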

## 📝 Evaluation results

One of the traditional evaluation methods is precision, recall and F1-score @k,m, where k means that only the first k predicted keyphrases are considered and m stands for the average number of predicted keyphrases.
|