DeDeckerThomas committed on
Commit
47bf85b
1 Parent(s): 1db7ffa

Update README.md

Files changed (1):
  README.md +54 -3
README.md CHANGED
@@ -121,14 +121,65 @@ For more in detail information, you can take a look at the training notebook (li
  | Early Stopping Patience | 1 |

  ### Preprocessing
+ The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only thing left to do is tokenization and joining all keyphrases into one string with a separator of choice (;).
  ```python
-
+ def pre_process_keyphrases(text_ids, kp_list):
+     kp_order_list = []
+     kp_set = set(kp_list)
+     text = tokenizer.decode(
+         text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+     )
+     text = text.lower()
+     for kp in kp_set:
+         kp = kp.strip()
+         kp_index = text.find(kp.lower())
+         kp_order_list.append((kp_index, kp))
+     kp_order_list.sort()
+     present_kp, absent_kp = [], []
+     for kp_index, kp in kp_order_list:
+         if kp_index < 0:
+             absent_kp.append(kp)
+         else:
+             present_kp.append(kp)
+     return present_kp, absent_kp
+ 
+ def preprocess_function(samples):
+     processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
+     for i, sample in enumerate(samples[dataset_document_column]):
+         input_text = " ".join(sample)
+         inputs = tokenizer(
+             input_text,
+             padding="max_length",
+             truncation=True,
+         )
+         present_kp, absent_kp = pre_process_keyphrases(
+             text_ids=inputs["input_ids"],
+             kp_list=samples["extractive_keyphrases"][i]
+             + samples["abstractive_keyphrases"][i],
+         )
+         keyphrases = present_kp
+         keyphrases += absent_kp
+         target_text = f" {keyphrase_sep_token} ".join(keyphrases)
+         with tokenizer.as_target_tokenizer():
+             targets = tokenizer(
+                 target_text, max_length=40, padding="max_length", truncation=True
+             )
+         targets["input_ids"] = [
+             (t if t != tokenizer.pad_token_id else -100)
+             for t in targets["input_ids"]
+         ]
+         for key in inputs.keys():
+             processed_samples[key].append(inputs[key])
+         processed_samples["labels"].append(targets["input_ids"])
+     return processed_samples
  ```
-
  ### Postprocessing
+ For the post-processing, you will need to split the string based on the keyphrase separator.
  ```python
-
+ def extract_keyphrases(examples):
+     return [example.split(keyphrase_sep_token) for example in examples]
  ```
+
  ## 📝 Evaluation results

  One of the traditional evaluation methods is precision, recall and F1-score @k,m, where k stands for the first k predicted keyphrases and m for the average number of predicted keyphrases.
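To see how `preprocess_function` from the diff above would be wired up, here is a minimal usage sketch. It is not part of the commit: the checkpoint name, the dataset config, and the variable values below are assumptions standing in for the setup of the training notebook, which defines `tokenizer`, `keyphrase_sep_token`, and `dataset_document_column`.

```python
# Hypothetical setup; every name here is a stand-in, not from the commit.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # stand-in checkpoint
keyphrase_sep_token = ";"
dataset_document_column = "document"

# midas/inspec uses the extractive/abstractive keyphrase columns referenced
# above; the "raw" config name is an assumption.
dataset = load_dataset("midas/inspec", "raw")

# preprocess_function consumes batched samples (dicts of lists), hence batched=True.
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```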
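On the inference side, `extract_keyphrases` splits each decoded string back into a list of keyphrases. A hedged sketch, again with placeholder model and tokenizer names; since the targets were joined with `f" {keyphrase_sep_token} "`, the split pieces still carry surrounding spaces and are stripped afterwards.

```python
# Hypothetical inference sketch; the checkpoint is a stand-in.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
keyphrase_sep_token = ";"

text = "Keyphrase generation condenses a document into a short list of phrases."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

keyphrases = extract_keyphrases(decoded)
# The separator was joined as " ; ", so strip the leftover whitespace.
keyphrases = [[kp.strip() for kp in kps] for kps in keyphrases]
print(keyphrases)
```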
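For the @k metrics in that closing paragraph, this is one illustrative single-document implementation, not taken from the model card or the notebook:

```python
def precision_recall_f1_at_k(predicted, gold, k):
    # Score only the first k ranked predictions against the reference set.
    top_k = [kp.lower().strip() for kp in predicted[:k]]
    gold_set = {kp.lower().strip() for kp in gold}
    true_positives = sum(kp in gold_set for kp in top_k)
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two of the first three predictions match the reference keyphrases.
print(precision_recall_f1_at_k(
    ["keyphrase generation", "t5", "transformers"],
    ["keyphrase generation", "transformers", "inspec"],
    k=3,
))  # -> (0.666..., 0.666..., 0.666...)
```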