DeDeckerThomas committed on
Commit
e2808d7
·
1 Parent(s): 23eccc5

Update README.md

  1. README.md +39 -1
README.md CHANGED
@@ -155,7 +155,11 @@ You can find more information here: https://huggingface.co/datasets/midas/inspec
155
  ## 👷‍♂️ Training procedure
156
  For more detailed information, see the training notebook (link incoming).
157
 
158
- ### Preprocessing
159
  The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realignment of the labels so that they correspond to the right subword tokens.
160
  ```python
161
  def preprocess_fuction(all_samples_per_split):
@@ -192,6 +196,40 @@ def preprocess_fuction(all_samples_per_split):
192
  tokenized_samples["labels"] = total_adjusted_labels
193
  return tokenized_samples
194
  ```
195
  ## 📝Evaluation results
196
 
197
  One of the traditional evaluation methods is precision, recall, and F1-score @k,m, where k is the number of top predicted keyphrases considered and m is the average number of predicted keyphrases.
 
155
  ## 👷‍♂️ Training procedure
156
  For more detailed information, see the training notebook (link incoming).
157
 
158
+ ### Training parameters
159
+
160
+ | Parameter | Value |
161
+ | --------- | ------------------------------- |
162
+
163
  The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realignment of the labels so that they correspond to the right subword tokens.
164
  ```python
165
  def preprocess_fuction(all_samples_per_split):
 
196
  tokenized_samples["labels"] = total_adjusted_labels
197
  return tokenized_samples
198
  ```
199
+
200
+ ### Postprocessing
201
+ For post-processing, keep only the tokens labeled B or I, then concatenate each B token with the I tokens that follow it to form a keyphrase. Finally, strip each keyphrase to remove leading and trailing whitespace.
+ ```python
+ # Define post_process functions
+ def concat_tokens_by_tag(keyphrases):
+     keyphrase_tokens = []
+     for id, label in keyphrases:
+         if label == "B":
+             keyphrase_tokens.append([id])
+         elif label == "I":
+             if len(keyphrase_tokens) > 0:
+                 keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
+     return keyphrase_tokens
+
+
+ def extract_keyphrases(example, predictions, tokenizer, index=0):
+     keyphrases_list = [
+         (id, idx2label[label])
+         for id, label in zip(
+             np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
+         )
+         if idx2label[label] in ["B", "I"]
+     ]
+
+     processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
+     extracted_kps = tokenizer.batch_decode(
+         processed_keyphrases,
+         skip_special_tokens=True,
+         clean_up_tokenization_spaces=True,
+     )
+     return np.unique([kp.strip() for kp in extracted_kps])
+
+ ```
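To see the grouping step in isolation, the sketch below runs the same B/I concatenation logic on hand-written `(token id, label)` pairs. The token ids are made up for illustration and no tokenizer is involved; the function body mirrors `concat_tokens_by_tag` from the added README section, with the loop variable renamed to avoid shadowing the built-in `id`.

```python
def concat_tokens_by_tag(keyphrases):
    """Group (token_id, label) pairs into one list of token ids per keyphrase."""
    keyphrase_tokens = []
    for token_id, label in keyphrases:
        if label == "B":
            # A B label starts a new keyphrase.
            keyphrase_tokens.append([token_id])
        elif label == "I":
            # An I label extends the most recent keyphrase, if there is one;
            # an I with no preceding B is silently dropped.
            if keyphrase_tokens:
                keyphrase_tokens[-1].append(token_id)
    return keyphrase_tokens


# Hypothetical token ids; the leading "I" has no "B" before it.
tagged = [(7, "I"), (101, "B"), (102, "I"), (205, "B")]
groups = concat_tokens_by_tag(tagged)
print(groups)  # [[101, 102], [205]]
```

Note that `extract_keyphrases` discards every token whose label is not B or I before grouping, so an I token that followed a differently labeled token in the original sequence is attached to the previous keyphrase. The README does not say whether this is intended.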
233
  ## 📝Evaluation results
234
 
235
  One of the traditional evaluation methods is precision, recall, and F1-score @k,m, where k is the number of top predicted keyphrases considered and m is the average number of predicted keyphrases.
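As a rough sketch of how the @k scores can be computed for a single document, the example below assumes exact string matching after lowercasing. The README does not state the matching or stemming rules actually used, and the function name `precision_recall_f1_at_k` and the sample keyphrases are hypothetical.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Score the first k predicted keyphrases against the gold keyphrases.

    Assumption: a prediction counts as correct only if it matches a gold
    keyphrase exactly after lowercasing.
    """
    top_k = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    # Count each distinct correct prediction once.
    correct = sum(1 for p in set(top_k) if p in gold_set)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions (ranked) and gold keyphrases for one document.
predicted = ["keyphrase extraction", "neural network", "dataset", "transformer"]
gold = ["keyphrase extraction", "transformer", "sequence labeling"]
p, r, f1 = precision_recall_f1_at_k(predicted, gold, k=3)
print(p, r, f1)
```

With `k=3`, only one of the first three predictions matches a gold keyphrase, so precision, recall, and F1 all come out to one third.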