DeDeckerThomas committed
Commit 47bf85b · Parent(s): 1db7ffa
Update README.md

README.md CHANGED
@@ -121,14 +121,65 @@ For more in detail information, you can take a look at the training notebook (li
| Early Stopping Patience | 1 |

### Preprocessing
The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only things left to do are tokenization and joining all keyphrases into one string with a separator of choice (`;` here).
```python
# `tokenizer`, `dataset_document_column` and `keyphrase_sep_token` are
# defined earlier in the training notebook.
def pre_process_keyphrases(text_ids, kp_list):
    # Order the keyphrases by their first occurrence in the decoded text;
    # keyphrases that do not occur literally in the text are "absent".
    kp_order_list = []
    kp_set = set(kp_list)
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    text = text.lower()
    for kp in kp_set:
        kp = kp.strip()
        kp_index = text.find(kp.lower())
        kp_order_list.append((kp_index, kp))
    kp_order_list.sort()
    present_kp, absent_kp = [], []
    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp


def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = pre_process_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        # Present keyphrases come first, in order of appearance, followed by
        # the absent ones.
        keyphrases = present_kp
        keyphrases += absent_kp
        target_text = f" {keyphrase_sep_token} ".join(keyphrases)
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                target_text, max_length=40, padding="max_length", truncation=True
            )
            # Replace padding token ids in the labels by -100 so they are
            # ignored by the loss.
            targets["input_ids"] = [
                (t if t != tokenizer.pad_token_id else -100)
                for t in targets["input_ids"]
            ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples
```
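To actually run this preprocessing, it is typically mapped over the dataset in batches. Below is a minimal sketch, not taken from this README: the dataset name, checkpoint, and separator are illustrative assumptions and may differ from the ones used for this model.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative assumptions, not necessarily this model's actual setup.
dataset = load_dataset("midas/inspec", "generation")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
keyphrase_sep_token = ";"
dataset_document_column = "document"

# Apply the preprocessing to every split in batches.
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```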
### Postprocessing
For the post-processing, you will need to split the string based on the keyphrase separator.

```python
def extract_keyphrases(examples):
    # Split each generated string on the separator token to recover the
    # individual keyphrases.
    return [example.split(keyphrase_sep_token) for example in examples]
```
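A hypothetical usage sketch (the `model` variable and generation settings are assumptions, not part of this README): decode the generated ids, then split them. Because the targets join keyphrases with spaces around the separator, stripping each item afterwards is usually needed.

```python
# Hypothetical usage; `model` is assumed to be a loaded seq2seq model.
inputs = tokenizer("Keyphrase generation is ...", return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
keyphrases = extract_keyphrases(decoded)
# The separator is padded with spaces in the targets, so strip each item.
keyphrases = [[kp.strip() for kp in kps] for kps in keyphrases]
```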

## 📝 Evaluation results

One of the traditional evaluation methods is precision, recall and F1-score @k,m, where k means that only the first k predicted keyphrases are considered and m stands for the average number of predicted keyphrases.
|