DeDeckerThomas committed on
Commit
a4f27ed
·
1 Parent(s): 1f32179

Update README.md

  1. README.md +76 -27
README.md CHANGED
@@ -9,7 +9,17 @@ datasets:
9
  metrics:
10
  - seqeval
11
  widget:
12
- - text: "Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a text. Since this is a time-consuming process, Artificial Intelligence is used to automate it. Currently, classical machine learning methods, that use statistics and linguistics, are widely used for the extraction process. The fact that these methods have been widely used in the community has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP, transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics and context of a document, which is quite an improvement."
13
  example_title: "Example 1"
14
  - text: "FoodEx is the largest trade exhibition for food and drinks in Asia, with about 70,000 visitors checking out the products presented by hundreds of participating companies. I was lucky to enter as press; otherwise, visitors must be affiliated with the food industry— and pay ¥5,000 — to enter. The FoodEx menu is global, including everything from cherry beer from Germany and premium Mexican tequila to top-class French and Chinese dumplings. The event was a rare chance to try out both well-known and exotic foods and even see professionals making them. In addition to booths offering traditional Japanese favorites such as udon and maguro sashimi, there were plenty of innovative twists, such as dorayaki , a sweet snack made of two pancakes and a red-bean filling, that came in coffee and tomato flavors. While I was there I was lucky to catch the World Sushi Cup Japan 2013, where top chefs from around the world were competing … and presenting a wide range of styles that you would not normally see in Japan, like the flower makizushi above."
15
  example_title: "Example 2"
@@ -23,18 +33,23 @@ model-index:
23
  type: midas/kptimes
24
  name: kptimes
25
  metrics:
26
- - type: seqeval
27
  value: 0.539
28
- name: F1-score
 
 
 
29
  ---
30
- # 🔑 Keyphrase Extraction model: distilbert-kptimes
31
- Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a text. Since this is a time-consuming process, Artificial Intelligence is used to automate it. Currently, classical machine learning methods, that use statistics and linguistics, are widely used for the extraction process. The fact that these methods have been widely used in the community has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP, transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics and context of a document, which is quite an improvement.
 
 
32
 
33
 
34
  ## 📓 Model Description
35
- This model is a fine-tuned distilbert model on the KPTimes dataset. More information can be found here: https://huggingface.co/distilbert-base-uncased.
36
 
37
- The model is fine-tuned as a token classification problem where the text is labeled using the BIO scheme.
38
 
39
  | Label | Description |
40
  | ----- | ------------------------------- |
@@ -42,14 +57,14 @@ The model is fine-tuned as a token classification problem where the text is labe
42
  | I-KEY | Inside a keyphrase |
43
  | O | Outside a keyphrase |
44
 
45
- ## ✋ Intended uses & limitations
46
  ### 🛑 Limitations
47
  * This keyphrase extraction model is very domain-specific and will perform very well on news articles from NY Times. It's not recommended to use this model for other domains, but you are free to test it out.
48
  * Limited number of predicted keyphrases.
49
  * Only works for English documents.
50
- * For a custom model, please consult the training notebook for more information (link incoming).
51
 
52
- ### ❓ How to use
53
  ```python
54
  from transformers import (
55
  TokenClassificationPipeline,
@@ -72,7 +87,7 @@ class KeyphraseExtractionPipeline(TokenClassificationPipeline):
72
  def postprocess(self, model_outputs):
73
  results = super().postprocess(
74
  model_outputs=model_outputs,
75
- aggregation_strategy=AggregationStrategy.SIMPLE,
76
  )
77
  return np.unique([result.get("word").strip() for result in results])
78
 
@@ -86,20 +101,27 @@ extractor = KeyphraseExtractionPipeline(model=model_name)
86
  ```python
87
  # Inference
88
  text = """
89
- Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a text.
90
- Since this is a time-consuming process, Artificial Intelligence is used to automate it.
91
- Currently, classical machine learning methods, that use statistics and linguistics,
92
- are widely used for the extraction process. The fact that these methods have been widely used in the community
93
- has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP,
94
- transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics
95
- and context of a document, which is quite an improvement.
96
- """.replace(
97
- "\n", ""
98
- )
 
 
 
 
 
 
99
 
100
  keyphrases = extractor(text)
101
 
102
  print(keyphrases)
 
103
  ```
104
 
105
  ```
@@ -113,7 +135,7 @@ KPTimes is a keyphrase extraction/generation dataset consisting of 279,923 news
113
  You can find more information here: https://huggingface.co/datasets/midas/kptimes
114
 
115
  ## 👷‍♂️ Training procedure
116
- For more in detail information, you can take a look at the training notebook (link incoming).
117
 
118
  ### Training parameters
119
 
@@ -125,7 +147,26 @@ For more in detail information, you can take a look at the training notebook (li
125
 
126
  ### Preprocessing
127
  The documents in the dataset are already preprocessed into a list of words with the corresponding labels. The only remaining steps are tokenization and realignment of the labels so that they correspond to the right subword tokens.
 
128
  ```python
129
  def preprocess_fuction(all_samples_per_split):
130
  tokenized_samples = tokenizer.batch_encode_plus(
131
  all_samples_per_split[dataset_document_column],
@@ -159,10 +200,17 @@ def preprocess_fuction(all_samples_per_split):
159
  total_adjusted_labels.append(adjusted_label_ids)
160
  tokenized_samples["labels"] = total_adjusted_labels
161
  return tokenized_samples
 
 
 
 
 
 
 
162
  ```
163
 
164
- ### Postprocessing
165
- For the post-processing, you will need to filter out the B and I labeled tokens and concat the consecutive Bs and Is. As last you strip the keyphrase to ensure all spaces are removed.
166
  ```python
167
  # Define post_process functions
168
  def concat_tokens_by_tag(keyphrases):
@@ -194,16 +242,17 @@ def extract_keyphrases(example, predictions, tokenizer, index=0):
194
  return np.unique([kp.strip() for kp in extracted_kps])
195
 
196
  ```
197
- ## 📝 Evaluation results
198
 
199
- One of the traditional evaluation methods is the precision, recall and F1-score @K,M where k is the number that stands for the first K predicted keyphrases and M for the average amount of predicted keyphrases.
 
 
200
  The model achieves the following results on the KPTimes test set:
201
 
202
  | Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
203
  |:-----------------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|
204
  | KPTimes Test Set | 0.19 | 0.36 | 0.23 | 0.10 | 0.37 | 0.15 | 0.35 | 0.37 | 0.33 |
205
 
206
- For more information on the evaluation process, you can take a look at the keyphrase extraction evaluation notebook.
207
 
208
  ## 🚨 Issues
209
  Please feel free to start discussions in the Community Tab.
 
9
  metrics:
10
  - seqeval
11
  widget:
12
+ - text: "Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document.
13
+ Thanks to these keyphrases humans can understand the content of a text very quickly and easily without reading
14
+ it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail
15
+ and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents,
16
+ this process can take a lot of time.
17
+
18
+ Here is where Artificial Intelligence comes in. Currently, classical machine learning methods, that use statistical
19
+ and linguistic features, are widely used for the extraction process. Now with deep learning, it is possible to capture
20
+ the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency,
21
+ occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies
22
+ and context of words in a text."
23
  example_title: "Example 1"
24
  - text: "FoodEx is the largest trade exhibition for food and drinks in Asia, with about 70,000 visitors checking out the products presented by hundreds of participating companies. I was lucky to enter as press; otherwise, visitors must be affiliated with the food industry— and pay ¥5,000 — to enter. The FoodEx menu is global, including everything from cherry beer from Germany and premium Mexican tequila to top-class French and Chinese dumplings. The event was a rare chance to try out both well-known and exotic foods and even see professionals making them. In addition to booths offering traditional Japanese favorites such as udon and maguro sashimi, there were plenty of innovative twists, such as dorayaki , a sweet snack made of two pancakes and a red-bean filling, that came in coffee and tomato flavors. While I was there I was lucky to catch the World Sushi Cup Japan 2013, where top chefs from around the world were competing … and presenting a wide range of styles that you would not normally see in Japan, like the flower makizushi above."
25
  example_title: "Example 2"
 
33
  type: midas/kptimes
34
  name: kptimes
35
  metrics:
36
+ - type: F1 (Seqeval)
37
  value: 0.539
38
+ name: F1 (Seqeval)
39
+ - type: F1@M
40
+ value: 0.328
41
+ name: F1@M
42
  ---
43
+ # 🔑 Keyphrase Extraction Model: distilbert-kptimes
44
+ Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that this process takes a lot of time when you work with many documents ⏳.
45
+
46
+ Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.
47
 
48
 
49
  ## 📓 Model Description
50
+ This model uses [DistilBERT](https://huggingface.co/distilbert-base-uncased) as its base model and fine-tunes it on the [KPTimes dataset](https://huggingface.co/datasets/midas/kptimes).
51
 
52
+ Keyphrase extraction models are transformer models fine-tuned for token classification, where each word in the document is classified as being part of a keyphrase or not.
53
 
54
  | Label | Description |
55
  | ----- | ------------------------------- |
 
57
  | I-KEY | Inside a keyphrase |
58
  | O | Outside a keyphrase |
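
As a rough illustration of how the labels in this table turn into keyphrases, the sketch below merges word-level tags into phrases. The function name `merge_bio_tags` and the example words and tags are hypothetical and not part of the model card; the label names `B-KEY`, `I-KEY` and `O` are assumed to match the table.

```python
def merge_bio_tags(words, tags):
    """Group words tagged B-KEY / I-KEY into keyphrases (illustrative helper)."""
    keyphrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; close the previous one if it is open.
            if current:
                keyphrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # Continue the keyphrase that is currently open.
            current.append(word)
        else:
            # "O", or an I-KEY with no open keyphrase: close and skip.
            if current:
                keyphrases.append(" ".join(current))
            current = []
    if current:
        keyphrases.append(" ".join(current))
    return keyphrases


# Hypothetical example sentence with hand-written tags.
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(merge_bio_tags(words, tags))  # ['Keyphrase extraction', 'text analysis']
```

This sketch works on whole words; the model itself predicts labels per subword token, so a real pipeline also has to map subword predictions back to words.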
59
 
60
+ ## ✋ Intended Uses & Limitations
61
  ### 🛑 Limitations
62
  * This keyphrase extraction model is very domain-specific and will perform very well on news articles from NY Times. It's not recommended to use this model for other domains, but you are free to test it out.
63
  * Limited number of predicted keyphrases.
64
  * Only works for English documents.
65
+ * For a custom model, please consult the [training notebook]() for more information.
66
 
67
+ ### ❓ How To Use
68
  ```python
69
  from transformers import (
70
  TokenClassificationPipeline,
 
87
  def postprocess(self, model_outputs):
88
  results = super().postprocess(
89
  model_outputs=model_outputs,
90
+ aggregation_strategy=AggregationStrategy.FIRST,
91
  )
92
  return np.unique([result.get("word").strip() for result in results])
93
 
 
101
  ```python
102
  # Inference
103
  text = """
104
+ Keyphrase extraction is a technique in text analysis where you extract the
105
+ important keyphrases from a document. Thanks to these keyphrases humans can
106
+ understand the content of a text very quickly and easily without reading it
107
+ completely. Keyphrase extraction was first done primarily by human annotators,
108
+ who read the text in detail and then wrote down the most important keyphrases.
109
+ The disadvantage is that if you work with a lot of documents, this process
110
+ can take a lot of time.
111
+
112
+ Here is where Artificial Intelligence comes in. Currently, classical machine
113
+ learning methods, that use statistical and linguistic features, are widely used
114
+ for the extraction process. Now with deep learning, it is possible to capture
115
+ the semantic meaning of a text even better than these classical methods.
116
+ Classical methods look at the frequency, occurrence and order of words
117
+ in the text, whereas these neural approaches can capture long-term
118
+ semantic dependencies and context of words in a text.
119
+ """.replace("\n", " ")
120
 
121
  keyphrases = extractor(text)
122
 
123
  print(keyphrases)
124
+
125
  ```
126
 
127
  ```
 
135
  You can find more information here: https://huggingface.co/datasets/midas/kptimes
136
 
137
  ## 👷‍♂️ Training procedure
138
+ For more detailed information, you can take a look at the [training notebook]().
139
 
140
  ### Training parameters
141
 
 
147
 
148
  ### Preprocessing
149
  The documents in the dataset are already preprocessed into a list of words with the corresponding labels. The only remaining steps are tokenization and realignment of the labels so that they correspond to the right subword tokens.
150
+
151
  ```python
152
+ from datasets import load_dataset
153
+ from transformers import AutoTokenizer
154
+
155
+ # Labels
156
+ label_list = ["B", "I", "O"]
157
+ lbl2idx = {"B": 0, "I": 1, "O": 2}
158
+ idx2label = {0: "B", 1: "I", 2: "O"}
159
+
160
+ # Tokenizer
161
+ tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", add_prefix_space=True)
162
+ max_length = 512
163
+
164
+ # Dataset parameters
165
+ dataset_full_name = "midas/kptimes"
166
+ dataset_subset = "raw"
167
+ dataset_document_column = "document"
168
+ dataset_biotags_column = "doc_bio_tags"
169
+
170
  def preprocess_fuction(all_samples_per_split):
171
  tokenized_samples = tokenizer.batch_encode_plus(
172
  all_samples_per_split[dataset_document_column],
 
200
  total_adjusted_labels.append(adjusted_label_ids)
201
  tokenized_samples["labels"] = total_adjusted_labels
202
  return tokenized_samples
203
+
204
+ # Load dataset
205
+ dataset = load_dataset(dataset_full_name, dataset_subset)
206
+
207
+ # Preprocess dataset
208
+ tokenized_dataset = dataset.map(preprocess_fuction, batched=True)
209
+
210
  ```
211
 
212
+ ### Postprocessing (Without Pipeline Function)
213
+ If you do not use the pipeline function, you must select the tokens labeled B and I yourself. Each B token and the I tokens that directly follow it are then merged into one keyphrase. Finally, strip each keyphrase to remove any surrounding whitespace.
214
  ```python
215
  # Define post_process functions
216
  def concat_tokens_by_tag(keyphrases):
 
242
  return np.unique([kp.strip() for kp in extracted_kps])
243
 
244
  ```
 
245
 
246
+ ## 📝 Evaluation Results
247
+
248
+ Keyphrase extraction is traditionally evaluated with precision, recall and F1-score @k and @M, where @k scores only the first k predicted keyphrases and M stands for the average amount of predicted keyphrases.
249
  The model achieves the following results on the KPTimes test set:
250
 
251
  | Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
252
  |:-----------------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|
253
  | KPTimes Test Set | 0.19 | 0.36 | 0.23 | 0.10 | 0.37 | 0.15 | 0.35 | 0.37 | 0.33 |
254
 
255
+ For more information on the evaluation process, you can take a look at the keyphrase extraction [evaluation notebook]().
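
To make the @k metrics more concrete, here is a minimal sketch of how precision, recall and F1 at a cutoff can be computed for a single document. It is not the evaluation notebook's code: the function name, the exact-match comparison, and the choice to treat `k=None` as "score every predicted keyphrase" (one common reading of @M) are all assumptions made for illustration.

```python
def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score the first k predicted keyphrases against the gold keyphrases.

    Assumes `predicted` is ranked and free of duplicates, and uses exact
    string matching. With k=None, every prediction is scored.
    """
    top = predicted if k is None else predicted[:k]
    gold_set = set(gold)
    correct = sum(1 for phrase in top if phrase in gold_set)
    # Precision here divides by the number of predictions actually kept.
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical predictions and gold labels for a single document.
predicted = ["keyphrase extraction", "deep learning", "text analysis", "human annotators"]
gold = ["keyphrase extraction", "text analysis", "artificial intelligence"]

print(precision_recall_f1_at_k(predicted, gold, k=2))
print(precision_recall_f1_at_k(predicted, gold))
```

Scores like those in the table are typically obtained by averaging such per-document values over the whole test set. Some evaluation setups also normalize the phrases (for example by lowercasing or stemming) before matching, which this sketch leaves out.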
256
 
257
  ## 🚨 Issues
258
  Please feel free to start discussions in the Community Tab.