nandakishormpai committed on
Commit cffe3b8
1 Parent(s): 83a9730

Update README.md

Files changed (1)
  1. README.md +82 -29
README.md CHANGED
@@ -2,11 +2,24 @@
  license: apache-2.0
  tags:
  - generated_from_trainer
  metrics:
  - rouge
  model-index:
  - name: t5-small-machine-articles-tag-generation
    results: []
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,7 +27,68 @@ should probably proofread and complete it, then remove this comment. -->

  # t5-small-machine-articles-tag-generation

- This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on the None dataset.
  It achieves the following results on the evaluation set:
  - Loss: 1.8786
  - Rouge1: 35.5143
@@ -23,19 +97,10 @@ It achieves the following results on the evaluation set:
  - Rougelsum: 32.6493
  - Gen Len: 17.5745

- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
  ## Training and evaluation data

- More information needed

- ## Training procedure

  ### Training hyperparameters

@@ -46,26 +111,9 @@ The following hyperparameters were used during training:
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
- - num_epochs: 20
  - mixed_precision_training: Native AMP

- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
- |:-------------:|:-----:|:----:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
- | 2.9902 | 1.0 | 47 | 2.5327 | 21.3135 | 8.6584 | 19.3983 | 19.6598 | 19.0 |
- | 2.7388 | 2.0 | 94 | 2.2970 | 22.3252 | 9.3302 | 21.6035 | 21.61 | 18.3298 |
- | 2.5481 | 3.0 | 141 | 2.1579 | 27.0804 | 13.3149 | 25.5412 | 25.5419 | 18.2553 |
- | 2.4268 | 4.0 | 188 | 2.0718 | 29.7762 | 14.9601 | 27.5516 | 27.4876 | 17.9149 |
- | 2.3651 | 5.0 | 235 | 2.0219 | 31.8162 | 16.0977 | 28.3376 | 28.3442 | 17.8298 |
- | 2.2935 | 6.0 | 282 | 1.9786 | 32.4803 | 16.6321 | 29.6387 | 29.5773 | 17.7553 |
- | 2.2474 | 7.0 | 329 | 1.9522 | 33.176 | 16.3891 | 29.708 | 29.7527 | 17.6915 |
- | 2.2121 | 8.0 | 376 | 1.9300 | 33.3863 | 16.8205 | 30.6451 | 30.5657 | 17.8085 |
- | 2.1792 | 9.0 | 423 | 1.9160 | 34.1843 | 17.523 | 30.6954 | 30.6197 | 17.766 |
- | 2.149 | 10.0 | 470 | 1.9013 | 35.068 | 17.8979 | 31.835 | 31.8103 | 17.7021 |
- | 2.1388 | 11.0 | 517 | 1.8886 | 35.6427 | 18.3297 | 32.2549 | 32.1712 | 17.6489 |
- | 2.1184 | 12.0 | 564 | 1.8786 | 35.5143 | 18.6656 | 32.7292 | 32.6493 | 17.5745 |
-

  ### Framework versions

@@ -73,3 +121,8 @@ The following hyperparameters were used during training:
  - Pytorch 1.13.1+cu116
  - Datasets 2.9.0
  - Tokenizers 0.13.2
  license: apache-2.0
  tags:
  - generated_from_trainer
+ - machine_learning
+ - article_tag
+ - tag_generation
+ - ml_article_tag
+ - blog_tag_generation
+ - summarization
+ - tagging
  metrics:
  - rouge
  model-index:
  - name: t5-small-machine-articles-tag-generation
    results: []
+ widget:
+ - text: "Paige, AI in pathology and genomics\n\nFundamentally transforming the diagnosis and treatment of cancer\nPaige has raised $25M in total. We talked with Leo Grady, its CEO.\nHow would you describe Paige in a single tweet?\nAI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.\nHow did it all start and why? \nPaige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.\nWhat have you achieved so far?\nTEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.\nPRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.\nCUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.\nWhat do you plan to achieve in the next 2 or 3 years?\nCommercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer."
+   example_title: 'ML Article Example #1'
+ language:
+ - en
+ pipeline_tag: summarization
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You

  # t5-small-machine-articles-tag-generation

+ A machine learning model that generates tags for Machine Learning related articles. It is a fine-tuned version of [t5-small](https://huggingface.co/t5-small), trained on a refined version of the [190k Medium Articles](https://www.kaggle.com/datasets/fabiochiusano/medium-articles) dataset to generate Machine Learning article tags from an article's textual content. While tag prediction is usually formulated as a multi-label classification problem, this model treats _tag generation_ as a text2text generation task (inspiration and reference: [fabiochiu/t5-base-tag-generation](https://huggingface.co/fabiochiu/t5-base-tag-generation)).
+ <br><br>
+ Finetuning Notebook Reference: [Hugging Face summarization notebook](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb).
+ # How to use the model
+ ### Installation
+
+ ```bash
+ pip install transformers nltk
+ ```
+ ### Code
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import nltk
+ nltk.download('punkt')
+
+ tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
+ model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
+
+ article_text = """
+ Paige, AI in pathology and genomics
+
+ Fundamentally transforming the diagnosis and treatment of cancer
+ Paige has raised $25M in total. We talked with Leo Grady, its CEO.
+ How would you describe Paige in a single tweet?
+ AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
+ How did it all start and why?
+ Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
+ What have you achieved so far?
+ TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
+ PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
+ CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
+ What do you plan to achieve in the next 2 or 3 years?
+ Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
+ """
+
+ inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")
+ output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=128)
+
+ decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
+
+ tags = [tag.strip() for tag in decoded_output.split(",")]
+
+ print(tags)
+
+ # ['Paige', 'AI in pathology and genomics', 'AI in pathology', 'genomics']
+ ```
+ ## Dataset Preparation
+
+ Of the ~190k articles in the Kaggle dataset, around 12k are Machine Learning related, and their existing tags are fairly high level. More specific tags would be useful when building a tagging system for technical blog platforms.
+ The ML articles were filtered out and around 1000 of them were sampled. The GPT-3 API was used to tag them, and the generated tags were then preprocessed to ensure that only articles with 4 or 5 tags entered the final dataset, which came to around 940 articles.
+
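The tag-count filter described above can be sketched as follows; this is a minimal illustration with hypothetical article records, not the actual preprocessing script used for the dataset:

```python
# Sketch of the tag-count filter: keep only articles whose generated
# tag list has 4 or 5 tags. The records below are hypothetical examples.
articles = [
    {"title": "Intro to CNNs", "tags": ["deep learning", "cnn", "computer vision", "keras"]},
    {"title": "A short note", "tags": ["python", "tips"]},
    {"title": "Transformers explained", "tags": ["nlp", "transformers", "attention", "bert", "pytorch"]},
]

filtered = [a for a in articles if 4 <= len(a["tags"]) <= 5]

print([a["title"] for a in filtered])
# ['Intro to CNNs', 'Transformers explained']
```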
+ ## Intended uses & limitations
+
+ This model is intended primarily for generating tags for Machine Learning articles; it can also be applied to other technical articles, though with lower accuracy and detail. The results may contain duplicate tags, which should be removed when postprocessing the generated output.
+
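Since the generated output can repeat tags, a simple order-preserving, case-insensitive deduplication pass can be applied; this is a sketch of one possible postprocessing step, not part of the model itself:

```python
def dedupe_tags(tags):
    """Remove duplicate tags (case-insensitive) while preserving order."""
    seen = set()
    unique = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(tag.strip())
    return unique

print(dedupe_tags(["AI in pathology", "genomics", "ai in pathology", "Paige"]))
# ['AI in pathology', 'genomics', 'Paige']
```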
+ ## Results
+
  It achieves the following results on the evaluation set:
  - Loss: 1.8786
  - Rouge1: 35.5143
  - Rougelsum: 32.6493
  - Gen Len: 17.5745

  ## Training and evaluation data

+ The ~940-article dataset was split into train, validation, and test sets in an 80:10:10 ratio.
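An 80:10:10 split of ~940 items can be reproduced roughly as follows; this is a sketch assuming a simple seeded shuffle, since the exact split code is not shown in the card:

```python
import random

def split_dataset(items, seed=42):
    """Shuffle and split items into train/val/test at an 80:10:10 ratio."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(940)))
print(len(train), len(val), len(test))  # 752 94 94
```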

  ### Training hyperparameters

  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
+ - num_epochs: 10
  - mixed_precision_training: Native AMP

  ### Framework versions

  - Pytorch 1.13.1+cu116
  - Datasets 2.9.0
  - Tokenizers 0.13.2