Update README.md
---
datasets:
- bakhitovd/data_science_arxiv
metrics:
- rouge
license: cc0-1.0
pipeline_tag: summarization
---

# Fine-tuned Longformer for Summarization of Machine Learning Articles

## Model Details
- GitHub: https://github.com/Bakhitovd/MS_in_Data_Science_Capstone
- Model name: bakhitovd/led-base-7168-ml
- Model type: Longformer (allenai/led-base-16384)
- Model description: This Longformer model has been fine-tuned on a focused subset of the arXiv portion of the scientific papers dataset, specifically targeting articles about machine learning. It aims to generate accurate and consistent summaries of machine learning research papers.

## Intended Use
This model is intended to be used for text summarization tasks, specifically for summarizing machine learning research papers.

## How to Use
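Load the tokenizer and model from the Hugging Face Hub. The snippet assumes a CUDA GPU is available (the generation example below moves the inputs to `"cuda"`), so the model is moved to the same device; drop the `.to("cuda")` call to run on CPU.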
```
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("bakhitovd/led-base-7168-ml")
# move the model to the GPU so it matches the inputs used for generation below
model = LEDForConditionalGeneration.from_pretrained("bakhitovd/led-base-7168-ml").to("cuda")
```

## Use the model for summarization

```
article = "... long document ..."
inputs_dict = tokenizer(article, padding="max_length", max_length=16384, return_tensors="pt", truncation=True)
input_ids = inputs_dict.input_ids.to("cuda")
attention_mask = inputs_dict.attention_mask.to("cuda")

# LED uses global attention on at least the first token
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, 0] = 1

predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512)
summary = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
print(summary)
```
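The `max_length=16384` in the tokenizer call pads or truncates the input to the LED context window, and `max_length=512` caps the length of the generated summary.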

## Training Data
Dataset name: bakhitovd/data_science_arxiv

This dataset is a subset of the 'Scientific papers' dataset containing the articles that are semantically, structurally, and meaningfully closest to articles describing machine learning. The subset was obtained by applying K-means clustering to embeddings generated with SciBERT.
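For illustration only, here is a minimal sketch of how such a subset could be assembled. It assumes the arXiv configuration of the `scientific_papers` dataset and the `allenai/scibert_scivocab_uncased` checkpoint for embeddings; the number of clusters and the cluster-selection rule are placeholders, and the actual pipeline is described in the GitHub repository linked above.

```
import numpy as np
import torch
from datasets import load_dataset
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

# Hypothetical reconstruction of the subsetting step: embed article abstracts
# with SciBERT, cluster the embeddings with K-means, and keep one cluster.
arxiv = load_dataset("scientific_papers", "arxiv", split="train")
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

def embed(text: str) -> np.ndarray:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        # mean-pool the last hidden state into a single vector per article
        return encoder(**inputs).last_hidden_state.mean(dim=1).squeeze(0).numpy()

embeddings = np.stack([embed(abstract) for abstract in arxiv["abstract"]])
labels = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)

# Keep the cluster treated as "machine learning" articles.
# The selection rule here (cluster of the first article) is only a placeholder.
ml_cluster = labels[0]
ml_subset = arxiv.select(np.where(labels == ml_cluster)[0])
```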