---
datasets:
- bakhitovd/data_science_arxiv
metrics:
- rouge
---
# Fine-tuned Longformer for Summarization of Machine Learning Articles

## Model Details
- GitHub: https://github.com/Bakhitovd/MS_in_Data_Science_Capstone
- Model name: bakhitovd/led-base-16384-data-science
- Model type: Longformer (allenai/led-base-16384)
- Model description: This Longformer model was fine-tuned on a focused subset of the arXiv portion of the scientific papers dataset, targeting articles about machine learning. It aims to generate accurate and consistent summaries of machine learning research papers.
## Intended Use
This model is intended for text summarization tasks, specifically summarizing machine learning research papers.

## How to Use
~~~
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("bakhitovd/led-base-16384-data-science")
model = LEDForConditionalGeneration.from_pretrained("bakhitovd/led-base-16384-data-science")
~~~

## Use the model for summarization
~~~
article = "... long document ..."

# Tokenize the full document (LED accepts inputs up to 16384 tokens)
inputs_dict = tokenizer(article, padding="max_length", max_length=16384, return_tensors="pt", truncation=True)
input_ids = inputs_dict.input_ids.to("cuda")
attention_mask = inputs_dict.attention_mask.to("cuda")
model = model.to("cuda")

# LED uses a global attention mask; place global attention on the first token
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, 0] = 1

predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512)
summary = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
print(summary)
~~~
## Training Data
Dataset name: bakhitovd/data_science_arxiv

This dataset is a subset of the 'scientific papers' dataset containing the arXiv articles that are semantically and structurally closest to articles describing machine learning. The subset was obtained by K-means clustering of embeddings generated by SciBERT.
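The clustering-based selection described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: random vectors stand in for SciBERT embeddings, and the cluster count and reference vector are hypothetical.

~~~
# Sketch of subset selection via K-means on document embeddings.
# Assumption: in the real pipeline, `embeddings` comes from SciBERT;
# here random vectors of the same width (768) stand in.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in for SciBERT embeddings

# Cluster all articles (10 clusters is a hypothetical choice)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(embeddings)

# Pick the cluster whose centroid is nearest to a reference
# "machine learning" embedding (hypothetical reference vector here)
reference = rng.normal(size=(768,))
distances = np.linalg.norm(kmeans.cluster_centers_ - reference, axis=1)
ml_cluster = int(np.argmin(distances))

# Indices of articles assigned to that cluster form the subset
selected = np.where(kmeans.labels_ == ml_cluster)[0]
~~~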
## Evaluation Results
The model's performance was evaluated using ROUGE metrics, and it showed improved performance over the baseline models.

![image.png](https://s3.amazonaws.com/moonup/production/uploads/63fb9a520aa18292d5c1027a/19mfKrjHkiCFDAL557Vsu.png)
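To illustrate the metric referenced above, here is a minimal ROUGE-1 F1 sketch based on unigram overlap. The actual evaluation presumably used a full ROUGE implementation (e.g. the `rouge-score` package), which also handles stemming and ROUGE-2/ROUGE-L.

~~~
# Minimal ROUGE-1 F1: unigram overlap between reference and candidate.
# Illustration only; real evaluations use a full ROUGE implementation.
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the model summarizes papers",
                "the model summarizes machine learning papers"))  # 0.8
~~~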