haining committed on
Commit
3cb8e62
1 Parent(s): 3a916f3

Update README.md

Files changed (1)
  1. README.md +24 -19
README.md CHANGED
@@ -39,29 +39,33 @@ widget:
 
 # TL;DR
 
- Scientific Abstract Simplification (SAS) is a tool that rewrites difficult-to-understand scientific abstracts into simpler, easier-to-read versions. Our goal is to make scientific knowledge more accessible to everyone. If you've already tried our baseline model (sas_baseline), the current model is even better on all evaluation metrics. You can try it now with the Hosted Inference API on the right. Simply choose one of the provided examples or enter your own scientific abstract. Just remember to include the instruction "summarize, simplify, and contextualize: " at the beginning (with a space after the colon). For local use, see the [Usage] section.
 
 # Project Description
 
- Open science has significantly lowered the barriers to scientific papers.
- However, reachable research does not mean accessible knowledge. Scientific papers are usually replete with jargon and hard to read. A lay audience would rather trust little stories on social media than read scientific papers. They are not to blame, we human like stories.
- So why do not we "translate" arcane scientific abstracts into simpler yet relevant scientific stories🤗?
- Some renowned journals have already taken accessibility into consideration. For example, PNAS asks authors to submit Significance Statements targeting "an undergraduate-educated scientist." Science also includes an editor abstract for a quick dive.
 
- In this project, we propose to *rewrite scientific abstracts into understandable scientific stories using AI*.
- To this end, we introduce two new corpora: one comprises PNAS abstract-significance pairs and the other contains editor abstracts from Science.
- We finetune the scientifc abstract simplification task using an encoder-decoder Transformer model (a variant of Flan-T5).
- Our model is first tuned with multiple discrete instructions by mixing four relevant tasks in a challenge-proportional manner.
- Then we continue tuning the model solely with the abstract-significance corpus.
- The model can generate better lay summaries compared with models finetuned only with the abstract-significance corpus and models finetuned with task mixtures in traditonal ways.
- We hope our work can pave the last mile of scientific understanding and let people better enjoy the fruits of open science.
 
 - **Model type:** Language model
 - **Developed by:**
 - PIs: Jason Clark and Hannah McKelvey, Montana State University
- - Fellow: Haining Wang, Indiana University Bloomington
 - Collaborator: Zuoyu Tian, Indiana University Bloomington
 - [LEADING](https://cci.drexel.edu/mrc/leading/) Montana State University Library, Project "TL;DR it": Automating Article Synopses for Search Engine Optimization and Citizen Science
 - **Language(s) (NLP):** English
@@ -119,25 +123,27 @@ We finetuned the base model (flan-t5-large) on multiple relevant tasks with stan
 | Total | Challenge-proportional Mixing | n/a | 263,400 |
 
- - Multi-instruction tuning: In the stage, we first created a task mixture using "challenge-proportional mixing" method. In a seperate pilot studie, for each task, we finetuned it on a base model and observed the number of samples when validation loss starts to rise. We mixed the samples of each task proportional to its optimal number of samples. A corpus is exhausted before upsampling if the number of total samples is smaller than its optimal number. We finetune with the task mixture (263,400 samples) with the aforementioned template.
 
- - Retuning: In this stage, we continued finetuning the checkpoint solely with the Scientific Abstract-Significance corpus till optimal validation loss was observed.
 
 The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours, respectively, on two NVIDIA RTX A5000 (24GB memory each) GPUs. We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer and a learning rate of 3e-5 with a fully sharded data parallel strategy across training stages. The batch size was 1.
 
  # Evaluation
 
- The model is evaluated on the SAS test set using the following metrics.
 
 ## Metrics
 
  - [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu): SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
 - [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore): BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
 - [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)-1/2/L: ROUGE is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
 - [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor): METEOR is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
 - [SARI](https://huggingface.co/spaces/evaluate-metric/sari): SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system: SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision score for the delete operation, with n = 4 as in the original paper.
 - [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, if the ARI outputs the number 10, this equates to a high school student, ages 15-16 years old; a number 3 means students in 3rd grade (ages 8-9 yrs. old) should be able to comprehend the text.
-
 
 Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face [`evaluate`](https://pypi.org/project/evaluate/) v0.3.0. ARI is from [`py-readability-metrics`](https://pypi.org/project/py-readability-metrics/) v1.4.5.
 
@@ -166,8 +172,7 @@ Please [contact us](mailto:hw56@indiana.edu) for any questions or suggestions.
 
 # Disclaimer
 
- This model is created for making scientific abstracts more accessible. Its outputs should not be used or trusted outside of its scope. There is no guarantee that the generated text is perfectly aligned with the research. Resort to human experts or original papers when a decision is critical.
-
 
  # Acknowledgement
  This research is supported by the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.
 
 # TL;DR
 
+ Scientific Abstract Simplification (SAS) is a tool that rewrites difficult-to-understand scientific abstracts into simpler, easier-to-read versions.
+ Our goal is to make scientific knowledge more accessible to everyone. If you've already tried our baseline model (`sas_baseline`),
+ the current model is even better on all evaluation metrics. You can try it now with the Hosted Inference API on the right.
+ Simply choose one of the provided examples or enter your own scientific abstract.
+ Just remember to include the instruction "summarize, simplify, and contextualize: " at the beginning (with a space after the colon).
+ For local use, see the [Usage](#usage) section.
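As a quick preview of local use, here is a minimal inference sketch. It is not the card's official Usage section: the model id is a placeholder, and the generation settings are illustrative assumptions; only the instruction prefix comes from the card.

```python
# Hypothetical local-inference sketch with the Hugging Face transformers API.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "path/to/this-model"  # placeholder: replace with this repository's model id
INSTRUCTION = "summarize, simplify, and contextualize: "  # note the space after the colon

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

abstract = "Paste a scientific abstract here."
inputs = tokenizer(INSTRUCTION + abstract, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=512, num_beams=4)  # illustrative settings
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```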
 
 # Project Description
 
+ "Open science has greatly reduced the barriers to accessing scientific papers. However, reachable research does not mean accessible knowledge.
53
+ As a result, many people may prefer to trust short stories on social media rather than attempting to read a scientific paper.
54
+ This is understandable, as we humans often prefer stories to dry, technical information.
55
+ So, why not "translate" these complex scientific abstracts into simpler, more accessible stories? Some prestigious journals are already taking steps towards greater accessibility.
56
+ For example, PNAS requires authors to submit Significance Statements that can be understood by an "undergraduate-educated scientist," and Science includes an editor abstract for a quick overview of the paper's key points.
57
 
58
+ In this project, we aim to use AI to rewrite scientific abstracts as easily understandable scientific stories.
59
+ To do this, we have created two new datasets: one containing PNAS abstract-significance pairs, and the other containing editor abstracts from Science.
60
+ We use a Transformer model (a variant called Flan-T5) to fine-tune our model for the task of simplifying scientific abstracts. Initially, the model will be fine-tuned using multiple discrete instructions by combining four relevant tasks in a challenge-proportional manner (i.e., we call it Multi-Instruction Pretuning).
61
+ Then, we continue fine-tuning the model using only the abstract-significance corpus. Our model is able to produce lay summaries that are better than models fine-tuned only with the abstract-significance corpus and models fine-tuned with task combinations in traditional ways.
62
+ We hope our work can facilitate greater understanding of scientific research and allow more people to benefit from open science.
 
 
63
 
64
 
65
  - **Model type:** Language model
 - **Developed by:**
 - PIs: Jason Clark and Hannah McKelvey, Montana State University
+ - Fellow: Haining Wang, Indiana University Bloomington; Deanna Zarrillo, Drexel University
 - Collaborator: Zuoyu Tian, Indiana University Bloomington
 - [LEADING](https://cci.drexel.edu/mrc/leading/) Montana State University Library, Project "TL;DR it": Automating Article Synopses for Search Engine Optimization and Citizen Science
  - **Language(s) (NLP):** English
 
 | Total | Challenge-proportional Mixing | n/a | 263,400 |
 
+ - Multi-instruction pretuning: In this stage, we first created a task mixture using the "challenge-proportional mixing" method. In a separate pilot study, we fine-tuned the base model on each task and noted the number of samples at which the validation loss started to rise. We then mixed the samples of each task in proportion to this optimal number; a corpus with fewer total samples than its optimal number is used in full rather than upsampled. We fine-tuned on the resulting task mixture (263,400 samples) with the aforementioned template (see the sketch after this list).
 
+ - Fine-tuning: In this stage, we continued fine-tuning the checkpoint solely with the Scientific Abstract-Significance corpus until the optimal validation loss was observed.
 
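The mixing step above can be pictured with a short sketch. This is not the project's actual preprocessing code: the function name, the per-task optimal sizes, and the sampling details are illustrative assumptions; only the "use a small corpus in full, never upsample" rule comes from the card.

```python
# Illustrative sketch of challenge-proportional mixing (not the project's code).
# The per-task "optimal" sizes would come from the pilot fine-tuning runs described above.
import random

def challenge_proportional_mix(corpora, optimal_sizes, seed=42):
    """corpora: {task: [samples]}; optimal_sizes: {task: sample count from the pilot study}."""
    rng = random.Random(seed)
    mixture = []
    for task, samples in corpora.items():
        target = optimal_sizes[task]
        if len(samples) <= target:
            # A small corpus is exhausted (used in full) rather than upsampled.
            mixture.extend(samples)
        else:
            # A large corpus is downsampled to its optimal size.
            mixture.extend(rng.sample(samples, target))
    rng.shuffle(mixture)
    return mixture
```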
 The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours, respectively, on two NVIDIA RTX A5000 (24GB memory each) GPUs. We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer and a learning rate of 3e-5 with a fully sharded data parallel strategy across training stages. The batch size was 1.
 
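For orientation, the reported hyperparameters map onto the Hugging Face `transformers` Trainer API roughly as follows. This is a sketch only; the output path and epoch count are placeholders, not values from the card.

```python
# Hypothetical training-arguments sketch mirroring the hyperparameters reported above.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./sas_checkpoints",   # placeholder path
    learning_rate=3e-5,               # reported learning rate
    per_device_train_batch_size=1,    # reported batch size
    fsdp="full_shard",                # fully sharded data parallel, as reported
    num_train_epochs=1,               # placeholder; the card keeps the lowest-validation-loss checkpoint
)
# AdamW is the Trainer's default optimizer, matching the card's description.
```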
  # Evaluation
 
+ The model is evaluated on the SAS test set using SacreBLEU, METEOR, BERTScore, ROUGE, SARI, and ARI.
 
  ## Metrics
+ <details>
+ <summary> Click to expand </summary>
+
  - [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu): SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
 - [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore): BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
 - [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)-1/2/L: ROUGE is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
 - [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor): METEOR is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
 - [SARI](https://huggingface.co/spaces/evaluate-metric/sari): SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system: SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision score for the delete operation, with n = 4 as in the original paper.
 - [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, if the ARI outputs the number 10, this equates to a high school student, ages 15-16 years old; a number 3 means students in 3rd grade (ages 8-9 yrs. old) should be able to comprehend the text.
+ </details>
 
 Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face [`evaluate`](https://pypi.org/project/evaluate/) v0.3.0. ARI is from [`py-readability-metrics`](https://pypi.org/project/py-readability-metrics/) v1.4.5.
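A toy sketch of computing these metrics with the two libraries named above follows. The example texts are placeholders, and the exact evaluation pipeline used by the project may differ.

```python
# Toy evaluation sketch with `evaluate` and `py-readability-metrics`; inputs are placeholders.
import evaluate
from readability import Readability  # provided by py-readability-metrics

sources = ["An original, jargon-heavy abstract goes here."]
predictions = ["The model's simplified version goes here."]
references = [["A human-written lay summary goes here."]]

sacrebleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=[r[0] for r in references])
meteor = evaluate.load("meteor").compute(predictions=predictions, references=[r[0] for r in references])
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=[r[0] for r in references], lang="en"
)
sari = evaluate.load("sari").compute(sources=sources, predictions=predictions, references=references)

# ARI: py-readability-metrics expects a text of at least 100 words (and NLTK's punkt tokenizer).
long_output = " ".join(predictions * 40)
ari = Readability(long_output).ari()

print(sacrebleu["score"], meteor["meteor"], sari["sari"], ari.score)
```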
 
 
 # Disclaimer
 
+ This model is designed to make scientific abstracts more accessible. Its outputs should not be relied upon for any purpose outside of this scope. There is no guarantee that the generated text accurately reflects the research it is based on. When making important decisions, it is recommended to seek the advice of human experts or consult the original papers.
 
  # Acknowledgement
  This research is supported by the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.