haining committed
Commit a58dea3
1 Parent(s): fdda21e

Update README.md

Files changed (1)
  1. README.md +38 -28
README.md CHANGED
@@ -36,10 +36,10 @@ widget:
36
 
37
  # TL;DR
38
 
39
- Scientific Abstract Simplification-baseline *translates* hard-to-read scientific abstracts😵 into more accessible language😇. We hope it can make scientific knowledge accessible for everyone.
40
 
41
- Try it now with Hosted inference API on the right.
42
- You can choose an existing example or paste in any (perhaps full-of-jargon) abstract, but don't forget to include the instruction before the abstract ("summarize, simplify, and contextualize: "; notice, there is a whitespace after the colon). Local use refers to Section [Usage](#Usage).
43
 
44
 
45
  # Model Details
@@ -48,22 +48,24 @@ You can choose an existing example or paste in any (perhaps full-of-jargon) abst
48
 
49
 
50
  Open science has significantly lowered the barriers to accessing scientific papers.
51
- However, reachable does not mean accessible. Scientific papers are usually flooded with jargon and hard to read. A lay audience would rather trust little stories on social media than read scientific papers. They are not to blame, we human like stories.
52
  So why don't we "translate" arcane scientific abstracts into accessible, simpler, and relevant scientific stories?
53
- Some renowned journals have already taken accessibility into consideration. For example, PNAS asks authors to submit Significance Statements targeting "an undergraduate-educated scientist." Science also includes an editor abstract for a quick dive on the following research.
54
  We propose to rewrite scientific abstracts into understandable scientific stories using AI.
55
- To this end, we introduce a new corpus comprising abstract and significance-statement pairs.
56
  We finetune an encoder-decoder Transformer model (a variant of Flan-T5) with the corpus.
57
  Our baseline model (SAS-baseline) shows promising capacity in simplifying and summarizing scientific abstracts.
58
- We hope our work can let people better enjoy the fruits of open science.
59
- As an ongoing effort, we are still working on boosting the model's performance of re-contextualization and avoiding certain jargon tokens during inference (which lowers readability).
 
60
 
61
  <!-- We hypothesize the last mile of scientific understanding is cognitive. -->
62
 
63
  - **Model type:** Language model
64
- - **Developed by:** [LEADING](https://cci.drexel.edu/mrc/leading/) Montana State University Library ("TL;DR it": Automating Article Synopses for Search Engine Optimization and Citizen Science).
65
  - Mentors: Jason Clark and Hannah McKelvey
66
- - Fellows: Haining Wang and Deanna Zarrillo.
 
67
  - **Language(s) (NLP):** English
68
  - **License:** MIT
69
  - **Parent Model:** [FLAN-T5-large](https://huggingface.co/google/flan-t5-large)
@@ -86,8 +88,16 @@ model = AutoModelForSeq2SeqLM.from_pretrained("haining/sas_baseline")
86
  input_text = "The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making."
87
 
88
  encoding = tokenizer(
89
- INSTRUCTION + input_text, max_length=672, padding='max_length', truncation=True, return_tensors='pt')
90
- decoded_ids = model.generate(input_ids=encoding['input_ids'], attention_mask=encoding['attention_mask'], max_new_tokens=512, top_p=.9, do_sample=True)
91
 
92
  print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))
93
  ```
@@ -117,12 +127,12 @@ Notice, the readability of the signifiance statements is generally lower than th
117
  The model is evaluated on the SAS test set using the following metrics.
118
 
119
  ## Metrics
120
- - sacreBLEU: [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
121
- - BERT Score: [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
122
  - ROUGE-1/2/L: [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge), or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against a reference (or a set of references) produced by humans.
123
- - METEOR: [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor), an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
124
- - SARI: [SARI](https://huggingface.co/spaces/evaluate-metric/sari) is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system. Sari = (F1_add + F1_keep + P_del) / 3 where F1_add: n-gram F1 score for add operation F1_keep: n-gram F1 score for keep operation P_del: n-gram precision score for delete operation n = 4, as in the original paper.
125
- - Automated Readability Index (ARI): [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php) is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, if the ARI outputs the number 10, this equates to a high school student, ages 15-16 years old; a number 3 means students in 3rd grade (ages 8-9 yrs. old) should be able to comprehend the text.
126
 
127
 
128
  Implementations of sacreBLEU, BERT Score, ROUGE, METEOR, and SARI are from Hugging Face [`evaluate`](https://pypi.org/project/evaluate/) v0.3.0. ARI is from [`py-readability-metrics`](https://pypi.org/project/py-readability-metrics/) v1.4.5.
@@ -130,16 +140,16 @@ Implementations of sacreBLEU, BERT Score, ROUGLE, METEOR, and SARI are from Hugg
130
 
131
  ## Results
132
 
133
- | Metrics | SAS-baseline |
134
- |----------------|--------------|
135
- | sacreBLEU↑ | 20.97 |
136
- | BERT Score F1↑ | 0.89 |
137
- | ROUGLE-1↑ | 0.48 |
138
- | ROUGLE-2↑ | 0.23 |
139
- | ROUGLE-L↑ | 0.32 |
140
- | METEOR↑ | 0.39 |
141
- | SARI↑ | 46.83 |
142
- | ARI↓* | 17.12 (1.97) |
143
 
144
  * Note: Half of the generated texts are too short (fewer than 100 words) to calculate a meaningful ARI. We therefore concatenated adjacent texts in pairs and computed ARI over the resulting 100 texts (instead of the original 200).
145
 
@@ -150,7 +160,7 @@ Please [contact us](mailto:hw56@indiana.edu) for any questions or suggestions.
150
 
151
  # Disclaimer
152
 
153
- The model (SAS-baseline) is created for making scientific abstracts more accessible. Its outputs should not be used or trusted outside of its scope. There is **NO** guarantee that the generated text is perfectly aligned with the research. Resort to human experts or original papers when a decision is critical.
154
 
155
 
156
  # Acknowledgement
 
36
 
37
  # TL;DR
38
 
39
+ Scientific Abstract Simplification-baseline *translates* hard-to-read scientific abstracts😵 into more accessible language😇. We hope it can make scientific knowledge accessible for everyone🤗.
40
 
41
+ Try it now with the Hosted inference API on the right.
42
+ You can choose an existing example or paste in any (perhaps jargon-heavy) abstract. Remember to prepend the instruction "summarize, simplify, and contextualize: " to the abstract (note the whitespace after the colon). For local use, see the [Usage](#Usage) section.
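
For example, a complete input looks like this (the instruction prefix followed by the abstract):

```
summarize, simplify, and contextualize: <paste the abstract here>
```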
43
 
44
 
45
  # Model Details
 
48
 
49
 
50
  Open science has significantly lowered the barriers to accessing scientific papers.
51
+ However, reachable research does not mean accessible knowledge. Scientific papers are usually replete with jargon and hard to read. A lay audience would rather trust little stories on social media than read scientific papers. They are not to blame; we humans like stories.
52
  So why don't we "translate" arcane scientific abstracts into accessible, simpler, and relevant scientific stories?
53
+ Some renowned journals have already taken accessibility into consideration. For example, PNAS asks authors to submit Significance Statements targeting "an undergraduate-educated scientist." Science also includes an editor's abstract for a quick dive into the research.
54
  We propose to rewrite scientific abstracts into understandable scientific stories using AI.
55
+ To this end, we introduce a new corpus comprising PNAS abstract-significance pairs.
56
  We finetune an encoder-decoder Transformer model (a variant of Flan-T5) with the corpus.
57
  Our baseline model (SAS-baseline) shows promising capacity in simplifying and summarizing scientific abstracts.
58
+ We hope our work can pave the last mile of scientific understanding and let people better enjoy the fruits of open science.
59
+
60
+ As an ongoing effort, we are working on re-contextualizing abstracts for better storytelling and on avoiding certain jargon tokens at inference time for better readability.
61
 
62
  <!-- We hypothesize the last mile of scientific understanding is cognitive. -->
63
 
64
  - **Model type:** Language model
65
+ - **Developed by:**
66
  - Mentors: Jason Clark and Hannah McKelvey
67
+ - Fellows: Haining Wang and Deanna Zarrillo
68
+ - We are from the [LEADING](https://cci.drexel.edu/mrc/leading/) program and the Montana State University Library ("TL;DR it": Automating Article Synopses for Search Engine Optimization and Citizen Science).
69
  - **Language(s) (NLP):** English
70
  - **License:** MIT
71
  - **Parent Model:** [FLAN-T5-large](https://huggingface.co/google/flan-t5-large)
 
88
  input_text = "The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making."
89
 
90
  encoding = tokenizer(
91
+ INSTRUCTION + input_text,
92
+ max_length=672,
93
+ padding='max_length',
94
+ truncation=True,
95
+ return_tensors='pt')
96
+ decoded_ids = model.generate(input_ids=encoding['input_ids'],
97
+ attention_mask=encoding['attention_mask'],
98
+ max_new_tokens=512,
99
+ top_p=.9,
100
+ do_sample=True)
101
 
102
  print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))
103
  ```
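
If you run the snippet on its own, define the instruction prefix (the same one shown in the TL;DR) before encoding:

```python
INSTRUCTION = "summarize, simplify, and contextualize: "
```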
 
127
  The model is evaluated on the SAS test set using the following metrics.
128
 
129
  ## Metrics
130
+ - [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu): SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
131
+ - [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore): BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
132
  - ROUGE-1/2/L: [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge), or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against a reference (or a set of references) produced by humans.
133
+ - [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor): METEOR is an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
134
+ - [SARI](https://huggingface.co/spaces/evaluate-metric/sari): SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system: SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision for the delete operation (n = 4, as in the original paper).
135
+ - [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, if the ARI outputs the number 10, this equates to a high school student, ages 15-16 years old; a number 3 means students in 3rd grade (ages 8-9 yrs. old) should be able to comprehend the text.
136
 
137
 
138
  Implementations of sacreBLEU, BERT Score, ROUGE, METEOR, and SARI are from Hugging Face [`evaluate`](https://pypi.org/project/evaluate/) v0.3.0. ARI is from [`py-readability-metrics`](https://pypi.org/project/py-readability-metrics/) v1.4.5.
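
A minimal sketch of how the automatic metrics can be computed with the libraries above (the strings below are placeholders; the reported scores use the SAS test set):

```python
import evaluate  # Hugging Face `evaluate` v0.3.0

# Placeholder strings; the reported scores use the SAS test set.
sources = ["<original abstract>"]                          # needed by SARI only
predictions = ["<model-generated significance statement>"]
references = ["<human-written significance statement>"]    # one reference per example

sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=[[r] for r in references]))

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

meteor = evaluate.load("meteor")
print(meteor.compute(predictions=predictions, references=references))

sari = evaluate.load("sari")
print(sari.compute(sources=sources, predictions=predictions,
                   references=[[r] for r in references]))
```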
 
140
 
141
  ## Results
142
 
143
+ | Metrics | SAS-baseline |
144
+ |----------------|-------------------|
145
+ | sacreBLEU↑ | 20.97 |
146
+ | BERT Score F1↑ | 0.89 |
147
+ | ROUGE-1↑ | 0.48 |
148
+ | ROUGE-2↑ | 0.23 |
149
+ | ROUGE-L↑ | 0.32 |
150
+ | METEOR↑ | 0.39 |
151
+ | SARI↑ | 46.83 |
152
+ | ARI↓* | 17.12 (std. 1.97) |
153
 
154
  * Note: Half of the generated texts are too short (fewer than 100 words) to calculate a meaningful ARI. We therefore concatenated adjacent texts in pairs and computed ARI over the resulting 100 texts (instead of the original 200).
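
A sketch of that workaround with `py-readability-metrics` (the `generated_texts` name is illustrative):

```python
import statistics
from readability import Readability  # py-readability-metrics v1.4.5

def ari_scores(texts):
    """Concatenate adjacent generations in pairs so each chunk clears the
    100-word minimum the library requires, then compute ARI per chunk."""
    pairs = [" ".join(texts[i:i + 2]) for i in range(0, len(texts), 2)]
    return [Readability(p).ari().score for p in pairs]

# e.g., scores = ari_scores(generated_texts)  # 200 generations -> 100 ARI values
#       print(statistics.mean(scores), statistics.stdev(scores))
```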
155
 
 
160
 
161
  # Disclaimer
162
 
163
+ The model (SAS-baseline) was created to make scientific abstracts more accessible. Its outputs should not be used or trusted outside of its scope. There is no guarantee that the generated text is perfectly aligned with the research. Consult human experts or the original papers when a decision is critical.
164
 
165
 
166
  # Acknowledgement