
RoBERTa Science writing strategy classifiers

This is a finetuned RoBERTa model from the paper:

"Writing Strategies for Science Communication: Data and Computational Analysis", By Tal August, Lauren Kim, Katharina Reinecke, and Noah A. Smith Published at the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020.

Abstract: Communicating complex scientific ideas without misleading or overwhelming the public is challenging. While science communication guides exist, they rarely offer empirical evidence for how their strategies are used in practice. Writing strategies that can be automatically recognized could greatly support science communication efforts by enabling tools to detect and suggest strategies for writers. We compile a set of writing strategies drawn from a wide range of prescriptive sources and develop an annotation scheme allowing humans to recognize them. We collect a corpus of 128k science writing documents in English and annotate a subset of this corpus. We use the annotations to train transformer-based classifiers and measure the strategies’ use in the larger corpus. We find that the use of strategies, such as storytelling and emphasizing the most important findings, varies significantly across publications with different reader audiences.

Description

The model is finetuned on the task of identifying whether a given sentence from a science news article uses a particular writing strategy (e.g., emphasizing the real-world impact of the scientific findings).

The intended use of this model is to identify common science communication writing strategies.

The model is trained on annotated sentences drawn from science news articles. The URLs for the original news articles are available at https://github.com/talaugust/scientific-writing-strategies.
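Since the classifier makes a binary decision per sentence (strategy present or not), its raw logits can be turned into a probability with a softmax and thresholded. Below is a minimal, self-contained sketch of that post-processing step; the logit values are made up for illustration, and in practice they would come from running the finetuned model (e.g., via the Transformers library) on a sentence:

```python
import math

def softmax(logits):
    """Convert raw classifier logits to probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence: [no-strategy, strategy].
# Real values would come from the finetuned classifier's output head.
logits = [0.4, 2.1]
p_no, p_yes = softmax(logits)

# Flag the sentence as using the strategy if its probability exceeds 0.5.
uses_strategy = p_yes > 0.5
```

The 0.5 threshold is just the default argmax decision; a stricter threshold could be used if higher precision is needed when measuring strategy use across a large corpus.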

Biases & Limitations

The goal of this model is to enable a wider audience of readers to understand and engage with scientific writing. A risk, though, is that such attempts might instead widen the gap to accessing scientific information. The texts in the datasets we train our models on are in General or Academic American English. Many people, especially those who have been historically underrepresented in STEM disciplines and medicine, may not be comfortable with this dialect of English. This risks further alienating the readers we hope to serve. An important and exciting direction in NLP is making models more flexible to dialects and low-resource languages.
