
BART Scientific Definition Generation

This is a finetuned BART Large model from the paper:

"Generating Scientific Definitions with Controllable Complexity"

By Tal August, Katharina Reinecke, and Noah A. Smith

Abstract: Unfamiliar terminology and complex language can present barriers to understanding science. Natural language processing stands to help address these issues by automatically defining unfamiliar terms. We introduce a new task and dataset for defining scientific terms and controlling the complexity of generated definitions as a way of adapting to a specific reader’s background knowledge. We test four definition generation methods for this new task, finding that a sequence-to-sequence approach is most successful. We then explore the version of the task in which definitions are generated at a target complexity level. We introduce a novel reranking approach and find in human evaluations that it offers superior fluency while also controlling complexity, compared to several controllable generation baselines.

Description

The model is finetuned on the task of generating definitions of scientific terms. We frame our task as generating an answer to the question “What is (are) X?” Along with the question, the model takes a support document of scientific abstracts related to the term being defined.
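For illustration, a model input can be assembled as in the sketch below. The term and abstract strings are hypothetical placeholders; the "question: ... context: <P> ..." format follows the usage example in the How to use section.

  # A minimal sketch of how an input string can be assembled.
  # The term and abstract texts below are hypothetical placeholders.
  term = "surfactants"
  abstracts = [
      "First abstract related to the term...",
      "Second abstract related to the term...",
  ]
  # Each abstract in the support document is prefixed with a <P> marker.
  model_input = (
      f"question: What is (are) {term}? "
      "context: " + " ".join(f"<P> {a}" for a in abstracts)
  )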

Intended use

The intended use of this model is to generate definitions of scientific terms. It is NOT intended for public deployment due to the risk of hallucinated information in model output. Strong supervision of definition factuality is important for any future deployment of such a system. While hallucinated information can be damaging in any generation context, incorrect scientific definitions could mislead readers and potentially contribute to broader scientific misinformation. The model is trained on data we believe is trustworthy (e.g., questions and answers from NIH websites); however, this is no guarantee that model output will also be trustworthy.

Training data

The model is trained on data from two sources: Wikipedia science glossaries and a portion of the MedQuAD dataset, which contains healthcare consumer questions and answers from NIH websites. For more information on these data sources, see the GitHub repository for the paper.

How to use

Note that this model was trained and evaluated using transformers version 4.2.2.

  from transformers import (
       AutoTokenizer,
       AutoModelForSeq2SeqLM,
       AutoConfig, 
  )
  
  bart_sci_def_tokenizer = AutoTokenizer.from_pretrained("talaugust/bart-sci-definition")
  bart_sci_def_model = AutoModelForSeq2SeqLM.from_pretrained("talaugust/bart-sci-definition")
  
  # Build the input: the question followed by a support document of abstracts,
  # with <P> marking the start of each abstract. Inputs are truncated to the
  # model's 1024-token limit.
  inputs = bart_sci_def_tokenizer(
      "question: What is (are) surfactants? context: <P> .... <P> ....",
      return_tensors='pt',
      truncation=True,
      max_length=1024,
  )
  
  outputs = bart_sci_def_model.generate(**inputs,
                                   decoder_start_token_id=bart_sci_def_tokenizer.bos_token_id,
                                   num_return_sequences=1,
                                   num_beams=5,
                                   max_length=64,
                                   min_length=8,
                                   early_stopping=True,
                                   do_sample=True,
                                   top_k=50,
                                   top_p=0.9,
                                   no_repeat_ngram_size=3)
  
  # Decode each returned sequence into a definition string.
  answers = [bart_sci_def_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in outputs]
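If a GPU is available, the model and inputs can be moved to it before calling generate. A minimal sketch, assuming PyTorch and the objects defined above:

  import torch
  
  # Optionally run generation on a GPU if one is available.
  device = "cuda" if torch.cuda.is_available() else "cpu"
  bart_sci_def_model = bart_sci_def_model.to(device)
  inputs = {k: v.to(device) for k, v in inputs.items()}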

Biases & Limitations

The goal of this model is to enable a wider audience of readers to understand and engage with scientific writing. A risk, though, is that such attempts might instead widen the gap to accessing scientific information. The texts in the datasets we train our models on are in General or Academic American English. Many people, especially those who have been historically underrepresented in STEM disciplines and medicine, may not be comfortable with this dialect of English. This risks further alienating the readers we hope to serve. An important and exciting direction in NLP is making models more flexible to dialects and low-resource languages.
