astroBERT: a language model for astrophysics

This public repository contains the work of NASA/ADS on building an NLP language model tailored to astrophysics, along with tutorials and miscellaneous related files. The model is cased: it treats ads and ADS as different tokens.

astroBERT models

  1. Base model: Pretrained on English-language text using masked language modeling (MLM) and next sentence prediction (NSP) objectives. It was introduced in this paper at ADASS 2021 and made public at ADASS 2022.
  2. NER-DEAL model: Adds a token classification head to the base model, finetuned on the DEAL@WIESP2022 named entity recognition task. It must be loaded from the revision='NER-DEAL' branch (see the NER-DEAL predictions tutorial).
  3. SciX Categorizer: Finetuned to classify text into one of the following 8 categories used by SciX: Astronomy, Heliophysics, Planetary Science, Earth Science, NASA-funded Biophysics, Other Physics, Other, Text Garbage.
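
The branch-based loading mentioned for the NER-DEAL model can be sketched with the Hugging Face `transformers` library roughly as follows. This is a minimal sketch, not the repository's own tutorial code: the repository id `adsabs/astroBERT` and the helper name `load_ner_deal` are assumptions made for illustration, and the tokenizer is assumed to be available on the same branch as the model. Only the `revision='NER-DEAL'` value comes from the text above.

```python
# Assumption: the Hugging Face repository id is not stated on this page.
MODEL_ID = "adsabs/astroBERT"
# Branch name given in the model list above.
NER_REVISION = "NER-DEAL"


def load_ner_deal(model_id=MODEL_ID, revision=NER_REVISION):
    """Load the tokenizer and token classification model from a branch.

    Hypothetical helper: calling it downloads the weights from the Hub.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    # Assumption: the tokenizer files are present on the same branch.
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, revision=revision
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_ner_deal()` would then give a model whose outputs contain one set of label logits per input token. Omitting `revision` loads the default branch instead, which is assumed here to hold the base model.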

Tutorials

  1. generate text embedding (for downstream tasks)
  2. use astroBERT for the Fill-Mask task
  3. make NER-DEAL predictions
  4. categorize texts for SciX
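
As a rough illustration of the first two tutorials, the sketch below shows one way to query the base model for Fill-Mask predictions and to turn texts into fixed-size embeddings. It is not taken from the tutorials themselves. The repository id `adsabs/astroBERT`, the helper names `mask_word`, `fill_mask`, and `embed`, and the choice of mean pooling over token vectors are all assumptions; the tutorials may pool differently (for example, by using the [CLS] vector).

```python
from typing import List

# Assumption: the Hugging Face repository id is not stated on this page.
MODEL_ID = "adsabs/astroBERT"
MASK_TOKEN = "[MASK]"


def mask_word(sentence: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def fill_mask(masked_sentence: str, top_k: int = 3) -> List[str]:
    """Return the top_k predicted tokens for a sentence with exactly one mask."""
    # Lazy import: building the pipeline downloads the model.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    predictions = unmasker(masked_sentence, top_k=top_k)
    return [pred["token_str"] for pred in predictions]


def embed(texts: List[str]):
    """Return one mean-pooled embedding per text, shape (len(texts), hidden)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average over real tokens only; padded positions have attention mask 0.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

For example, `fill_mask(mask_word("The Hubble constant is large.", "constant"))` would ask the model for candidate replacements of the masked word. Because the model is cased, inputs should be passed with their original capitalization.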

BibTeX

@ARTICLE{2021arXiv211200590G,
       author = {{Grezes}, Felix and {Blanco-Cuaresma}, Sergi and {Accomazzi}, Alberto and {Kurtz}, Michael J. and {Shapurian}, Golnaz and {Henneken}, Edwin and {Grant}, Carolyn S. and {Thompson}, Donna M. and {Chyla}, Roman and {McDonald}, Stephen and {Hostetler}, Timothy W. and {Templeton}, Matthew R. and {Lockhart}, Kelly E. and {Martinovic}, Nemanja and {Chen}, Shinyi and {Tanner}, Chris and {Protopapas}, Pavlos},
        title = "{Building astroBERT, a language model for Astronomy \& Astrophysics}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language, Astrophysics - Instrumentation and Methods for Astrophysics},
         year = 2021,
        month = dec,
          eid = {arXiv:2112.00590},
        pages = {arXiv:2112.00590},
archivePrefix = {arXiv},
       eprint = {2112.00590},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211200590G},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}