metadata
tags:
  - spacy
  - token-classification
language:
  - en
model-index:
  - name: en_bib_references_trf
    results:
      - task:
          name: NER
          type: token-classification
        metrics:
          - name: NER Precision
            type: precision
            value: 0.9926182519
          - name: NER Recall
            type: recall
            value: 0.9902421615
          - name: NER F Score
            type: f_score
            value: 0.9914287831
      - task:
          name: SENTS
          type: token-classification
        metrics:
          - name: Sentences F-Score
            type: f_score
            value: 0.9619008264
| Feature | Description |
| --- | --- |
| Name | en_bib_references_trf |
| Version | 1.0.1 |
| spaCy | >=3.4.0,<3.5.0 |
| Default Pipeline | transformer, senter, ner, spancat |
| Components | transformer, senter, ner, spancat |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | n/a |
| License | n/a |
| Author | Vitaly Davidenko |

Problem to solve

This pipeline parses a list of bibliographic references. The references do not have to be on separate lines.

  1. The SentenceRecognizer and SpanCategorizer components are included in the pipeline to split the bibliography section of a scientific paper into separate references.
  2. The NER component annotates the structure of each reference.
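
The two steps above can be sketched as follows. This is a minimal illustration, not the author's code: the function names `references_from_doc` and `parse_bibliography` are hypothetical, and it assumes the package `en_bib_references_trf` is installed so that `spacy.load` can find it. Sentence boundaries from the pipeline are treated as reference boundaries, and the entities inside each sentence are grouped by label.

```python
from collections import defaultdict


def references_from_doc(doc):
    """Group NER entities by reference.

    Each sentence produced by the pipeline is treated as one reference.
    Returns a list of {"text": ..., "fields": {label: [texts]}} dicts.
    """
    parsed = []
    for sent in doc.sents:
        fields = defaultdict(list)
        for ent in sent.ents:
            fields[ent.label_].append(ent.text)
        parsed.append({"text": sent.text, "fields": dict(fields)})
    return parsed


def parse_bibliography(text, model_name="en_bib_references_trf"):
    """Hypothetical convenience wrapper: normalize, run the pipeline, group."""
    import spacy  # imported lazily so references_from_doc works without spaCy

    nlp = spacy.load(model_name)
    # Same normalization as in the Preprocessing section: no line separators.
    normalized = " ".join(line.strip() for line in text.splitlines() if line.strip())
    return references_from_doc(nlp(normalized))
```

A label such as `family` can occur several times in one reference, which is why each field holds a list rather than a single string.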

Dataset

The distilroberta-base checkpoint has been fine-tuned on artificial data: bibliography sections were generated with the Citation Style Language (CSL) from 6000 citeproc-json files downloaded from CrossRef. 95 selected styles were used to generate different representations of the bibliography sections.

This work is based on the paper "GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing", with code in this GitHub repo. The modifications required to extend this approach to whole bibliography sections, as well as the code for training the spaCy pipeline, are in this GitHub fork.

=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, senter, ner, spancat
91178 training docs
1908 evaluation docs

Preprocessing

Although the end-of-line character '\n' looks like a useful signal for a model that splits up the bibliography section, it proved challenging to create a balanced artificial dataset with multiline references. The model was therefore trained on data that contains no line separator characters at all, so input text should be normalized the same way:

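import io
import spacy
nlp = spacy.load("en_bib_references_trf")
lines = io.StringIO(references)  # `references` holds the raw text of the bibliography section
# normalization applied: strip lines, drop empty ones, and join the rest with single spaces
norm_doc = nlp(" ".join(line.strip() for line in lines if line.strip()))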

Postprocessing

If your data never contains more than one reference per line, you can use the SpanCat scores to estimate whether the next line starts a new reference or continues the current multiline reference. See the code for additional details.
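
A minimal sketch of this idea follows. It is not the author's implementation, and several things are assumptions: the function names are hypothetical, the spans key `"sc"` is inferred from the `SPANS_SC_*` metric names (it is also spaCy's default), the per-span scores are read from `doc.spans["sc"].attrs["scores"]` as spaCy's SpanCategorizer stores them, and the `0.5` threshold is arbitrary. Because the input was joined with single spaces, the character offset of each line in the normalized text can be recomputed and looked up among the predicted 'bib' spans.

```python
def bib_scores_by_char(doc, spans_key="sc"):
    """Map the start character of each predicted 'bib' span to its score."""
    group = doc.spans[spans_key]
    return {
        span.start_char: float(score)
        for span, score in zip(group, group.attrs["scores"])
        if span.label_ == "bib"
    }


def merge_multiline_references(lines, score_at_char, threshold=0.5):
    """Merge lines into references, assuming at most one reference per line.

    `lines` are the original lines; they are stripped and joined with single
    spaces exactly as in the preprocessing step, so offsets stay aligned.
    A line starts a new reference if the 'bib' score at its first character
    reaches `threshold`; otherwise it continues the previous reference.
    """
    references = []
    offset = 0
    for line in (l.strip() for l in lines if l.strip()):
        score = score_at_char.get(offset, 0.0)
        if not references or score >= threshold:
            references.append(line)
        else:
            references[-1] += " " + line
        offset += len(line) + 1  # +1 for the joining space
    return references
```

Typical use would be `merge_multiline_references(raw_lines, bib_scores_by_char(norm_doc))`, where `norm_doc` is the document produced in the preprocessing step.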

Spaces App

Bibliography Parser

Label Scheme

Essentially, the pipeline performs a token classification task.

  • NER: labels come from non-overlapping CSL tags.
  • SentenceRecognizer: Token.is_sent_start=1 is set for the first token of each reference.
  • SpanCategorizer: a 'bib' span is created for the first token of each reference. It is an alternative to SentenceRecognizer that returns scores.
Label scheme (13 labels for 2 components)

| Component | Labels |
| --- | --- |
| ner | citation-label, citation-number, container-title, doi, family, given, issued, page, publisher, title, url, volume |
| spancat | bib |
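
To make the two boundary encodings concrete, the sketch below derives them for a list of already separated references. It is only an illustration under assumptions: `boundary_labels` is a hypothetical name, whitespace splitting stands in for the spaCy tokenizer, and marking non-initial tokens with `0` is an assumption (the card only states that the first token gets `1`).

```python
def boundary_labels(references):
    """Build token-level boundary annotations for concatenated references.

    Returns (tokens, sent_starts, bib_spans) where
    - sent_starts[i] is 1 for the first token of a reference, else 0
    - bib_spans holds one (start, end, "bib") token span per reference,
      covering only its first token.
    """
    tokens, sent_starts, bib_spans = [], [], []
    for reference in references:
        ref_tokens = reference.split()  # stand-in for the real tokenizer
        if not ref_tokens:
            continue
        start = len(tokens)
        bib_spans.append((start, start + 1, "bib"))
        for i, token in enumerate(ref_tokens):
            tokens.append(token)
            sent_starts.append(1 if i == 0 else 0)
    return tokens, sent_starts, bib_spans


tokens, sent_starts, bib_spans = boundary_labels(
    ["A. Smith 2020.", "B. Jones 2021."]
)
print(list(zip(tokens, sent_starts)))
print(bib_spans)
```

Both encodings mark the same positions. The difference is that the span categorizer attaches a score to each predicted span, which makes thresholding possible.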

Accuracy

| Type | Score |
| --- | --- |
| SENTS_F | 96.19 |
| SENTS_P | 97.36 |
| SENTS_R | 95.04 |
| ENTS_F | 99.14 |
| ENTS_P | 99.26 |
| ENTS_R | 99.02 |
| SPANS_SC_F | 98.47 |
| SPANS_SC_P | 99.87 |
| SPANS_SC_R | 97.10 |
| TRANSFORMER_LOSS | 1042090.07 |
| SENTER_LOSS | 1079996.00 |
| NER_LOSS | 931993.00 |
| SPANCAT_LOSS | 119923.94 |
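
As a quick consistency check, each F-score above should be the harmonic mean of the corresponding precision and recall, F = 2PR / (P + R). The short sketch below recomputes it from the reported values; the helper name `f_score` is ours, not part of spaCy.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) taken from the accuracy scores above
reported = {
    "SENTS": (97.36, 95.04, 96.19),
    "ENTS": (99.26, 99.02, 99.14),
    "SPANS_SC": (99.87, 97.10, 98.47),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_score(p, r):.2f}, reported F = {f:.2f}")
```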