---
tags:
- spacy
- token-classification
language:
- en
model-index:
- name: en_bib_references_trf
  results:
  - task:
      name: NER
      type: token-classification
    metrics:
    - name: NER Precision
      type: precision
      value: 0.9926182519
    - name: NER Recall
      type: recall
      value: 0.9902421615
    - name: NER F Score
      type: f_score
      value: 0.9914287831
  - task:
      name: SENTS
      type: token-classification
    metrics:
    - name: Sentences F-Score
      type: f_score
      value: 0.9619008264
---
| Feature | Description |
| --- | --- |
| **Name** | `en_bib_references_trf` |
| **Version** | `1.0.1` |
| **spaCy** | `>=3.4.0,<3.5.0` |
| **Default Pipeline** | `transformer`, `senter`, `ner`, `spancat` |
| **Components** | `transformer`, `senter`, `ner`, `spancat` |
| **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
| **Sources** | n/a |
| **License** | n/a |
| **Author** | [Vitaly Davidenko]() |


### Problem to solve

This pipeline parses lists of bibliographic references. It is not required that each reference be on a separate line.
1. The [SentenceRecognizer](https://spacy.io/api/sentencerecognizer) and [SpanCategorizer](https://spacy.io/api/spancategorizer) components split the bibliography section of a scientific paper into separate references.
2. The [NER](https://spacy.io/api/entityrecognizer) component annotates the structure of each reference (see the usage sketch after this list).
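
A minimal usage sketch (assuming the packaged pipeline is installed under its model name; the sample references are made up for illustration):

```python
import spacy

# Load the trained pipeline (assumes the package built from this repo
# is installed in the current environment).
nlp = spacy.load("en_bib_references_trf")

doc = nlp(
    "Smith, J. (2020). A study of citation parsing. Journal of Examples, 12(3), 45-67. "
    "Doe, A. (2019). Another made-up title. Example Press."
)

# Each predicted "sentence" is one reference.
for ref in doc.sents:
    print(ref.text)

# Entities annotate the internal structure of each reference.
for ent in doc.ents:
    print(ent.label_, "->", ent.text)
```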

### Dataset

The [distilroberta-base](https://huggingface.co/distilroberta-base) checkpoint was fine-tuned on artificial data: bibliography sections were generated with [Citation Style Language](https://github.com/citation-style-language/styles) styles from 6000 [citeproc-json](https://citation.crosscite.org/docs.html) files [downloaded](https://github.com/vitaly-d/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset/blob/master/dataset-creation/crossref/crossrefDownload.py) from [CrossRef](https://www.crossref.org). 95 selected styles were used to generate different representations of the bibliography sections.

This work is based on the paper ["GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing"](https://www.semanticscholar.org/paper/GIANT%3A-The-1-Billion-Annotated-Synthetic-Dataset-Grennan-Schibel/8438d1497a01827aa278632f517d3af31fb6bc5a), with code in this [GitHub repo](https://github.com/BeelGroup/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset).
The [modifications](https://github.com/vitaly-d/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset/tree/master/dataset-creation) required to extend this approach to whole bibliography sections, as well as the [code](https://github.com/vitaly-d/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset/tree/master/training-model/references) for training the spaCy pipeline, are in this [GitHub fork](https://github.com/vitaly-d/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset).

```
=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, senter, ner, spancat
91178 training docs
1908 evaluation docs
```


### Preprocessing

Although the end-of-line character `'\n'` would give a useful signal to a model splitting up a bibliography section, creating a balanced artificial dataset with multiline references proved challenging. Instead, the model was trained on data that contains no line-separator characters at all:

```python
import io

lines = io.StringIO(references)
# Normalization: strip each line, drop empty lines, and join the rest
# with single spaces so that no '\n' reaches the model.
norm_doc = nlp(" ".join(line.strip() for line in lines if line.strip()))
```


### Postprocessing

If your data never contains more than one reference per line, you can use the SpanCat scores to estimate whether the next line starts a new reference or continues the current multiline one. See the [code](https://huggingface.co/spaces/vitaly/bibliography-parser/blob/main/app.py) for details.
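
A rough sketch of this line-merging idea (illustrative only, not the exact logic of the linked Space; it assumes the spancat scores are exposed via the span group's `attrs` under the default `"sc"` key):

```python
def split_references(nlp, lines, threshold=0.5):
    """Merge physical lines into references using 'bib' span scores."""
    refs, current = [], ""
    for line in (raw.strip() for raw in lines):
        if not line:
            continue
        if not current:
            current = line
            continue
        candidate = current + " " + line
        doc = nlp(candidate)
        group = doc.spans["sc"] if "sc" in doc.spans else []
        scores = group.attrs.get("scores", []) if len(group) else []
        # Is there a confident 'bib' span starting where the new line begins?
        starts_new_ref = any(
            span.start_char >= len(current) + 1 and score >= threshold
            for span, score in zip(group, scores)
        )
        if starts_new_ref:
            refs.append(current)   # the new line opens the next reference
            current = line
        else:
            current = candidate    # the new line continues the current one
    if current:
        refs.append(current)
    return refs
```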

### Spaces App

[Bibliography Parser](https://huggingface.co/spaces/vitaly/bibliography-parser)

### Label Scheme

Essentially, the pipeline performs token classification.

- NER labels come from non-overlapping CSL tags.
- SentenceRecognizer: `Token.is_sent_start = 1` is set for the first token of each reference.
- SpanCategorizer: a `bib` span starts at the first token of each reference. It is an alternative to the SentenceRecognizer that also returns scores.

<details>

<summary>View label scheme (13 labels for 2 components)</summary>

| Component | Labels |
| --- | --- |
| **`ner`** | `citation-label`, `citation-number`, `container-title`, `doi`, `family`, `given`, `issued`, `page`, `publisher`, `title`, `url`, `volume` |
| **`spancat`** | `bib` |

</details>
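
The label scheme can also be inspected programmatically (assuming the pipeline is loaded as `nlp`, as in the usage sketch above):

```python
# Labels exposed by the trained components.
print(nlp.get_pipe("ner").labels)      # ('citation-label', 'citation-number', ...)
print(nlp.get_pipe("spancat").labels)  # ('bib',)
```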

### Accuracy

| Type | Score |
| --- | --- |
| `SENTS_F` | 96.19 |
| `SENTS_P` | 97.36 |
| `SENTS_R` | 95.04 |
| `ENTS_F` | 99.14 |
| `ENTS_P` | 99.26 |
| `ENTS_R` | 99.02 |
| `SPANS_SC_F` | 98.47 |
| `SPANS_SC_P` | 99.87 |
| `SPANS_SC_R` | 97.10 |
| `TRANSFORMER_LOSS` | 1042090.07 |
| `SENTER_LOSS` | 1079996.00 |
| `NER_LOSS` | 931993.00 |
| `SPANCAT_LOSS` | 119923.94 |