---
license: cc
language:
- en
---

# Bio-ELECTRA Mid 1.2m (cased)

A mid-sized (50 million parameters) ELECTRA discriminator model pretrained from scratch for 1.2 million steps on the 2021 baseline
PubMed abstracts and PMC open access papers, with a domain-specific word-piece vocabulary generated from PubMed abstract texts using a
SentencePiece byte-pair-encoding (BPE) model. This model is case-sensitive: it distinguishes between english and English.
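
As a minimal usage sketch, the discriminator can be loaded with the Hugging Face `transformers` library. The repository id below is a
placeholder, not confirmed by this card; substitute the actual model id:

```python
from transformers import AutoTokenizer, ElectraForPreTraining

# Placeholder repository id; replace with this model's actual hub id.
model_name = "bozyurt/bio-electra-mid-cased-1.2m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

# The cased vocabulary distinguishes surface forms such as "TNF" and "tnf".
inputs = tokenizer("BDNF regulates synaptic plasticity.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # per-token replaced-token detection logits
```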

# Intended uses & limitations

This model is mostly intended to be fine-tuned on a downstream biomedical domain task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence to
make decisions, such as classification, information retrieval, relation extraction, or question answering.
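
As an illustration, a sketch of attaching a classification head for fine-tuning; the repository id, label count, and task below are
assumptions, not part of this card:

```python
from transformers import ElectraForSequenceClassification

# Placeholder repository id and label count; adapt to your task.
model = ElectraForSequenceClassification.from_pretrained(
    "bozyurt/bio-electra-mid-cased-1.2m",
    num_labels=2,  # e.g., a binary relation extraction task
)
# The classification head is freshly initialized; fine-tune the whole model
# on labeled biomedical data (e.g., with transformers' Trainer API).
```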

# Training data

The pretraining corpus was built using 21.2 million PubMed abstracts from the January 2021 baseline distribution. To build the corpus,
title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words. The PMC open access corpus (January 2021) is
a 12.3-billion-word corpus built from the sentences extracted from the sections of PMC open access papers,
excluding the references sections.

# Training procedure

The training procedure follows that of the original ELECTRA model.

## Preprocessing

A domain-specific vocabulary of 31,620 word pieces is generated from PubMed abstract texts using a SentencePiece byte-pair-encoding (BPE) model.
The title and abstract text sentences were extracted using an in-house sentence segmenter trained on biomedical text. The sentences are
pre-tokenized using an in-house biomedical tokenizer for proper tokenization of biomedical entities such as gene/protein names,
organisms, antibodies, and cell lines. The SentencePiece BPE vocabulary of word pieces is applied during pre-training
to the properly tokenized and segmented sentences. For the PMC open access corpus, the JATS XML files of the full-text papers are parsed
to extract sections, excluding the reference section, and the section titles and bodies are processed in the same fashion
as the PubMed abstracts corpus.
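
A minimal sketch of generating such a vocabulary with the `sentencepiece` library; the input file name is a placeholder, and the exact
training options used for this model are not documented here:

```python
import sentencepiece as spm

# Placeholder input: one pre-tokenized, segmented sentence per line.
spm.SentencePieceTrainer.train(
    input="pubmed_abstract_sentences.txt",
    model_prefix="bio_electra_bpe",
    vocab_size=31620,  # vocabulary size stated above
    model_type="bpe",  # byte-pair encoding
)

# Segment a sentence into word pieces with the trained model.
sp = spm.SentencePieceProcessor(model_file="bio_electra_bpe.model")
print(sp.encode("BDNF regulates synaptic plasticity.", out_type=str))
```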

## Pretraining

The model is pretrained on a single 8-core v3 tensor processing unit (TPU) with 128 GB of RAM for 1,200,000 steps
with a batch size of 256. The first 1,000,000 steps are pre-trained on PubMed abstracts.
After that, the model is pre-trained for another 200,000 steps on PMC open access papers.
The training parameters were the same as for the original ELECTRA base model. The model has 50M parameters:
12 transformer layers with a hidden size of 512 and 8 attention heads.
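
For reference, a discriminator configuration matching the stated dimensions, expressed with `transformers`' `ElectraConfig`; the
embedding and intermediate sizes are assumptions, since they are not stated above:

```python
from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig(
    vocab_size=31620,        # stated vocabulary size
    hidden_size=512,         # stated hidden layer size
    num_hidden_layers=12,    # stated transformer layers
    num_attention_heads=8,   # stated attention heads
    embedding_size=512,      # assumption, not stated in this card
    intermediate_size=2048,  # assumption: 4 * hidden_size
)
model = ElectraForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # near the stated 50M
```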

# BibTeX entry and citation info

```
@inproceedings{ozyurt-etal-2021-detecting,
    title = "Detecting Anatomical and Functional Connectivity Relations in Biomedical Literature via Language Representation Models",
    author = "Ozyurt, Ibrahim Burak and
      Menke, Joseph and
      Bandrowski, Anita and
      Martone, Maryann",
    editor = "Beltagy, Iz and
      Cohan, Arman and
      Feigenblat, Guy and
      Freitag, Dayne and
      Ghosal, Tirthankar and
      Hall, Keith and
      Herrmannova, Drahomira and
      Knoth, Petr and
      Lo, Kyle and
      Mayr, Philipp and
      Patton, Robert M. and
      Shmueli-Scheuer, Michal and
      de Waard, Anita and
      Wang, Kuansan and
      Wang, Lucy Lu",
    booktitle = "Proceedings of the Second Workshop on Scholarly Document Processing",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.sdp-1.4",
    doi = "10.18653/v1/2021.sdp-1.4",
    pages = "27--35",
    abstract = "Understanding of nerve-organ interactions is crucial to facilitate the development of effective bioelectronic treatments. Towards the end of developing a systematized and computable wiring diagram of the autonomic nervous system (ANS), we introduce a curated ANS connectivity corpus together with several neural language representation model based connectivity relation extraction systems. We also show that active learning guided curation for labeled corpus expansion significantly outperforms randomly selecting connectivity relation candidates minimizing curation effort. Our final relation extraction system achieves $F_1$ = 72.8{\%} on anatomical connectivity and $F_1$ = 74.6{\%} on functional connectivity relation extraction.",
}
```