---
license: cc
---

# Bio-ELECTRA Base 1m (cased)

Pretrained (from scratch for one million steps) ELECTRA discriminator model on 2021 Base PubMed abstracts, with a domain-specific word piece vocabulary generated using a SentencePiece byte-pair-encoding (BPE) model trained on PubMed abstract texts. This model is case-sensitive: it makes a difference between english and English.

This model is mostly intended to be fine-tuned on a downstream biomedical domain task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence to
make decisions, such as classification, information retrieval, relation extraction or question answering.
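
The checkpoint can be loaded with the Hugging Face transformers library, as in the sketch below. The repository id `bozyurt/bio-electra-base-1m-cased` is a hypothetical placeholder; substitute the actual id of this model card.

```
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical repository id; replace with this model card's actual id.
model_id = "bozyurt/bio-electra-base-1m-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a biomedical sentence and extract contextual embeddings.
inputs = tokenizer("EGFR mutations are common in lung adenocarcinoma.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```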

# Training data

The pretraining corpus was built from 21.2 million PubMed abstracts in the January 2021 baseline distribution. Title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words.

# Training procedure

The training procedure follows the original ELECTRA pretraining setup.

## Preprocessing

A domain-specific vocabulary of size 31,620 was generated using a SentencePiece byte-pair-encoding (BPE) model trained on PubMed abstract texts. The title and abstract text sentences were extracted using an in-house sentence segmenter trained on biomedical text, and the sentences were pre-tokenized using an in-house biomedical tokenizer for proper handling of biomedical entities such as gene/protein names, organisms, antibodies, and cell lines. The SentencePiece BPE word piece vocabulary was then applied to the tokenized and segmented sentences during pretraining.
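
As a rough illustration of this vocabulary-generation step, the sketch below builds a BPE vocabulary of the stated size with the sentencepiece Python package; the input file name is hypothetical, and the in-house segmenter and tokenizer described above are not reproduced here.

```
import sentencepiece as spm

# Hypothetical input file: one pre-tokenized PubMed sentence per line.
spm.SentencePieceTrainer.train(
    input="pubmed_sentences.txt",
    model_prefix="bio_electra_bpe",
    model_type="bpe",
    vocab_size=31620,  # vocabulary size reported in this card
)

# Tokenize a sample sentence with the resulting word piece model.
sp = spm.SentencePieceProcessor(model_file="bio_electra_bpe.model")
print(sp.encode("EGFR mutations in lung adenocarcinoma", out_type=str))
```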

## Pretraining

The model was pretrained on a single 8-core v3 tensor processing unit (TPU) with 128 GB of RAM for 1,000,000 steps with a batch size of 256. The training parameters were the same as those of the original ELECTRA base model. The model has 110M parameters: 12 transformer layers with a hidden size of 768 and 12 attention heads.
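
For reference, the architecture above corresponds to a base-size ELECTRA configuration. The sketch below expresses it with transformers' ElectraConfig; hyperparameters not stated in this card (e.g., the feed-forward size) are assumed to match the standard base-model defaults.

```
from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig(
    vocab_size=31620,        # domain-specific BPE vocabulary
    hidden_size=768,         # hidden layer size
    num_hidden_layers=12,    # 12 transformer layers
    num_attention_heads=12,  # 12 attention heads
    intermediate_size=3072,  # assumed base-model feed-forward size
)
discriminator = ElectraForPreTraining(config)
print(sum(p.numel() for p in discriminator.parameters()))  # roughly 110M
```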

# BibTeX entry and citation info

```
@inproceedings{ozyurt-etal-2021-detecting,
    title = "Detecting Anatomical and Functional Connectivity Relations in Biomedical Literature via Language Representation Models",
    author = "Ozyurt, Ibrahim Burak and
      Menke, Joseph and
      Bandrowski, Anita and
      Martone, Maryann",
    editor = "Beltagy, Iz and
      Cohan, Arman and
      Feigenblat, Guy and
      Freitag, Dayne and
      Ghosal, Tirthankar and
      Hall, Keith and
      Herrmannova, Drahomira and
      Knoth, Petr and
      Lo, Kyle and
      Mayr, Philipp and
      Patton, Robert M. and
      Shmueli-Scheuer, Michal and
      de Waard, Anita and
      Wang, Kuansan and
      Wang, Lucy Lu",
    booktitle = "Proceedings of the Second Workshop on Scholarly Document Processing",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.sdp-1.4",
    doi = "10.18653/v1/2021.sdp-1.4",
    pages = "27--35",
    abstract = "Understanding of nerve-organ interactions is crucial to facilitate the development of effective bioelectronic treatments. Towards the end of developing a systematized and computable wiring diagram of the autonomic nervous system (ANS), we introduce a curated ANS connectivity corpus together with several neural language representation model based connectivity relation extraction systems. We also show that active learning guided curation for labeled corpus expansion significantly outperforms randomly selecting connectivity relation candidates minimizing curation effort. Our final relation extraction system achieves $F_1$ = 72.8{\%} on anatomical connectivity and $F_1$ = 74.6{\%} on functional connectivity relation extraction.",
}
```