# Training data

The pretraining corpus was built using 21.2 million PubMed abstracts from the January 2021 baseline distribution. To build the corpus, title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words. The PMC open access corpus (January 2021) is a 12.3-billion-word corpus built from the sentences extracted from the sections of PMC open access papers, excluding the references sections.
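For illustration, the PubMed baseline is distributed as gzipped XML files; the sketch below shows one way to pull title and abstract text out of a baseline file. The file name is illustrative, and the in-house sentence segmenter described in this card is not reproduced here.

```python
import gzip
import xml.etree.ElementTree as ET

def iter_title_abstract(path):
    """Yield (title, abstract) text pairs from one gzipped PubMed baseline file."""
    with gzip.open(path, "rb") as handle:
        tree = ET.parse(handle)
    for article in tree.iter("PubmedArticle"):
        title = article.findtext(".//ArticleTitle") or ""
        # An abstract may be split across several labeled AbstractText blocks.
        abstract = " ".join(
            "".join(node.itertext()) for node in article.iter("AbstractText")
        )
        if title or abstract:
            yield title, abstract

# Illustrative file name from the January 2021 baseline distribution.
for title, abstract in iter_title_abstract("pubmed21n0001.xml.gz"):
    print(title)
```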
# Training procedure

A domain specific vocabulary of size 31,620 is generated using SentencePiece byte-pair encoding (BPE).
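As a rough sketch, a vocabulary of this size can be trained with the `sentencepiece` Python package; the input file name and model prefix below are illustrative, and this card does not state the actual training options used.

```python
import sentencepiece as spm

# Train a BPE vocabulary over the corpus; "corpus.txt" is an illustrative
# file with one pre-tokenized sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="biomed_bpe",
    vocab_size=31620,   # domain-specific vocabulary size stated above
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="biomed_bpe.model")
print(sp.encode("EGFR mutations in non-small cell lung cancer", out_type=str))
```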
The title and abstract text sentences were extracted using an in-house sentence segmenter trained on biomedical text. The sentences are pre-tokenized using an in-house biomedical tokenizer for proper tokenization of biomedical entities such as gene/protein names, organisms, antibodies, and cell lines. The SentencePiece BPE vocabulary of word pieces is applied during pre-training to the properly tokenized and segmented sentences. For the PMC open access corpus, JATS XML files for the full-text papers are parsed to extract the sections, excluding the references section; each section title and section body is processed in the same fashion as the PubMed abstracts corpus.
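A minimal sketch of that JATS parsing step, assuming standard JATS tag names: because the reference list lives under `<back>/<ref-list>` rather than `<body>`, walking only the `<body>` sections skips it naturally. The file name is illustrative.

```python
import xml.etree.ElementTree as ET

def iter_sections(jats_path):
    """Yield (section_title, section_text) pairs from a JATS full-text XML file."""
    body = ET.parse(jats_path).getroot().find(".//body")
    if body is None:
        return
    for sec in body.iter("sec"):
        title_node = sec.find("title")
        title = "".join(title_node.itertext()) if title_node is not None else ""
        # Join the section's paragraphs; <back>/<ref-list> is never visited.
        text = " ".join("".join(p.itertext()) for p in sec.findall("p"))
        yield title, text

# Illustrative file name for a PMC open access paper.
for title, text in iter_sections("PMC1234567.xml"):
    print(title)
```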
## Pretraining

The model is pretrained on a single 8-core v3 tensor processing unit (TPU) with 128 GB of RAM for 1,200,000 steps with a batch size of 256. The first 1,000,000 steps are pre-trained on PubMed abstracts. After that, the model is pre-trained for another 200,000 steps on PMC open access papers. The training parameters were the same as for the original ELECTRA base model. The model has 110M parameters: 12 transformer layers with a hidden size of 768 and 12 attention heads.
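For reference, these dimensions match the ELECTRA base discriminator; below is a sketch of that shape using Hugging Face's `ElectraConfig`. The field names come from that library, and `embedding_size` and `intermediate_size` use the standard ELECTRA base values, which are assumptions not stated in this card.

```python
from transformers import ElectraConfig, ElectraForPreTraining

# ELECTRA base discriminator shape described above.
config = ElectraConfig(
    vocab_size=31620,        # domain-specific SentencePiece BPE vocabulary
    embedding_size=768,      # standard ELECTRA base value (assumption)
    hidden_size=768,
    num_hidden_layers=12,    # 12 transformer layers
    num_attention_heads=12,
    intermediate_size=3072,  # standard ELECTRA base value (assumption)
)
model = ElectraForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```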