bozyurt committed on
Commit ad66221
1 Parent(s): 6ad5442

Update README.md

Files changed (1)
  1. README.md +8 -4
README.md CHANGED
@@ -16,7 +16,9 @@ make decisions, such as classification, information retrieval, relation extraction
 # Training data
 
 The pretraining corpus was built using 21.2 million PubMed abstracts from the January 2021 baseline distribution. To build the corpus,
- title and abstract text sentences were extracted resulting in a corpus of 3.6 billion words.
+ title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words. The PMC open access corpus (January 2021) is
+ a 12.3 billion word corpus built using the sentences extracted from the sections of PMC open access papers,
+ excluding the reference sections.
 
 # Training procedure
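The training-data hunk above summarizes how the abstracts corpus was assembled; the actual extraction pipeline is in-house and not part of this commit. As a rough illustration only, the sketch below streams titles and abstracts out of a single PubMed baseline file. The file name and element paths follow the standard PubMed XML layout and are assumptions, not the authors' code.

```python
# Illustrative sketch only; the corpus-building code is in-house and not part of this repo.
# Assumes a gzipped PubMed baseline file (e.g. pubmed21n0001.xml.gz) with the standard
# layout: PubmedArticle -> MedlineCitation -> Article -> ArticleTitle / Abstract.
import gzip
import xml.etree.ElementTree as ET


def iter_title_abstract(path):
    """Yield (title, abstract) text pairs from one PubMed baseline XML file."""
    with gzip.open(path, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            title_el = elem.find("MedlineCitation/Article/ArticleTitle")
            abstract_els = elem.findall("MedlineCitation/Article/Abstract/AbstractText")
            title = "".join(title_el.itertext()) if title_el is not None else ""
            abstract = " ".join("".join(a.itertext()) for a in abstract_els)
            if title or abstract:
                yield title, abstract
            elem.clear()  # keep memory bounded while streaming millions of records


if __name__ == "__main__":
    words = 0
    for title, abstract in iter_title_abstract("pubmed21n0001.xml.gz"):
        words += len(title.split()) + len(abstract.split())
    print("approximate word count:", words)
```

Sentence segmentation and biomedical pre-tokenization, covered in the next hunk, are separate in-house steps that would run on top of this raw text.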
 
@@ -28,14 +30,16 @@ A domain specific vocabulary of size 31,620 is generated using SentencePiece byte pair encoding
 The title and abstract text sentences were extracted using an in-house sentence segmenter trained on biomedical text. The sentences are
 pre-tokenized using an in-house biomedical tokenizer for proper tokenization of biomedical entities such as gene/protein names,
 organisms, antibodies, and cell lines. The SentencePiece BPE vocabulary of word pieces is applied during pre-training
- to the properly tokenized and segmented sentences.
+ to the properly tokenized and segmented sentences. For the PMC open access corpus, the JATS XML files of the full-text papers are parsed
+ to extract the sections, excluding the reference section, and each section title and section body is processed in the same fashion
+ as the PubMed abstracts corpus.
 
 ## Pretraining
 
 The model is pretrained on a single 8-core version 3 tensor processing unit (TPU) with 128 GB of RAM for 1,200,000 steps
 with a batch size of 256. The first 1,000,000 steps are pre-trained on PubMed abstracts.
- After that the model is pre-trained for another 200,000 steps on PMC open access papers.
- The training paprameters were the same as the original ELECTRA base model. The model has 110M parameters,
+ After that, the model is pre-trained for another 200,000 steps on PMC open access papers.
+ The training parameters were the same as those of the original ELECTRA base model. The model has 110M parameters,
 12 transformer layers with a hidden size of 768, and 12 attention heads.
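The preprocessing change above says that, for the PMC open access corpus, the JATS XML full texts are parsed section by section with the reference list left out. The authors' parser is in-house; the snippet below is only a minimal sketch of that idea using standard JATS element names (`body`, `sec`, `title`, `p`), and the input file name is hypothetical.

```python
# Minimal sketch, not the authors' parser: collect (section title, section text) pairs
# from a PMC open access JATS XML file. References live in <back>/<ref-list>, outside
# <body>, so walking only <body> sections already leaves them out.
import xml.etree.ElementTree as ET


def iter_sections(jats_path):
    """Yield (section_title, section_text) pairs from a JATS full-text XML file."""
    root = ET.parse(jats_path).getroot()
    body = root.find(".//body")
    if body is None:
        return
    for sec in body.iter("sec"):
        title_el = sec.find("title")
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        paragraphs = [" ".join("".join(p.itertext()).split()) for p in sec.findall("p")]
        text = " ".join(paragraphs)
        if text:
            yield title, text


if __name__ == "__main__":
    for title, text in iter_sections("PMC0000001.xml"):  # hypothetical input file
        print(title or "(untitled section)", "->", len(text.split()), "words")
```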
 
 
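The second hunk's header also mentions the domain-specific vocabulary of 31,620 word pieces generated with SentencePiece byte pair encoding. The corpus file and every option other than the vocabulary size and model type below are assumptions rather than values from the model card; this is just a sketch of how such a vocabulary is typically trained with the `sentencepiece` package.

```python
# Hedged sketch: train a 31,620-piece BPE vocabulary with SentencePiece.
# "corpus_sentences.txt" (one pre-tokenized sentence per line) and all options other
# than vocab_size / model_type are assumptions, not taken from the model card.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sentences.txt",      # hypothetical path to the segmented corpus
    model_prefix="bio_electra_vocab",  # hypothetical output prefix
    vocab_size=31620,                  # vocabulary size stated in the model card
    model_type="bpe",                  # byte pair encoding, as stated
    character_coverage=1.0,            # assumption; common for English biomedical text
)

sp = spm.SentencePieceProcessor(model_file="bio_electra_vocab.model")
print(sp.encode("EGFR T790M mutation confers resistance to gefitinib.", out_type=str))
```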
24
 
 
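As a rough sanity check on the architecture figures quoted in the diff (110M parameters, 12 transformer layers, hidden size 768, 12 attention heads, 31,620-piece vocabulary), the sketch below builds a matching ELECTRA-base style discriminator with Hugging Face `transformers` and counts its parameters. Every value not stated in the card is set to a common base-size default, which is an assumption about the released model.

```python
# Hedged sanity check: an ELECTRA-base style discriminator with the sizes quoted above.
# Values not quoted in the README (embedding size, feed-forward width, ...) are common
# base-size defaults and may differ from the actual released model.
from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig(
    vocab_size=31620,          # domain-specific SentencePiece BPE vocabulary
    embedding_size=768,        # base-size embeddings (assumption; library default is ELECTRA-small)
    hidden_size=768,           # hidden layer size from the card
    num_hidden_layers=12,      # 12 transformer layers
    num_attention_heads=12,    # 12 attention heads
    intermediate_size=3072,    # standard base-size feed-forward width (assumption)
)
model = ElectraForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
# Expect roughly 110M, in line with the model card.
```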