# Training data

The pretraining corpus was built using 21.2 million PubMed abstracts from the January 2021 baseline distribution. To build the corpus, title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words. The PMC open access corpus (January 2021) is a 12.3-billion-word corpus built from the sentences extracted from the sections of PMC open access papers, excluding the references sections.
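For illustration, the PubMed baseline is distributed as gzipped XML files; the sketch below shows one way to pull title and abstract text out of a baseline file. The file name is illustrative, and the in-house sentence segmenter described in this card is not reproduced here.

```python
import gzip
import xml.etree.ElementTree as ET

def iter_title_abstract(path):
    """Yield (title, abstract) text pairs from one gzipped PubMed baseline file."""
    with gzip.open(path, "rb") as handle:
        tree = ET.parse(handle)
    for article in tree.iter("PubmedArticle"):
        title = article.findtext(".//ArticleTitle") or ""
        # An abstract may be split across several labeled AbstractText blocks.
        abstract = " ".join(
            "".join(node.itertext()) for node in article.iter("AbstractText")
        )
        if title or abstract:
            yield title, abstract

# Illustrative file name from the January 2021 baseline distribution.
for title, abstract in iter_title_abstract("pubmed21n0001.xml.gz"):
    print(title)
```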
# Training procedure

A domain specific vocabulary of size 31,620 is generated using SentencePiece byte-pair encoding (BPE).
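As a rough sketch, a vocabulary of this size can be trained with the `sentencepiece` Python package; the input file name and model prefix below are illustrative, and this card does not state the actual training options used.

```python
import sentencepiece as spm

# Train a BPE vocabulary over the corpus; "corpus.txt" is an illustrative
# file with one pre-tokenized sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="biomed_bpe",
    vocab_size=31620,   # domain-specific vocabulary size stated above
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="biomed_bpe.model")
print(sp.encode("EGFR mutations in non-small cell lung cancer", out_type=str))
```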
The title and abstract text sentences were extracted using an in-house sentence segmenter trained on biomedical text. The sentences are pre-tokenized using an in-house biomedical tokenizer for proper tokenization of biomedical entities such as gene/protein names, organisms, antibodies, and cell lines. The SentencePiece BPE vocabulary of word pieces is applied during pre-training to the properly tokenized and segmented sentences. For the PMC open access corpus, JATS XML files for the full-text papers are parsed to extract the sections, excluding the references section; each section title and section body is processed in the same fashion as the PubMed abstracts corpus.
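A minimal sketch of that JATS parsing step, assuming standard JATS tag names: because the reference list lives under `<back>/<ref-list>` rather than `<body>`, walking only the `<body>` sections skips it naturally. The file name is illustrative.

```python
import xml.etree.ElementTree as ET

def iter_sections(jats_path):
    """Yield (section_title, section_text) pairs from a JATS full-text XML file."""
    body = ET.parse(jats_path).getroot().find(".//body")
    if body is None:
        return
    for sec in body.iter("sec"):
        title_node = sec.find("title")
        title = "".join(title_node.itertext()) if title_node is not None else ""
        # Join the section's paragraphs; <back>/<ref-list> is never visited.
        text = " ".join("".join(p.itertext()) for p in sec.findall("p"))
        yield title, text

# Illustrative file name for a PMC open access paper.
for title, text in iter_sections("PMC1234567.xml"):
    print(title)
```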
## Pretraining

The model is pretrained on a single 8-core v3 tensor processing unit (TPU) with 128 GB of RAM for 1,200,000 steps with a batch size of 256. The first 1,000,000 steps are pre-trained on PubMed abstracts. After that, the model is pre-trained for another 200,000 steps on PMC open access papers. The training parameters were the same as for the original ELECTRA base model. The model has 110M parameters: 12 transformer layers with a hidden size of 768 and 12 attention heads.
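For reference, these dimensions match the ELECTRA base discriminator; below is a sketch of that shape using Hugging Face's `ElectraConfig`. The field names come from that library, and `embedding_size` and `intermediate_size` use the standard ELECTRA base values, which are assumptions not stated in this card.

```python
from transformers import ElectraConfig, ElectraForPreTraining

# ELECTRA base discriminator shape described above.
config = ElectraConfig(
    vocab_size=31620,        # domain-specific SentencePiece BPE vocabulary
    embedding_size=768,      # standard ELECTRA base value (assumption)
    hidden_size=768,
    num_hidden_layers=12,    # 12 transformer layers
    num_attention_heads=12,
    intermediate_size=3072,  # standard ELECTRA base value (assumption)
)
model = ElectraForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```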