globuslabs committed
Commit
f3806b3
1 Parent(s): ba39e9e

Update README

Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -10,7 +10,7 @@ license: apache-2.0
 
 This is the **ScholarBERT_10_WB** variant of the ScholarBERT model family.
 
-The model is pretrained on a large collection of scientific research articles (**2.2B tokens**).
+The model is pretrained on a large collection of scientific research articles (**22.1B tokens**).
 Additionally, the pretraining data also includes the Wikipedia+BookCorpus, which are used to pretrain the [BERT-base](https://huggingface.co/bert-base-cased) and [BERT-large](https://huggingface.co/bert-large-cased) models.
 
 This is a **cased** (case-sensitive) model. The tokenizer will not convert all inputs to lower-case by default.
@@ -30,7 +30,7 @@ The model is based on the same architecture as [BERT-large](https://huggingface.
 
 # Training Dataset
 
-The vocab and the model are pretrained on **100% of the PRD** scientific literature dataset.
+The vocab and the model are pretrained on **10% of the PRD** scientific literature dataset and Wikipedia+BookCorpus.
 
 The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”),
 a nonprofit organization based in California. This dataset was constructed from a corpus
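
As a quick sanity check of the cased-tokenizer note in the updated card, here is a minimal sketch. The Hub repo ID `globuslabs/ScholarBERT_10_WB` is an assumption based on the committing organization and the variant name; adjust it if the model is published elsewhere.

```python
# Minimal sketch: illustrate the case-sensitive tokenizer behavior described in the README.
# The repo ID below is assumed, not confirmed by this commit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("globuslabs/ScholarBERT_10_WB")

# A cased tokenizer preserves capitalization, so "Protein" and "protein"
# tokenize differently instead of being lower-cased first.
print(tokenizer.tokenize("Protein folding"))
print(tokenizer.tokenize("protein folding"))
```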