---
language: en
tags:
- science
- multi-disciplinary
license: apache-2.0
---

# ScholarBERT_10 Model

This is the **ScholarBERT_10** variant of the ScholarBERT model family.

The model is pretrained on a large collection of scientific research articles (**22.1B tokens**).

This is a **cased** (case-sensitive) model: the tokenizer does not lower-case input by default.

The model is based on the same architecture as [BERT-large](https://huggingface.co/bert-large-cased) and has a total of 340M parameters.

# Model Architecture

| Hyperparameter   |  Value  |
|------------------|:-------:|
| Layers           |   24    |
| Hidden Size      |  1024   |
| Attention Heads  |   16    |
| Total Parameters |  340M   |

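The 340M figure can be sanity-checked from the hyperparameters above. A rough count, assuming the standard BERT-large configuration for the values this card does not state (WordPiece vocabulary of 30,522, 512 positions, feed-forward size of 4× hidden) — the vocabulary size in particular is an assumption, since ScholarBERT trains its own vocab:

```python
# Rough parameter count for a BERT-large-style encoder.
# Assumed (not stated in this card): vocab size 30522, max positions 512,
# FFN size 4 * hidden. ScholarBERT uses its own vocabulary, so the
# embedding term is approximate.
layers, hidden, heads = 24, 1024, 16
vocab, positions, ffn = 30_522, 512, 4 * hidden

embeddings = (vocab + positions + 2) * hidden    # word + position + segment
attention  = 4 * (hidden * hidden + hidden)      # Q, K, V, output projections
ffn_params = (hidden * ffn + ffn) + (ffn * hidden + hidden)
layer      = attention + ffn_params + 4 * hidden # plus two LayerNorms
pooler     = hidden * hidden + hidden

total = embeddings + 2 * hidden + layers * layer + pooler
print(f"{total / 1e6:.1f}M")  # ~335M, consistent with the 340M quoted above
```

Almost all of the budget sits in the 24 transformer layers (~302M); the embedding table contributes only ~32M, which is why the total is insensitive to the exact vocabulary size.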
# Training Dataset

The vocab and the model are pretrained on **10% of the PRD** scientific literature dataset.

The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”),
a nonprofit organization based in California. The dataset was constructed from a corpus
of journal article files, from which we successfully extracted text from 75,496,055 articles from 178,928 journals.
The articles span Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences,
Social Sciences, and Technology. The distribution of articles is shown below.

![corpus pie chart](https://huggingface.co/globuslabs/ScholarBERT/resolve/main/corpus_pie_chart.png)

# BibTeX entry and citation info

If using this model, please cite this paper:

[To be added]