update readme
README.md
CHANGED
@@ -1,9 +1,19 @@
|
|
1 |
-
# Material SciBERT (TPU)
|
2 |
-
|
3 |
-
Goal: Improving language understanding in materials science
|
4 |
|
5 |
**Work in progress**
|
6 |
|
7 |
## Related work
|
8 |
|
9 |
### BERT Implementations
|
|
|
1 |
+
# Material SciBERT (TPU): Improving language understanding in materials science
|
|
|
|
|
2 |
|
3 |
**Work in progress**
|
4 |
|
5 |
+
## Introduction
|
6 |
+
A SciBERT-based model pre-trained on full-text scientific articles in materials science.
|
7 |
+
|
8 |
+
## TLDR
|
9 |
+
- Collected the full text of ~700,000 articles provided by the National Institute for Materials Science (NIMS) TDM platform (https://dice.nims.go.jp/services/TDM-PF/en/); the resulting dataset is called ScienceCorpus (SciCorpus)
|
10 |
+
- Added 100 domain-specific words unknown to the SciBERT vocabulary (32k tokens), extracted from SciCorpus with a keyword extraction model (KeyBERT)
|
11 |
+
- Starting conditions: original SciBERT weights
|
12 |
+
- Pre-trained the MatTpuSciBERT model on Google Cloud with TPUs (Tensor Processing Units) as follows:
|
13 |
+
  - 800,000 steps with batch_size: 256, max_seq_length: 512
|
14 |
+
  - 100,000 steps with batch_size: 2048, max_seq_length: 128
|
15 |
+
- Fine-tuning and evaluation on named entity recognition (NER) for superconductors (https://github.com/lfoppiano/grobid-superconductors) and physical quantities (https://github.com/kermitt2/grobid-quantities)
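
To put the two pre-training phases listed above side by side, the following minimal sketch works out how many sequences each phase processes and an upper bound on the token positions it covers. It assumes that `batch_size` is the global batch size per step and that every sequence is padded to `max_seq_length`; the `Phase` class and its property names are hypothetical helpers for this illustration, not part of the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One pre-training phase, using the values listed in the TLDR."""
    steps: int
    batch_size: int       # assumed to be the global batch size per step
    max_seq_length: int

    @property
    def sequences(self) -> int:
        # Number of training sequences processed over the whole phase.
        return self.steps * self.batch_size

    @property
    def max_tokens(self) -> int:
        # Upper bound: counts every position, including padding.
        return self.sequences * self.max_seq_length


PHASES = [
    Phase(steps=800_000, batch_size=256, max_seq_length=512),
    Phase(steps=100_000, batch_size=2048, max_seq_length=128),
]

for phase in PHASES:
    print(
        f"steps={phase.steps:,} batch_size={phase.batch_size} "
        f"max_seq_length={phase.max_seq_length}: "
        f"{phase.sequences:,} sequences, "
        f"at most {phase.max_tokens:,} token positions"
    )
```

Under these assumptions both phases process the same number of sequences (204,800,000), while the 512-token phase covers four times as many token positions as the 128-token phase.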
|
16 |
+
|
17 |
## Related work
|
18 |
|
19 |
### BERT Implementations
|