lfoppiano committed on
Commit
599c490
1 Parent(s): d46e33e

update readme

Files changed (1)
  1. README.md +13 -3
README.md CHANGED
@@ -1,9 +1,19 @@
- # Material SciBERT (TPU)
-
- Goal: Improving language understanding in materials science
+ # Material SciBERT (TPU): Improving language understanding in materials science
 
  **Work in progress**
 
+ ## Introduction
+ A SciBERT-based model pre-trained on full-text materials science articles.
+
+ ## TLDR
+ - Collected the full text of ~700,000 articles provided by the National Institute for Materials Science (NIMS) TDM platform (https://dice.nims.go.jp/services/TDM-PF/en/); the resulting dataset is called ScienceCorpus (SciCorpus)
+ - Added 100 domain-specific unknown words, extracted from SciCorpus with a keyword extractor (KeyBERT), to the SciBERT vocabulary (32k tokens)
+ - Starting conditions: original SciBERT weights
+ - Pre-trained the model MatTpuSciBERT on Google Cloud with TPUs (Tensor Processing Units) as follows:
+   - 800000 steps with batch_size: 256, max_seq_length: 512
+   - 100000 steps with batch_size: 2048, max_seq_length: 128
+ - Fine-tuned and tested on NER for superconductors (https://github.com/lfoppiano/grobid-superconductors) and physical quantities (https://github.com/kermitt2/grobid-quantities)
+
  ## Related work
 
  ### BERT Implementations
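
As a rough sanity check on the two pre-training phases listed in the TLDR of this commit, the sketch below works out how many sequences each phase processes and an upper bound on the tokens seen. It is not part of the README: the `Phase` class and its property names are hypothetical, and it assumes that `batch_size` counts sequences per step and that every sequence is filled to `max_seq_length`, which the README does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One pre-training phase (hypothetical helper, not from the README)."""
    steps: int
    batch_size: int       # assumed: sequences per training step
    max_seq_length: int   # maximum tokens per sequence

    @property
    def sequences(self) -> int:
        # Total sequences processed in this phase.
        return self.steps * self.batch_size

    @property
    def max_tokens(self) -> int:
        # Upper bound: assumes every sequence is filled to max_seq_length.
        return self.sequences * self.max_seq_length


# Values copied from the two bullets in the TLDR.
PHASES = [
    Phase(steps=800_000, batch_size=256, max_seq_length=512),
    Phase(steps=100_000, batch_size=2048, max_seq_length=128),
]

if __name__ == "__main__":
    for phase in PHASES:
        print(f"{phase}: {phase.sequences:,} sequences, "
              f"at most {phase.max_tokens:,} tokens")
    total = sum(phase.max_tokens for phase in PHASES)
    print(f"upper bound over both phases: {total:,} tokens")
```

Under these assumptions both phases process the same number of sequences (204,800,000); they differ only in sequence length, so the 512-token phase accounts for about four fifths of the token budget.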