Muthukumaran committed on
Commit
b00abdc
1 Parent(s): dcf0cad

Update README.md

  1. README.md +8 -1
README.md CHANGED
@@ -1,3 +1,10 @@
- This model uses sci-bert for initial embedding and is trained using masked language modeling (MLM). The corpus is roughly 100,000 earth science based publications.
+ This model is further trained on top of scibert-base using masked language modeling loss (MLM). The corpus is roughly 100,000 earth science-based publications.
+
+ The tokenizer used is AutoTokenizer, which is trained on the same corpus.
 
  Stay tuned for further downstream task tests and updates to the model.
+
+ in the works
+ - MLM + NSP task loss
+ - Add more data sources for training
+ - Test using downstream tasks
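
For readers unfamiliar with the masked language modeling (MLM) objective named in the README, the following is a minimal, framework-free sketch of how inputs and labels are typically prepared. It is not the author's training code. The 15% selection rate and the 80/10/10 replacement split are the convention from the original BERT paper and are assumptions here, since the README does not state which settings were used. The function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Prepare one MLM training example from a list of tokens.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    mask_prob=0.15 and the 80/10/10 split follow the BERT convention
    (an assumption; the README does not give these values).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict `tok` here.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            # Not selected: input is unchanged and contributes no loss.
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Usage with a toy sentence and vocabulary (both hypothetical).
tokens = "basalt forms when lava cools rapidly at the surface".split()
vocab = ["rock", "ocean", "crust"]
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

The loss is computed only at positions where the label is not `None`; all other positions pass through unchanged.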