shahrukhx01 committed on
Commit
3dd0c57
1 Parent(s): 9b766f8

Create README.md

Files changed (1)
  1. README.md +24 -0
README.md ADDED
@@ -0,0 +1,24 @@
+ ---
+ language: "en"
+ tags:
+ - chemical-domain
+ - safety-datasheets
+ widget:
+ - text: "The removal of mercaptans, and for drying of gases and [MASK]."
+ ---
+ # BERT for Chemical Industry
+ A BERT-based language model further pre-trained from the [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) checkpoint. The pre-training corpus combines more than 40,000 technical documents from the **Chemical Industrial domain**, ranging from Safety Data Sheets to Product Information Documents, with 13,000 Wikipedia Chemistry articles. The combined corpus has 250,000+ tokens from the Chemical domain, and the model was pre-trained on it with masked language modeling (MLM) over more than 9.2 million paragraphs.
+ - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
+ the entire masked sentence through the model and has to predict the masked words. This is different from traditional
+ recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
+ GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
+ sentence.
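The masking step described above can be illustrated with a minimal, self-contained sketch. It is a simplification under stated assumptions: it splits on whitespace and always substitutes `[MASK]`, whereas BERT-style pre-training operates on WordPiece tokens and sometimes keeps or randomly replaces the selected tokens. The function name `mask_tokens`, its parameters, and the sample sentence are illustrative and not part of this repository.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels[i] holds the original token at masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask a fixed share of positions, at least one so there is something to predict.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Illustrative sentence (assumption), tokenized naively on whitespace.
tokens = "used for the removal of mercaptans and for drying of gases".split()
masked, labels = mask_tokens(tokens, seed=0)
print(" ".join(masked))
print(labels)
```

The model sees the whole masked sequence at once, so each prediction can draw on context from both the left and the right of a masked position; `labels` marks which positions would contribute to the training loss.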
+ ```python
+ from transformers import pipeline
+ # Load a fill-mask pipeline backed by the chemical-domain checkpoint
+ fill_mask = pipeline(
+     "fill-mask",
+     model="recobo/chemical-bert-uncased",
+     tokenizer="recobo/chemical-bert-uncased",
+ )
+ # Returns candidate fillers for the [MASK] token
+ fill_mask("we create [MASK]")
+ ```
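For a single string with one `[MASK]`, the `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the top candidates. The helper `top_predictions` is hypothetical, and the entries in `example_results` are made-up placeholders that only mimic the output shape; they are not real predictions from this model.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked[:k]]


# Placeholder entries shaped like the pipeline output (NOT real model predictions).
example_results = [
    {"score": 0.10, "token": 2, "token_str": "candidate_b", "sequence": "we create candidate_b"},
    {"score": 0.25, "token": 1, "token_str": "candidate_a", "sequence": "we create candidate_a"},
    {"score": 0.05, "token": 3, "token_str": "candidate_c", "sequence": "we create candidate_c"},
]

for token_str, score in top_predictions(example_results, k=2):
    print(f"{token_str}: {score:.2f}")
```

With the pipeline loaded, the same helper could be applied as `top_predictions(fill_mask("we create [MASK]"))`.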