---
language: "en"
tags:
- chemical-domain
- safety-datasheets
widget:
- text: "The removal of mercaptans, and for drying of gases and [MASK]."
---
# BERT for Chemical Industry
A BERT-based language model further pre-trained from the checkpoint of [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased). We used a corpus of over 40,000 technical documents from the **Chemical Industrial domain**, including Safety Data Sheets and Product Information Documents, combined with 13,000 Wikipedia Chemistry articles. The resulting corpus of over 9.2 million paragraphs, containing 250,000+ Chemical-domain tokens, was used for pre-training with masked language modeling (MLM).
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
  the entire masked sentence through the model and has to predict the masked words. This is different from traditional
  recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models
  like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of
  the sentence; the sketch below makes the masking step concrete.
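To make the 15% masking concrete, the snippet below sketches random token masking with the model's own tokenizer. The `mask_tokens` helper and the 0.15 probability are illustrative assumptions, not the actual pre-training code.

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("recobo/chemical-bert-uncased")

def mask_tokens(sentence: str, mask_prob: float = 0.15) -> str:
    """Randomly replace ~15% of tokens with [MASK] (illustrative helper only)."""
    tokens = tokenizer.tokenize(sentence)
    masked = [
        tokenizer.mask_token if random.random() < mask_prob else token
        for token in tokens
    ]
    return tokenizer.convert_tokens_to_string(masked)

print(mask_tokens("The removal of mercaptans, and for drying of gases and liquids."))
```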
Use the model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the chemical-domain BERT model.
fill_mask = pipeline(
    "fill-mask",
    model="recobo/chemical-bert-uncased",
    tokenizer="recobo/chemical-bert-uncased",
)

fill_mask("we create [MASK]")
```
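The pipeline returns a list of candidate completions, each a dict with a `score`, the predicted `token_str`, and the filled-in `sequence`. A short usage sketch, reusing the Safety Data Sheet style sentence from the widget above:

```python
predictions = fill_mask(
    "The removal of mercaptans, and for drying of gases and [MASK]."
)
for pred in predictions:
    # Each entry carries the model's confidence and the predicted token.
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```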