---
license: bigscience-openrail-m
widget:
- text: O=C([C@@H](c1ccc(cc1)O)N)[MASK][C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C
datasets:
- ChEMBL
pipeline_tag: fill-mask
tags:
- biology
- medical
---

# BERT base for SMILES
This is a bidirectional transformer pretrained on SMILES (simplified molecular-input line-entry system) strings.

Example: Amoxicillin
```
O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C
```

Two training objectives were used: 
1. masked language modeling
2. molecular-formula validity prediction
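
To illustrate what a masked language modeling input looks like, here is a minimal sketch that splits a SMILES string into tokens and replaces one of them with `[MASK]`. The regular expression and the helper names `tokenize_smiles` and `mask_token` are assumptions made for illustration only; the checkpoint's own tokenizer may split SMILES differently, and this is not the preprocessing used to train the model.

```python
import re

# Assumption: a simple regex split (bracket atoms, Br/Cl, single letters,
# ring digits, bonds and parentheses). Not the checkpoint's actual tokenizer.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[()=#+\-/\\.@%:~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of token strings."""
    return SMILES_TOKEN.findall(smiles)


def mask_token(smiles, index, mask="[MASK]"):
    """Return the SMILES string with the token at `index` replaced by `mask`."""
    tokens = tokenize_smiles(smiles)
    tokens[index] = mask
    return "".join(tokens)


amoxicillin = "O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C"

# Mask the amide nitrogen (token 20 under this regex split)
masked = mask_token(amoxicillin, 20)
print(masked)
```

Under this split, masking token 20 hides the amide nitrogen, which is the atom the model would be asked to recover from the surrounding context.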

## Intended uses
This model is primarily intended to be fine-tuned for the following tasks:
- molecule classification
- molecule-to-gene-expression mapping
- cell targeting

## How to use in your code
```python
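from transformers import BertTokenizerFast, BertModel

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
checkpoint = 'unikei/bert-base-smiles'
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertModel.from_pretrained(checkpoint)

# Tokenize a SMILES string (amoxicillin) and run it through the model
example = 'O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C'
tokens = tokenizer(example, return_tensors='pt')
outputs = model(**tokens)
# outputs.last_hidden_state holds one embedding vector per input token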
```