---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
tags:
- chemistry
- biology
---
Chemlactica-125m is a [galactica-125m](https://huggingface.co/facebook/galactica-125m) model that has been continually pretrained on organic molecules.
It is pretrained on a (soon-to-be-released) corpus of 40B tokens covering 110M+ molecules from PubChem, together with their chemical properties
(molecular weight, synthetic accessibility score, drug-likeness, etc.)
and similarities (Tanimoto distance between ECFP fingerprints).

Example prompts:

`</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]` will attempt to predict the synthetic accessibility score of the given molecule.

`</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will attempt to generate a molecule that has an SAS score of 2.25 and
a similarity score of 0.62 to the given molecule.
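
As a rough illustration of how such a prompt could be sent to the model, here is a minimal sketch using the standard `transformers` loading and generation calls. The helper names `complete` and `truncate_at`, the `max_new_tokens` default, and the use of greedy decoding are assumptions made for this example, not settings documented by the authors. It also assumes that a predicted value is followed by its closing tag (for example `[/SAS]`), mirroring the format of the second prompt above.

```python
def truncate_at(text: str, closing_tag: str) -> str:
    """Return the part of `text` before `closing_tag` (all of it if the tag is absent)."""
    head, _, _ = text.partition(closing_tag)
    return head.strip()


def complete(prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation for `prompt` and return only the newly generated text.

    Hypothetical helper: imports are done lazily so that the pure-Python
    helper above can be used without `transformers` or `torch` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "yerevann/chemlactica-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Pass only the input ids to generate(); the prompt already starts with </s>.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and decode just the continuation.
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens)


# Example usage (downloads the model weights on first call):
# raw = complete("</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]")
# print(truncate_at(raw, "[/SAS]"))
```

For molecule generation, sampling (`do_sample=True`) may be a more natural choice than greedy decoding, since it yields different candidates on each call.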

The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. 

A preprint describing the model, along with an optimization algorithm built on top of it that sets a new state of the art on Practical Molecular Optimization
and other benchmarks, will be released soon.

A few notes:
* All queries should start with the `</s>` symbol.
* All numbers are rounded to two decimal places.
* All SMILES are canonicalized using `rdkit`.
* Available tags: `[CLOGP]`, `[WEIGHT]`, `[QED]`, `[SAS]`, `[TPSA]`, `[RINGCOUNT]`, `[SIMILAR]`...
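
The notes above can be folded into small prompt-building helpers. The following is a sketch only: the function names are hypothetical, and the exact `rdkit` canonicalization settings used during training are not stated in this card, so `canonicalize` simply uses RDKit's default `MolToSmiles` output, which may differ in form from the example prompts (for instance, by writing aromatic atoms in lowercase).

```python
def format_value(value: float) -> str:
    """Format a number with two decimal places, as the notes require."""
    return f"{value:.2f}"


def canonicalize(smiles: str) -> str:
    """Canonicalize a SMILES string with RDKit's default settings (an assumption)."""
    from rdkit import Chem  # imported lazily; only needed for this helper

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def property_prompt(smiles: str, tag: str) -> str:
    """Build a prompt asking the model to predict the property named by `tag`."""
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(sas: float, similarity: float, reference_smiles: str) -> str:
    """Build a prompt asking for a molecule with a target SAS and similarity."""
    return (
        f"</s>[SAS]{format_value(sas)}[/SAS]"
        f"[SIMILAR]{format_value(similarity)} {reference_smiles}[/SIMILAR]"
        "[START_SMILES]"
    )
```

Called with the values from the example prompts, `property_prompt("CC(=O)OC1=CC=CC=C1C(=O)O", "SAS")` and `generation_prompt(2.25, 0.62, "CC(=O)OC1=CC=CC=C1C(=O)O")` reproduce the two prompt strings shown earlier.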

The model is part of a three-model family: [Chemlactica-125M](https://huggingface.co/yerevann/chemlactica-125m),
[Chemlactica-1.3B](https://huggingface.co/yerevann/chemlactica-1.3b) and [Chemma-2B](https://huggingface.co/yerevann/chemma-2b).

We are looking forward to seeing the community use the model in new applications and contexts.