---
tags:
- molecular language model
- SELFIES
- molecule generation
widget:
  - text: "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
inference:
  parameters:
    repetition_penalty: 100
    num_return_sequences: 5
---
# MolGen
MolGen was introduced in the paper ["Molecular Language Model as Multi-task Generator"](https://arxiv.org/pdf/2301.11259.pdf) and first released in [this repository](https://github.com/zjunlp/MolGen). It is a pre-trained molecular generative model built on SELFIES, a 100% robust molecular language representation.

## Model description
MolGen is the first pre-trained model that only produces chemically valid molecules. 
With a training corpus of over 100 million molecules in SELFIES representation, MolGen learns the intrinsic structural patterns of molecules by mapping corrupted SELFIES to their original forms.
Specifically, MolGen employs a bidirectional Transformer as its encoder and an autoregressive Transformer as its decoder.
Through its carefully designed multi-task molecular prefix tuning (MPT), MolGen can generate molecules with desired properties, making it a valuable tool for molecular optimization.
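Because MolGen operates on SELFIES strings rather than SMILES, inputs typically have to be converted first. Below is a minimal sketch of that conversion using the third-party `selfies` package (not part of MolGen itself):

```python
import selfies as sf

# Convert a SMILES string (benzene) to its SELFIES representation.
benzene_selfies = sf.encoder("c1ccccc1")
print(benzene_selfies)  # "[C][=C][C][=C][C][=C][Ring1][=Branch1]"

# Any syntactically valid SELFIES string decodes back to a valid molecule,
# which is what makes the representation 100% robust.
print(sf.decoder(benzene_selfies))  # a SMILES string for benzene
```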

## Intended uses
You can use the raw model for molecule generation or fine-tune it on a downstream task. See the [repository](https://github.com/zjunlp/MolGen) for fine-tuning details on a task that interests you; a rough generic sketch follows below.
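The repository ships its own fine-tuning scripts, including the multi-task molecular prefix tuning described above. As a rough, generic alternative sketch (not the authors' method), the model can also be fine-tuned as an ordinary sequence-to-sequence model with the standard `transformers` Trainer; the dataset, hyperparameters, and output path below are placeholders:

```python
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen")

# Placeholder (input SELFIES, target SELFIES) pairs; substitute your own task data.
pairs = [
    ("[C][=C][C][=C][C][=C][Ring1][=Branch1]",
     "[C][=C][C][=C][C][=C][Ring1][=Branch1]"),
]

def encode(src, tgt):
    # Tokenize the source SELFIES and attach the tokenized target as labels.
    features = tokenizer(src, truncation=True, max_length=128)
    features["labels"] = tokenizer(tgt, truncation=True, max_length=128)["input_ids"]
    return features

train_dataset = [encode(src, tgt) for src, tgt in pairs]

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="molgen-finetuned",  # hypothetical output path
                                  per_device_train_batch_size=4,
                                  num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```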

### How to use
Molecule generation example:
```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen")

>>> sf_input = tokenizer("[C][=C][C][=C][C][=C][Ring1][=Branch1]", return_tensors="pt")
>>> # beam search
>>> molecules = model.generate(input_ids=sf_input["input_ids"],
                               attention_mask=sf_input["attention_mask"],
                               max_length=15,
                               min_length=5,
                               num_return_sequences=5,
                               num_beams=5)
>>> sf_output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).replace(" ","") for g in molecules]
['[C][=C][C][=C][C][=C][Ring1][=Branch1]',
'[C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]',
'[C][=C][C][=C][C][=C][Ring1][=Branch1][C][=C][C][=C]',
'[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][C][=C][C]',
'[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][=C][C][=C]']
```
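To inspect the generated molecules in a more familiar form, the SELFIES outputs can be decoded back to SMILES and checked with RDKit. This is a sketch that assumes the third-party `selfies` and `rdkit` packages and continues from the `sf_output` list above:

```python
import selfies as sf
from rdkit import Chem

for s in sf_output:
    smiles = sf.decoder(s)            # SELFIES -> SMILES
    mol = Chem.MolFromSmiles(smiles)  # parse with RDKit to sanity-check validity
    print(smiles, "valid" if mol is not None else "invalid")
```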


### BibTeX entry and citation info
```bibtex
@article{fang2023molecular,
  title={Molecular Language Model as Multi-task Generator},
  author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2301.11259},
  year={2023}
}
```