license: apache-2.0

ICMA version of galactica-125M for the text-based molecule generation task (Cap2Mol), from the paper "Large Language Models are In-Context Molecule Learners".

Note: For best performance, the input should contain 4 context examples, and the cutoff length should be set to 2048.

A simple inference example

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load the model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("phenixace/ICMA-Galactica-125M-M2C")
tk = AutoTokenizer.from_pretrained("phenixace/ICMA-Galactica-125M-M2C")

text = """Generate a molecule for the caption: The molecule is a dicarboxylic acid monoester that is the 21-(hydrogen succinate) derivative of 11-deoxycorticosterone. It is a 3-oxo-Delta(4) steroid, a 20-oxo steroid, a dicarboxylic acid monoester, a steroid ester and a hemisuccinate. It derives from an 11-deoxycorticosterone and a succinic acid.
Molecule: C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2C(=O)COC(=O)CCC(=O)O)CCC4=CC(=O)CC[C@]34C

Generate a molecule for the caption: The molecule is a fluorinated steroid that is 9-fluoropregna-1,4-diene substituted by hydroxy groups at positions 11, 17 and 21, a methyl group at position 16 and oxo groups at positions 3 and 20. It is a synthetic member of the class of glucocorticoids. It has a role as an adrenergic agent, an antiemetic, an antineoplastic agent, an environmental contaminant, a xenobiotic, an immunosuppressive agent and an anti-inflammatory drug. It is a fluorinated steroid, a 3-oxo-Delta(1),Delta(4)-steroid, a glucocorticoid, a 20-oxo steroid, an 11beta-hydroxy steroid, a 17alpha-hydroxy steroid and a 21-hydroxy steroid. It derives from a hydride of a pregnane.
Molecule: C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@@]4([C@]3([C@H](C[C@@]2([C@]1(C(=O)CO)O)C)O)F)C

Generate a molecule for the caption: The molecule is a fluorinated steroid that is pregn-4-ene substituted by a fluoro group at position 2, a methyl group at position 2 and oxo groups at positions 3, 11 and 20. It is a 3-oxo-Delta(4) steroid, an 11-oxo steroid, a 20-oxo steroid and a fluorinated steroid. It derives from a progesterone. It derives from a hydride of a pregnane.
Molecule: C[C@@H]1C[C@]2(C(=CC1=O)CC[C@@H]3[C@@]2(C(=O)C[C@]4([C@H]3CC[C@@H]4C(=O)C)C)F)C

Generate a molecule for the caption: The molecule is a steroid ester that is pregn-4-en-21-yl acetate substituted by oxo group at positions 3 and 20, a methyl group at position 6 and hydroxy groups at positions 11 and 17 respectively. It is a 3-oxo-Delta(4) steroid, a steroid ester, an 11beta-hydroxy steroid, a 17alpha-hydroxy steroid, a 20-oxo steroid and a tertiary alpha-hydroxy ketone. It derives from a hydride of a pregnane.
Molecule: C[C@H]1C[C@H]2[C@@H]3CC[C@@]([C@]3(C[C@@H]([C@@H]2[C@@]4(C1=CC(=O)CC4)C)O)C)(C(=O)COC(=O)C)O

Based on the above examples, analyse the similarities and differences between the examples and finally generate a molecule for the caption: The molecule is a steroid ester that is methyl (17E)-pregna-4,17-dien-21-oate substituted by oxo groups at positions 3 and 11. It is a 3-oxo-Delta(4) steroid, an 11-oxo steroid, a steroid ester and a methyl ester. It derives from a hydride of a pregnane."""

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.85,
    top_k=40,
    num_beams=1,
    repetition_penalty=1.0,
    pad_token_id=0,
)
inputs = tk(text, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True, num_return_sequences=1, max_new_tokens=256, generation_config=generation_config)

# decode
decoded = tk.decode(outputs.sequences[0], skip_special_tokens=True)
print(decoded)
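
The prompt in the example above follows a fixed layout: each context example is a caption line followed by a `Molecule:` line, examples are separated by a blank line, and the final instruction carries the query caption. A minimal sketch of assembling that layout programmatically is given below. The helper name `build_icma_prompt` and the placeholder captions are hypothetical and not part of the released code; the template strings are copied from the example above.

```python
from typing import List, Tuple


def build_icma_prompt(examples: List[Tuple[str, str]], query_caption: str) -> str:
    """Assemble a few-shot Cap2Mol prompt in the layout of the example above.

    `examples` is a list of (caption, SMILES) pairs used as context.
    This helper is illustrative only (hypothetical name, not from the repo).
    """
    # One block per context example: caption line, then molecule line.
    blocks = [
        f"Generate a molecule for the caption: {caption}\nMolecule: {smiles}"
        for caption, smiles in examples
    ]
    # Final instruction that asks the model for the query molecule.
    final = (
        "Based on the above examples, analyse the similarities and differences "
        "between the examples and finally generate a molecule for the caption: "
        f"{query_caption}"
    )
    # Blocks are separated by a single blank line.
    return "\n\n".join(blocks + [final])


# Toy usage with placeholder captions (real inputs should use 4 context examples).
prompt = build_icma_prompt(
    [("caption A", "CCO"), ("caption B", "CC(=O)O")],
    "caption Q",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `text`. To respect the 2048-token cutoff mentioned above, one option is to tokenize with `truncation=True, max_length=2048`, although truncation cuts from the end of the prompt by default, so it is safer to choose context examples that fit within the limit.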

Paper Link: https://arxiv.org/abs/2403.04197