Chemlactica-125m is a galactica-125m model continually pretrained on organic molecules. The pretraining corpus (to be released soon) contains 40B tokens covering more than 110M molecules from PubChem, together with their chemical properties (molecular weight, synthetic accessibility score, drug-likeness, etc.) and pairwise similarities (Tanimoto similarity between ECFP fingerprints).
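As a rough illustration of the similarity values mentioned above, the sketch below computes the Tanimoto coefficient (size of the intersection divided by size of the union) between two fingerprints represented as sets of on-bit indices. It assumes the model's similarity value is this usual coefficient. The bit indices are toy values, not real ECFP output; in practice a cheminformatics toolkit such as RDKit would produce the fingerprints.

```python
def tanimoto_similarity(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints (hypothetical bit indices, not real ECFP output).
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 101}

# 3 shared bits out of 6 distinct bits in total.
similarity = round(tanimoto_similarity(fp_a, fp_b), 2)
print(similarity)
```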

Example prompts:

</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS] will attempt to predict the synthetic accessibility score of the given molecule.

</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES] will attempt to generate a molecule that has an SAS score of 2.25 and a similarity score of 0.62 to the given molecule.
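The two prompts above can be assembled programmatically. The following is a minimal sketch in plain Python; the function names `property_prompt` and `generation_prompt` are hypothetical helpers introduced here for illustration and are not part of any released package.

```python
BOS = "</s>"  # every query starts with this symbol


def property_prompt(smiles: str, tag: str) -> str:
    """Prompt asking the model to predict the property named by `tag`."""
    return f"{BOS}[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(sas: float, similarity: float, reference_smiles: str) -> str:
    """Prompt asking for a molecule with a target SAS and a target
    similarity to `reference_smiles`. Numbers are written with two decimals."""
    return (
        f"{BOS}[SAS]{sas:.2f}[/SAS]"
        f"[SIMILAR]{similarity:.2f} {reference_smiles}[/SIMILAR]"
        "[START_SMILES]"
    )


mol = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(property_prompt(mol, "SAS"))
print(generation_prompt(2.25, 0.62, mol))
```

Both printed strings should match the example prompts character for character; the resulting text would then be passed to whatever tokenizer and generation call you use.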

The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts.
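One way to picture such a loop is a simple greedy search in which the best molecule found so far becomes the reference in the next prompt. This is only a sketch under stated assumptions and is not the authors' optimization algorithm. The `generate` and `score` callables are hypothetical: `generate` stands in for a call to the model that returns a SMILES string, and `score` for whatever objective you want to maximize. A stub replaces the model so the example runs on its own.

```python
from typing import Callable, Tuple


def optimize(
    seed_smiles: str,
    generate: Callable[[str], str],  # hypothetical: prompt -> generated SMILES
    score: Callable[[str], float],   # hypothetical: SMILES -> objective (higher is better)
    target_sas: float = 2.25,
    similarity: float = 0.62,
    steps: int = 10,
) -> Tuple[str, float]:
    """Greedy loop: keep the best molecule and reuse it as the prompt reference."""
    best, best_score = seed_smiles, score(seed_smiles)
    for _ in range(steps):
        prompt = (
            f"</s>[SAS]{target_sas:.2f}[/SAS]"
            f"[SIMILAR]{similarity:.2f} {best}[/SIMILAR][START_SMILES]"
        )
        candidate = generate(prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def fake_generate(prompt: str) -> str:
    """Stub standing in for the model: appends one carbon to the reference."""
    reference = prompt.split("[SIMILAR]")[1].split("[/SIMILAR]")[0].split(" ", 1)[1]
    return reference + "C"


# Toy objective: string length. A real objective would be a chemical property.
best, best_score = optimize("CCO", fake_generate, len, steps=3)
print(best, best_score)
```

A practical loop would likely keep a pool of candidates, sample several completions per prompt, and discard invalid SMILES; those details are left out here.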

A preprint will be released soon. It describes the model in detail, along with an optimization algorithm built on top of it that sets a new state of the art on Practical Molecular Optimization and other benchmarks.

A few notes:

  • All queries should start with the </s> symbol.
  • All numbers are rounded to two decimal places.
  • All SMILES are canonicalized using RDKit.
  • Available tags: [CLOGP], [WEIGHT], [QED], [SAS], [TPSA], [RINGCOUNT], [SIMILAR]...
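Given these conventions, values can be pulled back out of generated text with a couple of regular expressions. The sketch below assumes the model closes property tags in its output the same way they appear in prompts (for example `[SAS]…[/SAS]`) and ends generated molecules with `[END_SMILES]`. The helper names are hypothetical, and the sample string is the generation prompt from this card followed by a made-up completion, not real model output.

```python
import re
from typing import Optional


def extract_property(text: str, tag: str) -> Optional[float]:
    """Return the number between [TAG] and [/TAG], or None if absent."""
    pattern = rf"\[{re.escape(tag)}\](-?\d+(?:\.\d+)?)\[/{re.escape(tag)}\]"
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def extract_smiles(text: str) -> Optional[str]:
    """Return the first SMILES wrapped in [START_SMILES]...[END_SMILES]."""
    match = re.search(r"\[START_SMILES\](.*?)\[END_SMILES\]", text)
    return match.group(1) if match else None


prompt = (
    "</s>[SAS]2.25[/SAS]"
    "[SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]"
)
# Made-up completion for illustration only; not real model output.
text = prompt + "CCO[END_SMILES]"

print(extract_property(text, "SAS"))
print(extract_smiles(text))
```

Before scoring a generated molecule, it is worth checking that the extracted SMILES parses with a toolkit such as RDKit.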

The model is part of a three-model family: Chemlactica-125M, Chemlactica-1.3B and Chemma-2B.

We look forward to seeing the community use the model in new applications and contexts.

Safetensors · Model size: 125M params · Tensor type: F32