Molecule retrieval and editing using multimodal text-structure representations

Community Article Published February 26, 2024


Man-made materials are everywhere in our daily lives: from the prescription drugs we take when we get sick, to the pesticides and preservatives used in food production, to the clothes we wear and the transportation we use. The search space for molecular design is enormous. For drugs alone, estimates range from about 10^23 to 10^60 possible compounds, of which only about 10^8 have been synthesized to date. This makes molecular design one of the most fundamental and complex challenges facing mankind. With recent advances in AI and foundation models, we now have powerful tools to tackle it. Imagine the next generation of antibiotics, energy-dense batteries, or environmentally friendly plastics developed 10x faster than before. The potential of this discipline for positive impact on humanity and the planet is huge! In this edition of Frontier AI, join me as we explore the fascinating world of AI-powered material design.

Molecule editing by text prompts at a glance

There are several ways to represent chemical compounds on a computer. The most widely used are 2D graphs, 3D molecular structures, and a text representation called SMILES (and variants such as SELFIES and SAFE). In the context of molecular design, multi-modality consists of associating several of these representations with one another. The main challenge is to ensure that the embedded representations in the latent space actually share semantic properties across modalities. In other words, we want the embedding of a molecule's structure to be close to the embeddings of its text description and its SMILES string in the latent space.

Example representations of a molecule (aspirin). From left to right: 3D graph, 2D structure, SMILES and text description. Source: PubChem.
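To make the idea of "several representations of one molecule" concrete, here is a minimal sketch in plain Python. It stores aspirin once as a SMILES string and once as a hand-written 2D graph (heavy atoms as nodes, bonds as edges), then checks that the graph yields the expected molecular formula. The variable names, the graph encoding and the helper `molecular_formula` are illustrative assumptions, not part of any library; a real pipeline would typically use a cheminformatics toolkit to parse SMILES.

```python
from collections import Counter

# Aspirin as a SMILES string (text-like modality).
ASPIRIN_SMILES = "CC(=O)OC1=CC=CC=C1C(=O)O"

# The same molecule as a 2D graph, written by hand for illustration.
# Atom indices follow the order in which atoms appear in the SMILES string.
ATOMS = ["C", "C", "O", "O", "C", "C", "C", "C", "C", "C", "C", "O", "O"]
BONDS = [  # (atom_i, atom_j, bond_order)
    (0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1),
    (4, 5, 2), (5, 6, 1), (6, 7, 2), (7, 8, 1), (8, 9, 2), (9, 4, 1),  # ring
    (9, 10, 1), (10, 11, 2), (10, 12, 1),
]
VALENCE = {"C": 4, "O": 2}  # standard valences, enough for this molecule


def molecular_formula(atoms, bonds):
    """Derive the formula, filling implicit hydrogens from standard valences."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[a] - u for a, u in zip(atoms, used))
    return "".join(f"{el}{counts[el]}" for el in ("C", "H", "O"))


formula = molecular_formula(ATOMS, BONDS)
print(ASPIRIN_SMILES, "->", formula)
```

Both encodings describe the same compound; a multi-modal model has to learn that they (and the text description) should map to nearby points in the latent space.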

It turns out that generating multi-modal embeddings is not an easy task when each modality uses a different pretrained encoder. One way to solve this is contrastive learning between pairs of molecule-structure embeddings and text-description embeddings. The paper Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing is a great example of this. First, the authors generate embeddings of text and molecules using separate pretrained encoders. Then they project the embeddings into a single meaningful latent space via contrastive learning. The remarkable part of this process is the range of interesting applications that these embeddings enable. For example, we can find molecular structures in a dataset given a textual description (which is useful for drug screening and repurposing). We can also modify a molecule's structure with a textual prompt so that it satisfies a desired property in a zero-shot fashion (yes, that's possible!). The authors tested the technique using the widely known MoleculeNet benchmark, which is specifically designed to test machine learning methods that predict molecular properties such as solubility, toxicity, atomization energy, HOMO/LUMO, and more.

Description of the multi-modal (structure and text) contrastive learning approach to generate semantic embeddings. Source: Liu et al., 2023.
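The contrastive alignment step can be sketched in a few lines of plain Python. The snippet below implements a symmetric InfoNCE-style loss, which is one common contrastive objective; it is an illustrative assumption and not necessarily the exact loss used in the paper. The function name `info_nce`, the `temperature` value and the toy 2D embeddings are all made up for the example. Row `i` of each list is assumed to be a matching structure/text pair.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(struct_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings."""
    s = [normalize(v) for v in struct_embs]
    t = [normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarity between every structure and every text in the batch.
    logits = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    transposed = [list(col) for col in zip(*logits)]
    # Average the structure-to-text and text-to-structure directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


structures = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(structures, [[0.9, 0.1], [0.1, 0.9]])   # pairs match
shuffled_loss = info_nce(structures, [[0.1, 0.9], [0.9, 0.1]])  # pairs swapped
print(aligned_loss < shuffled_loss)
```

The loss is small when each structure is most similar to its own description and large when the pairs are mismatched, which is the signal that pulls matching embeddings together during training.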

Molecule retrieval with text prompts

Let's say you have a huge dataset of molecules (2D or 3D structures, or SMILES) and you want to find some that might be interesting for your application. In this case, all you have to do is generate embeddings of the structures and of your text query, then compute a similarity score between them to find the most promising candidates. This process is bi-directional across modalities, so you can also find a description that matches your structure, provided you have a representative dataset of the domain you are working on. The main applications of molecule retrieval are:

  • Verify that your molecule already exists or is in commercial use.
  • Find existing molecules that can be repurposed for your intended application.
  • Generate a text description of an unknown structure.

Schematic of multi-modal molecule retrieval (structure and text). Source: Liu et al., 2023.
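The retrieval step described above boils down to a nearest-neighbor search in the shared latent space. Here is a minimal sketch, assuming the embeddings have already been produced by the aligned encoders. The molecule names, the 3D toy vectors and the helper `retrieve` are hypothetical and only serve the illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieve(query_emb, candidate_embs, top_k=2):
    """Rank candidates by similarity to the query and keep the best top_k."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in candidate_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy structure embeddings (made up); in practice these come from the
# molecule encoder followed by its projection into the shared space.
molecule_embs = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.1, 0.8, 0.2],
    "mol_c": [0.7, 0.3, 0.1],
}
# Toy embedding of a text prompt, assumed to live in the same space.
text_query_emb = [1.0, 0.2, 0.0]

hits = retrieve(text_query_emb, molecule_embs)
for name, score in hits:
    print(name, round(score, 3))
```

Swapping the roles (a structure embedding as the query, text embeddings as candidates) gives the reverse direction, i.e. finding a description for a structure.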

Zero-shot molecule editing

Now suppose you have a known molecular structure and you want to modify it to improve some relevant property. For example, you may want to make a compound more water soluble, or increase or decrease a given mechanical property. In this setting, the authors first encode the original molecule and the text prompt describing the desired property. The next step is to directly learn a latent code that is simultaneously close to the embedding of the original molecule and to the embedding of the text description. This is done using two terms (a similarity score for the text and a distance for the structure) as the objective function. Formally:

$$
w = \arg\min_{w \in W} \left( -L_{\text{cosine-sim}}\big(g_2 f(w),\; p_t \circ f_t(x_t)\big) + \lambda \cdot L_{\ell_2}\big(w,\; f_g(x_{c,\text{in}})\big) \right)
$$

Here $W$ is the latent code space, $L_{\text{cosine-sim}}$ is the cosine similarity, $L_{\ell_2}$ is the $\ell_2$ distance, and $\lambda$ is a coefficient that balances the two terms. The resulting latent vector is closer to the desired text description while staying not too far from the original structure.
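To get a feel for this objective, here is a minimal sketch of the latent optimization using plain gradient descent. It makes several simplifying assumptions that are not from the paper: the mappings around the latent code and the text embedding are treated as identity, the distance term is the squared l2 distance (for a smooth gradient), and the gradient is derived by hand. The names `edit_latent`, `lam`, `lr` and the 2D toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def objective(w, text_emb, mol_emb, lam):
    """Negative cosine similarity to the text plus lam * squared l2 to the molecule."""
    sq_dist = sum((wi - zi) ** 2 for wi, zi in zip(w, mol_emb))
    return -cosine(w, text_emb) + lam * sq_dist


def edit_latent(mol_emb, text_emb, lam=1.0, lr=0.1, steps=200):
    """Gradient descent on the objective, starting from the molecule embedding."""
    w = list(mol_emb)
    for _ in range(steps):
        nw, nt, wt = norm(w), norm(text_emb), dot(w, text_emb)
        grad = [
            # d(-cos)/dw_i  +  d(lam * ||w - z||^2)/dw_i
            -(ti / (nw * nt) - wt * wi / (nw ** 3 * nt)) + 2.0 * lam * (wi - zi)
            for wi, ti, zi in zip(w, text_emb, mol_emb)
        ]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


mol_emb = [1.0, 0.0]   # toy embedding of the original molecule
text_emb = [0.0, 1.0]  # toy embedding of the text prompt
edited = edit_latent(mol_emb, text_emb)

print("similarity before:", cosine(mol_emb, text_emb))
print("similarity after: ", round(cosine(edited, text_emb), 3))
```

In this toy setting a larger `lam` keeps the edited code closer to the original molecule, while a smaller one lets it drift further toward the text prompt. If `lam` is too small the code can collapse toward the origin, where cosine similarity is undefined, so the balance matters.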

Schematic of zero-shot molecule editing. Source: Liu et al., 2023.

After decoding the latent code with a BERT-based pretrained decoder, we obtain the modified structure that satisfies the desired property specified by the text prompt. The figure below shows some examples of multi-objective molecule editing. Multi-objective means that two or more properties are optimized simultaneously (e.g. make the molecule soluble in water and reduce its permeability). This is possible in a zero-shot fashion because the optimization is done in the latent space, using the embedding of the text query. The modified fragment of the structure is highlighted in pink (original) and purple (replacement). The predicted properties are reported, confirming the increase or decrease requested by the text prompt.

Examples of zero-shot molecule editing with text prompts. Source: Liu et al., 2023.

The potential applications of molecular editing are many:

  • Slightly modify an existing compound to improve its properties (e.g., reduce resistance to an antibiotic).
  • Provide insight and explanation as to which functional groups are associated with the increase/decrease of a given property.
  • Generate non-trivial alternatives to existing patented compounds faster.

Going forward

And that's it! I hope you enjoyed reading this article and learned something new about the fascinating field of AI for drug discovery. Kudos to Shengchao Liu and team for this remarkable work! Here are the links to the artifacts (MIT licence :)):

Follow me on HF 🤗 and LinkedIn, and stay tuned for the next edition of Frontier AI!