ChemFIE-BED (ChemSELFIES Embedding)
ChemFIE-BED is a sentence-transformers model based on gbyuvd/chemselfies-base-bertmlm, fine-tuned on (for now) about 2 million pairs of valid molecules' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021) and ChemBL34 (Zdrazil et al. 2023). It maps compounds' Self-Referencing Embedded Strings (SELFIES) into a 320-dimensional dense vector space that can potentially be used for chemical similarity, similarity search, classification, clustering, and more.
Although more data is available for the model to train on, the test metrics on unseen data (combined natural products and bioactives) are already good enough for now, with a Spearman cosine of 0.9520 on the combined test set.
This model is the full implementation of Tom Aarsen's suggestions on the previous prototype model, now using my own pre-trained BERT and Matryoshka embeddings. For the latter, the model is trained at 320, 160, and 80 dimensions, so you can truncate the embeddings depending on your needs.
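As a rough illustration of what truncating a Matryoshka embedding means, here is a minimal sketch that keeps the first values of a vector and re-normalizes them. It assumes the embedding is already available as a plain list of floats; the helper name `truncate_embedding` and the toy 8-dimensional vector are hypothetical stand-ins, not part of this model or any library.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values of `vec` and L2-normalize the result."""
    if dim > len(vec):
        raise ValueError("dim cannot exceed the embedding length")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Guard against an all-zero prefix to avoid division by zero
    return [x / norm for x in head] if norm > 0 else head

# Toy 8-dimensional vector standing in for a 320-dimensional embedding
full = [0.5, -0.5, 0.5, 0.5, 0.1, 0.2, -0.3, 0.4]
half = truncate_embedding(full, 4)
print(len(half))  # 4
```

In practice the `truncate_dim` argument shown in the Usage section does the slicing for you; the sketch only makes the idea explicit.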
For more information:
- On SELFIES: blogpost or check out their GitHub.
- On Sentence Transformer:
- On Matryoshka embedding model:
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: gbyuvd/chemselfies-base-bertmlm
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 320 dimensions
- Similarity Function: Cosine Similarity
- Pooling: Mean pooling
- Training Dataset: SELFIES pairs generated from COCONUTDB and ChemBL34
- Language: SELFIES
- License: CC-BY-NC-SA 4.0
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
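The pooling layer above uses mean pooling (`pooling_mode_mean_tokens: True`). A minimal sketch of that operation follows, assuming token vectors and an attention mask are given as plain Python lists; the function name `mean_pool` and the toy values are illustrative only, and this is not the library's own implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding row does not affect the result, which is why SELFIES of different lengths can share a batch.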
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Specify preferred dimensions:
# 320, 160, or 80
dimensions = 320
# Download the model from the 🤗 Hub
model = SentenceTransformer("gbyuvd/chembed-chemselfies-bed", truncate_dim=dimensions)
# Run inference
sentences = [
'[C] [C] [=C] [C] [=C] [Branch2] [Ring2] [S] [C] [C] [N] [C] [=Branch1] [C] [=O] [C] [=C] [N] [Branch1] [C] [C] [C] [=C] [C] [=C] [Branch2] [Ring1] [Ring1] [S] [=Branch1] [C] [=O] [=Branch1] [C] [=O] [N] [C] [C] [C] [Branch1] [C] [C] [C] [C] [Ring1] [#Branch1] [C] [=C] [Ring1] [S] [C] [Ring2] [Ring1] [Branch1] [=O] [C] [=C] [Ring2] [Ring1] [P]',
'[O] [=C] [Branch1] [C] [O] [C] [C] [C] [C] [=C] [C] [=C] [C] [=C] [C] [=C] [C] [=C]',
'[C] [N] [C] [=N] [C] [Branch2] [Branch1] [C] [S] [=Branch1] [C] [=O] [=Branch1] [C] [=O] [N] [Branch1] [#Branch2] [C] [C] [C] [C] [N] [C] [C] [Ring1] [=Branch1] [C] [C] [C] [=C] [C] [Branch1] [Ring1] [C] [#N] [=C] [C] [=C] [Ring1] [Branch2] [N] [Branch1] [#Branch2] [C] [C] [=C] [N] [=C] [N] [Ring1] [Branch1] [C] [C] [Ring2] [Ring1] [Ring1] [=C] [Ring2] [Ring2] [Ring1]',
]
"""
0: CHEMBL1885710
1: CID78383937
2: CHEMBL234161
"""
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 320]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
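The example inputs above are SELFIES with a space between tokens, whereas SELFIES strings are often written compactly (for example `[C][C][=O]`). Assuming the model expects the spaced form shown in the example, a small helper along these lines could convert between the two. The name `space_selfies` is hypothetical, and the sketch does not handle `.`-separated fragments.

```python
import re

# One SELFIES token is a bracketed symbol such as [C] or [=Branch1]
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]")

def space_selfies(selfies_string):
    """Split a compact SELFIES string into space-separated tokens."""
    tokens = TOKEN_PATTERN.findall(selfies_string)
    # Every non-space character should belong to a bracketed token
    if "".join(tokens) != selfies_string.replace(" ", ""):
        raise ValueError("input is not a well-formed SELFIES string")
    return " ".join(tokens)

print(space_selfies("[C][C][=O]"))  # [C] [C] [=O]
```

Input that is already spaced passes through unchanged, so the helper is safe to apply to mixed data.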
Dataset
Dataset | Reference | Total Number of Pairs |
---|---|---|
COCONUTDB | (Sorokina et al. 2021) | 1,183,186 |
ChemBL34 (Part I) | (Zdrazil et al. 2023) | 1,064,858 |
Evaluation
Metrics
Semantic Similarity
- Dataset: combined-test
- Number of test pairs: 898,980
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9605 |
spearman_cosine | 0.9520 |
pearson_manhattan | 0.8788 |
spearman_manhattan | 0.8587 |
pearson_euclidean | 0.8802 |
spearman_euclidean | 0.8612 |
pearson_dot | 0.8414 |
spearman_dot | 0.8421 |
pearson_max | 0.9605 |
spearman_max | 0.9520 |
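To make the Pearson and Spearman rows concrete, here is a minimal sketch of both correlations computed between predicted similarity scores and gold labels. The scores below are made-up toy values, not taken from the test set, and the code is a plain-Python illustration rather than the evaluator's own implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

predicted = [0.91, 0.15, 0.47, 0.80]  # toy cosine similarities
gold = [0.95, 0.10, 0.30, 0.60]       # toy reference scores
print(round(spearman(predicted, gold), 4))  # 1.0
```

Spearman only looks at ordering, so the toy example scores a perfect 1.0 even though the raw values differ; Pearson on the same data is below 1.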
Recommendations
To fully utilize the model's capabilities for similarity search on a large dataset, I'd recommend using Meta's FAISS for rapid results, or any document retrieval framework you prefer.
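For intuition about what such a search returns, here is a brute-force sketch of top-k retrieval by cosine similarity in plain Python. It is a stand-in, not FAISS code: an inner-product index over L2-normalized vectors (such as FAISS's `IndexFlatIP`) should produce the same ranking far faster. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, database, k):
    """Return (index, cosine similarity) for the k best matches."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(database):
        v = normalize(vec)
        # The dot product of unit vectors equals their cosine similarity
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
hits = top_k([2.0, 0.1], database, k=2)
print([idx for idx, _ in hits])  # [0, 2]
```

Brute force is fine for a few thousand molecules; at the scale of hundreds of thousands, a dedicated index is the practical choice.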
Training Details
Training Hyperparameters
- optimizer: AdamW
- eval_strategy: epoch
- per_device_train_batch_size: 64
- per_device_eval_batch_size: 32
- weight_decay: 0.01
- num_train_epochs: 1
- warmup_ratio: 0.1
- dataloader_num_workers: 8
- Loss: MatryoshkaLoss with these parameters: { "loss": "CosineSimilarityLoss", "matryoshka_dims": [ 320, 160, 80 ], "matryoshka_weights": [ 1, 1, 1 ], "n_dims_per_step": -1 }
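The loss configuration can be read as follows: for each Matryoshka dimension, both embeddings in a pair are truncated, their cosine similarity is compared with the label by squared error, and the per-dimension losses are summed with the given weights. Below is a minimal plain-Python sketch of that idea, assuming `n_dims_per_step: -1` means every listed dimension is used at each step. The function name `matryoshka_cosine_loss` and the toy 4-dimensional data are illustrative, not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def matryoshka_cosine_loss(pairs, labels, dims, weights):
    """Weighted sum over `dims` of the mean squared error between
    cosine(u[:d], v[:d]) and the label."""
    total = 0.0
    for d, w in zip(dims, weights):
        errors = [(cosine(u[:d], v[:d]) - y) ** 2
                  for (u, v), y in zip(pairs, labels)]
        total += w * sum(errors) / len(errors)
    return total

# One toy pair with dims [4, 2] standing in for [320, 160, 80]
pairs = [([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0])]
labels = [0.5]
loss = matryoshka_cosine_loss(pairs, labels, dims=[4, 2], weights=[1, 1])
print(round(loss, 4))  # 0.25
```

In the toy pair the full vectors match the label almost exactly, while the 2-dimensional prefixes are identical (cosine 1.0), so all of the loss comes from the truncated view. This is the pressure that pushes the leading dimensions to be informative on their own.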
Training Logs
Natural Products
Epoch | Step | Training Loss | loss | NPiso-base-test_spearman_cosine |
---|---|---|---|---|
0.2771 | 4099 | 0.0243 | - | - |
0.5543 | 8198 | 0.0099 | - | - |
0.8314 | 12297 | 0.0083 | - | - |
1.0 | 14790 | - | 0.0074 | 0.9548 |
Combined I
Epoch | Step | Training Loss | loss | All-base-test_spearman_cosine |
---|---|---|---|---|
0.2737 | 4099 | 0.0111 | - | - |
0.5474 | 8198 | 0.0086 | - | - |
0.8212 | 12297 | 0.0077 | - | - |
1.0 | 14975 | - | 0.0072 | 0.9516 |
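The Epoch column in both tables is the logged step divided by the total number of steps in the single training epoch. A small sketch of that arithmetic is below; the estimate of pairs seen additionally assumes a single device and no gradient accumulation, which the source does not state.

```python
def epoch_fraction(step, steps_per_epoch):
    """Fraction of the epoch completed at a given optimizer step."""
    return step / steps_per_epoch

def approx_pairs(steps_per_epoch, batch_size):
    """Upper bound on pairs seen in one epoch (the last batch may be partial)."""
    return steps_per_epoch * batch_size

# Natural Products run: 14,790 steps at batch size 64
print(round(epoch_fraction(4099, 14790), 4))  # 0.2771
print(approx_pairs(14790, 64))                # 946560

# Combined I run: 14,975 steps at batch size 64
print(round(epoch_fraction(4099, 14975), 4))  # 0.2737
print(approx_pairs(14975, 64))                # 958400
```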
Testing The Generated Embedding to Find Similar Molecules
Using Atolypene A as the query molecule, I ran FAISS (Facebook AI Similarity Search) over the pre-embedded SELFIES representations of 0.5M molecules from COCONUTDB and HerbalDB to find the top 10 most similar molecules by cosine similarity. Generating the embeddings for that database took 50 minutes on my laptop's NVIDIA GeForce 930M (batch size of 64).
Top 10 (returned in 3.9 s, including visualization):
You can also take multiple inputs and average their embeddings to find the molecules most similar to the whole group. For example, taking 5 samples of MRSA antibiotics, including Vancomycin, Linezolid, Tigecycline, and Ceftobiprole,
and then querying for similar molecules based on the averaged embedding:
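A minimal sketch of the averaging step, assuming each query embedding is a plain list of floats: take the element-wise mean and L2-normalize it, then use the result as a single query vector. The name `mean_embedding` and the toy vectors are hypothetical.

```python
import math

def mean_embedding(vectors):
    """Element-wise mean of equal-length vectors, L2-normalized."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

# Two toy query embeddings standing in for several antibiotics
queries = [[1.0, 0.0], [0.0, 1.0]]
centroid = mean_embedding(queries)
print([round(x, 4) for x in centroid])  # [0.7071, 0.7071]
```

Whether to normalize each embedding before averaging is a design choice the source does not specify; normalizing first gives every query molecule equal influence on the direction of the centroid.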
Testing Generated Embeddings' Clusters
The plots below show how the model's embeddings (at this stage) cluster different classes of compounds, compared with MACCS fingerprints.
Using a perplexity of 20 over 5,500 iterations. 2D:
3D:
For a simpler separation between two classes, active nAChR-a4b2 agonists vs. anticoagulants (perplexity = 5):
And for more data points and classes (perplexity = 7):
Framework Versions
- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Accelerate: 0.33.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Citation
BibTeX
ChemFIE-Base
@software{chemfie_basebertmlm,
author = {GP Bayu},
title = {{ChemFIE Base}: Pretraining A Lightweight BERT-like model on Molecular SELFIES},
url = {https://huggingface.co/gbyuvd/chemselfies-base-bertmlm},
version = {1.0},
year = {2024},
}
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
COCONUTDB
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
ChemBL34
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChemBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
Contact & Support My Work
G Bayu (gbyuvd@proton.me)
This project has been quite a journey for me. I've dedicated many hours to it, and I would like to keep improving myself, this model, and future projects. However, financial and computational constraints can be challenging.
If you find my work valuable and would like to support my journey, please consider supporting me here. Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and explore more.
Thank you for checking out this model. I am more than happy to receive any feedback so that I can improve myself and the future models and projects I will be working on.