+---
+license: mit
+tags:
+- sentence-transformers
+- chemistry
+- molecular-similarity
+- cheminformatics
+- unsupervised-learning
+- smiles
+- feature-extraction
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+---
+# miniChembed-prototype
+This is a **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:
+- **ChEMBL34** (Zdrazil et al., 2023)
+- **COCONUTDB** (Sorokina et al., 2021)
+- **SuperNatural3** (Gallo et al., 2023)
+The model maps SMILES strings to a **320-dimensional dense vector space**, optimized for **molecular similarity search, clustering, and scaffold analysis without any supervision from bioactivity, property labels, or precomputed fingerprints**.
+Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules.
+The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.
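The redundancy-reduction idea can be made concrete with a small, dependency-free sketch of the standard Barlow Twins loss: standardize each embedding dimension over the batch, build the cross-correlation matrix between the two views, then penalize diagonal terms that differ from 1 and off-diagonal terms that differ from 0. The function names `standardize_columns` and `barlow_twins_loss` are hypothetical, and the exact normalization inside this model's custom `BarlowTwinsLoss` is not shown in this card, so the scale of `lambd` here is an assumption.

```python
import math


def standardize_columns(batch):
    """Return the batch column by column, each column with zero mean and unit variance."""
    n, d = len(batch), len(batch[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) or 1.0
        columns.append([(x - mean) / std for x in col])
    return columns


def barlow_twins_loss(view_a, view_b, lambd=5.0):
    """Barlow Twins loss for two (n x d) batches of embeddings given as nested lists."""
    n = len(view_a)
    cols_a = standardize_columns(view_a)
    cols_b = standardize_columns(view_b)
    on_diag, off_diag = 0.0, 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(a * b for a, b in zip(col_a, col_b)) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2  # invariance term
            else:
                off_diag += c_ij ** 2  # redundancy-reduction term
    return on_diag + lambd * off_diag
```

Two identical views with uncorrelated dimensions give a loss of zero, while views whose dimensions duplicate each other are penalized through the off-diagonal term.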
+---
+## Model Details
+### Architecture & Training
+| Attribute | Value |
+|----------|-------|
+| **Base architecture** | Custom RoBERTa-style transformer (4 layers, 320 hidden dim, 4 attention heads, ~4M params) |
+| **Initialization** | Random (not pretrained on text or chemistry) |
+| **Training objective** | **Barlow Twins**, redundancy-reduction via cross-correlation matrix |
+| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`) |
+| **Training data** | ~24K unique molecules → augmented into positive pairs |
+| **Sequence length** | 512 tokens |
+| **Embedding dimension** | 320 |
+| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
+| **Pooling** | Mean pooling over token embeddings |
+| **Similarity metric** | Cosine similarity |
+| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
+| **Learning rate** | 1e-4 |
+| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
+| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
+| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
+| **Training duration** | 5 epochs |
+| **Hardware** | Single NVIDIA 930MX GPU |
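As a rough illustration of the augmentation row above, the sketch below draws two independent randomized SMILES per molecule and treats them as a positive pair. The helper names `rdkit_random_smiles` and `make_positive_pairs` are hypothetical and are not taken from the training script. The RDKit import is kept inside the helper so that the pairing logic can be exercised with any other randomizer.

```python
def rdkit_random_smiles(smiles):
    """Return one randomized SMILES for the same molecule, or None if it cannot be parsed.

    Assumes RDKit is installed; it is imported lazily on first use.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, doRandom=True)


def make_positive_pairs(smiles_list, randomize=rdkit_random_smiles):
    """Build (view_a, view_b) pairs from two independent random views of each molecule."""
    pairs = []
    for smi in smiles_list:
        view_a, view_b = randomize(smi), randomize(smi)
        if view_a is None or view_b is None:
            continue  # skip molecules that cannot be parsed
        pairs.append((view_a, view_b))
    return pairs
```

With RDKit available, `make_positive_pairs(["c1ncccc1[C@@H]2CCCN2C"])` would return one pair of differently written SMILES for nicotine. The two strings may occasionally coincide for very small molecules.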
+### Architecture (SentenceTransformer format)
+```python
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
+  (1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+> 🔍 **Note**: The model was **not initialized from a language model**; it was trained from scratch on SMILES using only the Barlow Twins objective.
+---
+## Usage
+### Installation
+```bash
+pip install -U sentence-transformers rdkit
+```
+### Encoding Molecules
+```python
+from sentence_transformers import SentenceTransformer
+# Load from Hugging Face Hub
+model = SentenceTransformer("gbyuvd/miniChembed-prototype")
+# Encode SMILES
+sentences = [
+    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
+    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
+    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
+    'Nc1nc2cncc-2co1',                       # CID: 162789184
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)  # (4, 320)
+# Compute pairwise cosine similarities
+similarities = model.similarity(embeddings, embeddings)
+print(similarities)
+# tensor([[1.0000, 0.4342, 0.5141, 0.2582],
+#         [0.4342, 1.0000, 0.8779, 0.8886],
+#         [0.5141, 0.8779, 1.0000, 0.9551],
+#         [0.2582, 0.8886, 0.9551, 1.0000]])
+```
+High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
+> Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
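To show what such an index computes, here is a minimal brute-force sketch of top-k cosine search in plain Python. The names `l2_normalize` and `top_k_cosine` are hypothetical. An exact inner-product index over L2-normalized vectors returns the same ranking, so at scale this loop would be replaced by a FAISS index.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k_cosine(query, corpus, k=3):
    """Return [(index, cosine)] for the k corpus vectors most similar to the query."""
    q = l2_normalize(query)
    scored = (
        (idx, sum(a * b for a, b in zip(q, l2_normalize(vec))))
        for idx, vec in enumerate(corpus)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

With the `embeddings` array from the encoding example, `top_k_cosine(embeddings[0], embeddings, k=3)` would list the three nearest molecules to cytisine, including cytisine itself.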
+---
+## Comparison to Traditional Fingerprints
+| Feature | ECFP4 / MACCS | miniChembed-prototype |
+|--------|----------------|------------------------|
+| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
+| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
+| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
+| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |
+---
+## Training Summary
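+- **Objective**: Drive the cross-correlation matrix between the embeddings of two augmented views toward the identity: diagonal terms toward 1 (invariance to SMILES variation) and off-diagonal terms toward 0 (redundancy reduction).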
+- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) - mean(cross-molecule cosine)`
+  → Higher = better separation between intra- and inter-molecular similarity.
+- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
+- **Final health**: , indicating strong disentanglement.
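The health score defined above can be sketched as follows, assuming `views_a[i]` and `views_b[i]` are embeddings of two augmented SMILES of molecule `i`. The function names `cosine` and `barlow_health` are hypothetical. How the training script samples cross-molecule pairs is not stated in this card, so this version simply uses all of them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def barlow_health(views_a, views_b):
    """mean(same-molecule cosine) - mean(cross-molecule cosine); needs at least 2 molecules."""
    same, cross = [], []
    for i, a in enumerate(views_a):
        for j, b in enumerate(views_b):
            (same if i == j else cross).append(cosine(a, b))
    return sum(same) / len(same) - sum(cross) / len(cross)
```

A score near 1 means the two views of a molecule agree while different molecules are close to orthogonal. A score near 0 means the embeddings do not separate molecules at all, as in a collapsed representation.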
+---
+## Limitations
+- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
+- Input must be **valid SMILES**; invalid strings may produce erratic embeddings.
+- **Not trained on bioactivity data**, so similarity reflects structural resemblance as encoded in SMILES, not biological function.
+- Small-scale prototype (~24K); final version will scale to 2.1M molecules if proven effective.
+---
+## Reproducibility
+This model was trained using a custom script based on **Sentence Transformers v5.1.0**, with the following environment:
+- **Python**: 3.13.0
+- **sentence-transformers**: 5.1.0
+- **PyTorch**: 2.6.0
+- **RDKit**: 2023.09.3
+- **Optimizer**: Ranger21 (with epoch-aware warmup/warmdown)
+- **Loss**: Custom `BarlowTwinsLoss` (λ = 5.0)
+- **Augmentation**: RDKit-based stochastic SMILES
+Training code, config, and evaluation are available in this repo as `train_barlow.py` and `config.yaml`.
+---
+## Reference:
+Note that, unlike the referenced work, the method used here does not use a target network; instead, it builds the two views from RDKit-based enumeration of each molecule's SMILES.
+```bibtex
+@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
+      title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
+      author={Ömer Veysel Çağatan},
+      year={2024},
+      eprint={2401.15316},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2401.15316},
+}
+```
+---
+## Citation
+If you use this model, please cite:
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+  author = "Reimers, Nils and Gurevych, Iryna",
+  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+  year = "2019",
+  url = "https://arxiv.org/abs/1908.10084"
+}
+@article{sorokina2021coconut,
+  title={COCONUT online: Collection of Open Natural Products database},
+  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
+  journal={Journal of Cheminformatics},
+  volume={13},
+  number={1},
+  pages={2},
+  year={2021},
+  doi={10.1186/s13321-020-00478-9}
+}
+@article{zdrazil2023chembl,
+  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
+  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
+  journal={Nucleic Acids Research},
+  year={2023},
+  volume={gkad1004},
+  doi={10.1093/nar/gkad1004}
+}
+@misc{chembl34,
+  title={ChEMBL34},
+  year={2023},
+  doi={10.6019/CHEMBL.database.34}
+}
+@article{Gallo2023,
+  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
+  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
+  journal = {Nucleic Acids Research},
+  year = {2023},
+  month = jan,
+  day = {6},
+  volume = {51},
+  number = {D1},
+  pages = {D654-D659},
+  doi = {10.1093/nar/gkac1008}
+}
+```