gbyuvd committed · Commit 3d13461 · verified · 1 Parent(s): 2fe2af6

Update README.md

Files changed (1): README.md (+220 −3)
README.md CHANGED (the previous file contained only a `license: mit` front matter; the full updated README follows)
---
license: mit
tags:
- sentence-transformers
- chemistry
- molecular-similarity
- cheminformatics
- unsupervised-learning
- smiles
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# miniChembed-prototype

This is a **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:

- **ChEMBL34** (Zdrazil et al., 2023)
- **COCONUTDB** (Sorokina et al., 2021)
- **SuperNatural3** (Gallo et al., 2023)

The model maps SMILES strings to a **320-dimensional dense vector space**, optimized for **molecular similarity search, clustering, and scaffold analysis without any supervision from bioactivity, property labels, or precomputed fingerprints**.

Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules. The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.
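
To make the augmentation strategy concrete, here is a minimal sketch of how two randomized SMILES renderings of one molecule can be generated with RDKit to form a positive pair; the helper name `random_smiles_pair` is illustrative and not part of this repo's training code.

```python
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two randomized SMILES strings of the same molecule.

    Each rendering is a different syntactic 'view' of an identical structure,
    which is what the Barlow Twins objective treats as a positive pair.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    view_a = Chem.MolToSmiles(mol, doRandom=True, canonical=False)
    view_b = Chem.MolToSmiles(mol, doRandom=True, canonical=False)
    return view_a, view_b

# Example: two randomized renderings of nicotine
print(random_smiles_pair("c1ncccc1[C@@H]2CCCN2C"))
```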

---

## Model Details

### Architecture & Training

| Attribute | Value |
|----------|-------|
| **Base architecture** | Custom RoBERTa-style transformer (4 layers, 320 hidden dim, 4 attention heads, ~4M params) |
| **Initialization** | Random (not pretrained on text or chemistry) |
| **Training objective** | **Barlow Twins**, redundancy-reduction via cross-correlation matrix |
| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`) |
| **Training data** | ~24K unique molecules → augmented into positive pairs |
| **Sequence length** | 512 tokens |
| **Embedding dimension** | 320 |
| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
| **Pooling** | Mean pooling over token embeddings |
| **Similarity metric** | Cosine similarity |
| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
| **Learning rate** | 1e-4 |
| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
| **Training duration** | 5 epochs |
| **Hardware** | Single NVIDIA 930MX GPU |

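For orientation, below is a minimal sketch of a Barlow Twins loss in PyTorch using the λ = 5.0 off-diagonal weight from the table above; it illustrates the general objective and is not this repo's `BarlowTwinsLoss` implementation.

```python
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 5.0) -> torch.Tensor:
    """Redundancy-reduction loss over two batches of projected embeddings.

    z_a, z_b: (batch_size, dim) projections of two augmented SMILES views.
    """
    n = z_a.size(0)
    # Standardize each embedding dimension across the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Cross-correlation matrix between the two views: (dim, dim)
    c = (z_a.T @ z_b) / n
    # Diagonal should be 1 (invariance); off-diagonal should be 0 (redundancy reduction)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag

# Toy check with random "projections" of two views
z1, z2 = torch.randn(16, 2048), torch.randn(16, 2048)
print(barlow_twins_loss(z1, z2).item())
```
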
### Architecture (SentenceTransformer format)
```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

> 🔍 **Note**: The model was **not initialized from a language model**; it is trained from scratch on SMILES using only the Barlow Twins objective.

---

## Usage

### Installation
```bash
pip install -U sentence-transformers rdkit-pypi
```

### Encoding Molecules
```python
from sentence_transformers import SentenceTransformer

# Load from Hugging Face Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Encode SMILES
sentences = [
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    "Nc1nc2cncc-2co1",                       # CID: 162789184
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 320)

# Compute pairwise cosine similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4342, 0.5141, 0.2582],
#         [0.4342, 1.0000, 0.8779, 0.8886],
#         [0.5141, 0.8779, 1.0000, 0.9551],
#         [0.2582, 0.8886, 0.9551, 1.0000]])
```

High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation, not from explicit chemical knowledge or labeling.

> Tip: For large-scale similarity search, integrate the embeddings with Meta's FAISS, as sketched below.
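
A minimal sketch of that integration is shown below, assuming the `faiss-cpu` package is installed; the exact index type (`IndexFlatIP`) and the normalization step are illustrative choices, not recommendations from this repo.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Embed a library of SMILES and index it
library = [
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # Cytisine
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
]
emb = model.encode(library, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(emb)                  # unit-norm vectors, so inner product == cosine
index = faiss.IndexFlatIP(emb.shape[1])  # exact search; swap for IVF/HNSW indexes at scale
index.add(emb)

# Query with a new molecule
query = model.encode(["n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)
print(ids[0], scores[0])                 # nearest library entries and their cosine scores
```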

---

## Comparison to Traditional Fingerprints

| Feature | ECFP4 / MACCS | miniChembed-prototype |
|--------|----------------|------------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |

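For a quick side-by-side on a single pair of molecules, the sketch below computes an ECFP4 Tanimoto similarity with RDKit and the corresponding embedding cosine similarity; the chosen pair and fingerprint settings are illustrative only.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sentence_transformers import SentenceTransformer

smi_a = "c1ncccc1[C@@H]2CCCN2C"                # Nicotine
smi_b = "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4"  # Varenicline

# ECFP4 (Morgan, radius 2) Tanimoto similarity
fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi_a), 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi_b), 2, nBits=2048)
print("ECFP4 Tanimoto :", DataStructs.TanimotoSimilarity(fp_a, fp_b))

# Learned-embedding cosine similarity
model = SentenceTransformer("gbyuvd/miniChembed-prototype")
emb = model.encode([smi_a, smi_b])
print("Embedding cosine:", float(model.similarity(emb[0:1], emb[1:2])[0][0]))
```
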
---

## Training Summary

- **Objective**: Minimize off-diagonal terms in the cross-correlation matrix of augmented views.
- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) − mean(cross-molecule cosine)`
  → Higher = better separation between intra- and inter-molecular similarity (a computation sketch follows this list).
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: , indicating strong disentanglement.

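Below is a minimal sketch of how such a health score could be computed from two encoded views of a validation set; the helper name `barlow_health_score` and the pairing scheme are illustrative, not taken from this repo's evaluation code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def barlow_health_score(model, view_a, view_b):
    """mean(same-molecule cosine) - mean(cross-molecule cosine).

    view_a[i] and view_b[i] are two randomized SMILES of the same molecule.
    """
    emb_a = model.encode(view_a, normalize_embeddings=True)
    emb_b = model.encode(view_b, normalize_embeddings=True)
    sims = emb_a @ emb_b.T                                        # cosine matrix (unit-norm rows)
    same = float(np.diag(sims).mean())                            # same molecule, different rendering
    n = sims.shape[0]
    cross = float((sims.sum() - np.trace(sims)) / (n * (n - 1)))  # different molecules
    return same - cross
```
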
---

## Limitations

- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
- Input must be **valid SMILES**; invalid strings may produce erratic embeddings (a pre-validation sketch follows this list).
- **Not trained on bioactivity data**, so similarity reflects structural syntax, not biological function.
- Small-scale prototype (~24K molecules); the final version will scale to 2.1M molecules if proven effective.

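A minimal pre-validation sketch with RDKit is shown below; it reflects general good practice rather than a step required by this model.

```python
from rdkit import Chem

def clean_smiles(smiles_list):
    """Drop unparseable SMILES and canonicalize the rest before encoding."""
    cleaned = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:           # RDKit returns None for invalid SMILES
            print(f"Skipping invalid SMILES: {smi}")
            continue
        cleaned.append(Chem.MolToSmiles(mol))  # canonical form
    return cleaned

print(clean_smiles(["c1ncccc1[C@@H]2CCCN2C", "not_a_smiles"]))
```
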
---

## Reproducibility

This model was trained using a custom script based on **Sentence Transformers v5.1.0**, with the following environment:

- **Python**: 3.13.0
- **sentence-transformers**: 5.1.0
- **PyTorch**: 2.6.0
- **RDKit**: 2023.09.3
- **Optimizer**: Ranger21 (with epoch-aware warmup/warmdown)
- **Loss**: Custom `BarlowTwinsLoss` (λ = 5.0)
- **Augmentation**: RDKit-based stochastic SMILES enumeration

Training code, configuration, and evaluation are available in this repo as `train_barlow.py` and `config.yaml`.

---

## Reference

Note that the method used here does not rely on a target network; instead, it uses RDKit-based enumeration of each molecule's SMILES to generate the augmented views.

```bibtex
@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
      title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
      author={Ömer Veysel Çağatan},
      year={2024},
      eprint={2401.15316},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2401.15316},
}
```
---

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084"
}

@article{sorokina2021coconut,
    title={COCONUT online: Collection of Open Natural Products database},
    author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
    journal={Journal of Cheminformatics},
    volume={13},
    number={1},
    pages={2},
    year={2021},
    doi={10.1186/s13321-020-00478-9}
}

@article{zdrazil2023chembl,
    title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
    author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
    journal={Nucleic Acids Research},
    year={2023},
    volume={gkad1004},
    doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
    title={ChEMBL34},
    year={2023},
    doi={10.6019/CHEMBL.database.34}
}

@article{Gallo2023,
    author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
    title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
    journal = {Nucleic Acids Research},
    year = {2023},
    month = jan,
    day = {6},
    volume = {51},
    number = {D1},
    pages = {D654-D659},
    doi = {10.1093/nar/gkac1008}
}
```