
ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES

ChemFIE Base is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique and valid molecules taken from COCONUTDB and ChemBL34, with 7.3M masked examples generated in total. Despite having only 11M parameters, it achieves decent performance:

  • On varied masking:
    • Perplexity of 1.4029, MLM Accuracy of 88.83%
  • On uniform 15% masking:
    • Perplexity of 1.3276, MLM Accuracy of 90.55%

The pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.

The model demonstrated strong performance on both regression and classification tasks when fine-tuned with appropriate hyperparameters and optimizer, showing competitive results when compared to state-of-the-art models. Detailed evaluations can be found in the Regression Tasks and Classification Tasks sections below.

Fine-Tuned Variants

  • ChemFIE-BED: Sentence transformer for molecular similarity using SELFIES. Maps compounds to 320-dimensional vectors with Matryoshka embedding (320, 160, 80 dimensions). Fine-tuned on ~2 million SELFIES pairs. High performance: Pearson correlation 0.9605, Spearman 0.9520 for cosine similarity.

  • ChemFIE-SA: BERT-like sequence classifier for predicting synthesis accessibility using SELFIES. Fine-tuned on DeepSA expanded dataset from Wang et al. 2023. High performance across multiple metrics: F1-score of 0.947 and AUROC of 0.990 on expanded test set, outperforming most SMILES-based models. Excels on TS1 (F1-score: 0.996, AUROC: 1.000), competitive on TS2 and TS3.

  • ChemFIE-DTP: BERT-like sequence classifier for drug target prediction across 221 human protein classes. Processes SELFIES to predict targets. Trained on ~154,000 balanced examples from ChemBL34. Test metrics: accuracy 0.620, weighted precision 0.614, recall 0.620, F1 0.613. Potential for early-stage drug discovery, but needs more careful evaluation. Based on the model's training and evaluation losses, there is potential for improvement with further training; however, I cannot afford it at the moment.

Disclaimer: For Academic Purposes Only

The information and model provided are for academic purposes only. They are intended for educational and research use, and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.


Table of Contents

  1. Model Details
  2. Usage
  3. Bias and Limitations
  4. Background
  5. Training Details
  6. Evaluation
  7. Interpretability
  8. Technical Specifications
  9. Citation
  10. Contact & Support My Work

Model Details

Model Description

  • Model type: Transformer (BertForMaskedLM)
  • Language: SELFIES
  • Maximum Sequence Length: 512 tokens
  • License: CC-BY-NC-SA 4.0
  • Training Dataset: COCONUTDB and ChemBL34
  • Resources for more information:
    • GitHub Repository (coming soon)
    • Detailed article (coming soon)

Usage

Intended Use

You can use this model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

Direct Use

You can use this model directly with a pipeline for masked language modeling:

from transformers import pipeline

# text = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"

text = "[C] [C] [=Branch1] [C] [MASK] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"
  
mask_filler = pipeline("fill-mask", "gbyuvd/chemselfies-base-bertmlm")

mask_filler(text, top_k=5)

"""
[{'score': 0.9974672794342041,
  'token': 8,
  'token_str': '[=O]',
  'sequence': '[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
 {'score': 0.002122726757079363,
  'token': 34,
  'token_str': '[=S]',
  'sequence': '[C] [C] [=Branch1] [C] [=S] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
 {'score': 0.0002627855574246496,
  'token': 11,
  'token_str': '[=N]',
  'sequence': '[C] [C] [=Branch1] [C] [=N] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
 {'score': 8.700760372448713e-05,
  'token': 1,
  'token_str': '[C]',
  'sequence': '[C] [C] [=Branch1] [C] [C] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
 {'score': 2.8387064958224073e-05,
  'token': 2,
  'token_str': '[=C]',
  'sequence': '[C] [C] [=Branch1] [C] [=C] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'}]
"""

If you have SMILES instead, you can convert them to SELFIES first. Install the selfies library:

pip install selfies

then you can convert using:

import selfies as sf

def smiles_to_selfies_sentence(smiles):
    try:
        selfies = sf.encoder(smiles)  # Encode SMILES into SELFIES
        selfies_tokens = list(sf.split_selfies(selfies))
        
        # Join dots with the nearest next tokens
        joined_tokens = []
        i = 0
        while i < len(selfies_tokens):
            if selfies_tokens[i] == '.' and i + 1 < len(selfies_tokens):
                joined_tokens.append(f".{selfies_tokens[i+1]}")
                i += 2
            else:
                joined_tokens.append(selfies_tokens[i])
                i += 1
        
        selfies_sentence = ' '.join(joined_tokens)
        return selfies_sentence
    except sf.EncoderError as e:
        print(f"Encoder Error: {e}")
        return None

# Example usage:
in_smi = "CN(C)CCC(C1=CC=C(C=C1)Cl)C2=CC=CC=N2.C(=CC(=O)O)C(=O)O" # Chlorphenamine maleate
selfies_sentence = smiles_to_selfies_sentence(in_smi)
print(selfies_sentence)

"""
[C] [N] [Branch1] [C] [C] [C] [C] [C] [Branch1] [N] [C] [=C] [C] [=C] [Branch1] [Branch1] [C] [=C] [Ring1] [=Branch1] [Cl] [C] [=C] [C] [=C] [C] [=N] [Ring1] [=Branch1] .[C] [=Branch1] [#Branch1] [=C] [C] [=Branch1] [C] [=O] [O] [C] [=Branch1] [C] [=O] [O]
"""

Background

Three weeks ago, I had the idea to train a sentence transformer based on a chemical "language", which, as far as I could find at the time, did not yet exist. While trying to do so, I found a wonderful and human-readable molecular representation called SELFIES, developed by the Aspuru-Guzik group. I found this representation fascinating and worth exploring, due to its robustness and because it has so far proven versatile and easier to train models on. For more information on SELFIES, you could read this blogpost or check out their GitHub.

My initial attempt focused on training a sentence transformer based on SELFIES, with the goal of enabling rapid molecule similarity search and clustering. This approach potentially offers advantages over traditional fingerprinting algorithms like MACCS, as the embeddings are context-aware. I decided to fine-tune a relatively lightweight NLP-trained miniLM model by Nils Reimers, as I was unsure about training from scratch and didn't even know about pre-training at that time.

The next challenges were how to properly make molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from COCONUTDB. After some initial training, I pushed the model to Hugging Face to get some feedback. Gladly, Tom Aarsen provided valuable suggestions, including training a custom tokenizer, exploring Matryoshka embeddings, and considering training from scratch. Implementing Aarsen's suggestions, specifically training from scratch, is the main goal of this project, as well as a first experience for me.

Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project, and as such - beside my insufficient knowledge - it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I myself am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project in detail is coming soon.

Training Details

Training Data

Data Sources

The dataset combines two sources of molecular data:

  1. Natural compounds from COCONUTDB (Sorokina et al., 2021)
  2. Bioactive compounds from ChemBL34 (Zdrazil et al., 2023)
Data Preparation
  1. Fetching: Canonical SMILES were extracted from both databases.
  2. De-duplication:
    • Each dataset was de-duplicated internally.
    • The combined dataset ("All") was further de-duplicated to ensure unique entries.
  3. Validity Check and Conversion: A dual validity check was performed using RDKit and by converting the SMILES into SELFIES (see the sketch below).
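
As a rough illustration of this dual check (not the exact preprocessing script), a SMILES can be required to both parse in RDKit and encode into SELFIES; passes_dual_check below is a hypothetical helper:

import selfies as sf
from rdkit import Chem

def passes_dual_check(smiles):
    """Return the SELFIES string if the SMILES passes both checks, otherwise None."""
    # Check 1: RDKit must be able to parse the SMILES
    if Chem.MolFromSmiles(smiles) is None:
        return None
    # Check 2: the SMILES must be encodable as SELFIES
    try:
        return sf.encoder(smiles)
    except sf.EncoderError:
        return None

# Example usage: keep only molecules that pass both checks
smiles_list = ["CC(=O)OCC[N+](C)(C)C", "this-is-not-a-smiles"]
valid = []
for smi in smiles_list:
    slf = passes_dual_check(smi)
    if slf is not None:
        valid.append((smi, slf))
print(len(valid), "valid molecules")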
Filtering and Chunking
  • Filtering by Lipinski's Rule of Five or its subsets (e.g., Mw < 500 and LogP < 5) was omitted to maintain broader coverage for potential future expansion to organic and inorganic molecules such as those in PubChem and ZINC20.
  • The dataset was chunked into 13 parts, each containing 203,458 molecules, to accommodate the 6-hour time limit on Paperspace's Gradient.
  • Any leftover data was randomly distributed across the 13 chunks to ensure even distribution.
Validation Set
  • 10% of each chunk was set aside for validation.
  • These validation sets were combined into a main test set, totaling 810,108 examples.
| Dataset | Number of Valid Unique Molecules | Generated Training Examples |
|---|---|---|
| Chunk I | 207,727 | 560,859 |
| Chunk II | 207,727 | 560,859 |
| Chunk III | 207,727 | 560,859 |
| Chunk IV | 207,727 | 560,859 |
| Chunk V | 207,727 | 560,859 |
| Chunk VI | 207,727 | 560,859 |
| Chunk VII | 207,727 | 560,859 |
| Chunk VIII | 207,727 | 560,859 |
| Chunk IX | 207,727 | 560,859 |
| Chunk X | 207,727 | 560,859 |
| Chunk XI | 207,727 | 560,859 |
| Chunk XII | 207,727 | 560,859 |
| Chunk XIII | 207,738 | 560,889 |
| Total | 2,700,462 | 7,291,197 |

Training Procedure

Tokenizer Setup

The tokenizer is a combination of my own pretrained tokenizer on the merged COCONUTDB+ChemBL34 SELFIES dataset with vocabularies from zpn's word-level tokenizer trained on PubChem. This approach was chosen to ensure comprehensive coverage while maintaining relevance to biological compounds. The tokenizer was modified to suit the BertTokenizer format, using whitespace to split input tokens.

When using or fine-tuning this model, it's crucial to separate each SELFIES token with a whitespace. For example:

[C] [N] [C] [C] [C] [C@H1] [Ring1] [Branch1] [C] [=C] [N] [=C] [C] [=C] [Ring1] [=Branch1]

To ensure coverage, the tokenizer was evaluated against all tokens in the training data; unrecognized tokens were identified and incorporated into the tokenizer. Additionally, issues from my previous pre-training attempts, such as improper tokenization of dot-prefixed tokens in complex molecules (e.g., ".[Cl]"), were addressed and resolved.
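
As an illustrative sketch of that coverage check (not the exact script used), one could compare every token occurring in the whitespace-separated training SELFIES against the tokenizer's vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gbyuvd/chemselfies-base-bertmlm")

def find_unrecognized_tokens(selfies_sentences, tokenizer):
    """Return SELFIES tokens that are missing from the tokenizer vocabulary."""
    vocab = set(tokenizer.get_vocab().keys())
    seen = set()
    for sentence in selfies_sentences:
        seen.update(sentence.split())  # tokens are separated by whitespace
    return seen - vocab  # anything left here would be mapped to [UNK]

# Example usage on a tiny sample:
sample = ["[C] [N+1] [Branch1] [C] [C]", ".[Cl] [C] [=O]"]
print(find_unrecognized_tokens(sample, tokenizer))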

Generating Dynamic Masked Sequence

The key method in this project is the implementation of a dynamic masking rate based on molecular complexity. I think we can heuristically infer a molecule's complexity from the syntactic characteristics of its SELFIES. Simpler tokens have only one character, such as [N] (l = 1; ignoring the brackets), while a more complex one such as .[N+1] has l = 4. Atoms that are rare relative to SPONCH, like [Na] (l = 2), and ionized metals like [Fe+3] (l = 4), also vary in complexity. To capture the density of multi-character tokens, we sum the ratio of each token's length to the number of tokens in the molecule; I will refer to this simple score as the "complexity score" hereafter. We can then normalize it and use it to determine a variable masking probability ranging from 15% to 45%. Additionally, we can employ three different masking strategies to introduce further variability. This approach aims to create a more challenging and diverse training dataset while getting the most out of it, potentially leading to a more robust and generalizable model for molecular representation learning. Each SELFIES string's complexity is calculated as the logarithm of the sum of token-length ratios over the sequence length.

1. Complexity Score Calculation

The raw complexity score is calculated using the formula:

Sc = \log\left[\sum\left(\frac{l_{\text{token}}}{n_{\text{tokens}}}\right)\right]

Example outputs:

Sentence A:
Tokens: ['[C]', '[C]', '[=Branch1]', '[C]', '[=O]', '[O]', '[C]']
Sum of token lengths: 29
Number of tokens: 7
Raw complexity score: 1.4214

==================================================

Sentence B:
Tokens: ['[C]', '[N+1]', '[Branch1]', '[C]', '[C]', '[Branch1]', '[C]', '[C]', '[C]']
Sum of token lengths: 41
Number of tokens: 9
Raw complexity score: 1.5163

2. Normalization

The raw score is then normalized to a range of 0-1 using predefined minimum (1.39) and maximum (1.69) normalization values, which were determined from the dataset's score distribution:

Sc_{norm} = \max\left(0, \min\left(1, \frac{Sc - min_{norm}}{max_{norm} - min_{norm}}\right)\right)

3. Mapping to Masking Probability

I decided to use a quadratic mapping with a 0.3 range, ensuring a smooth masking probability adjustment between 15% and 45%, with more complex molecules receiving a higher masking probability:

P_{\text{mask}} = 0.15 + 0.3 \cdot (Sc_{norm})^2
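
Putting steps 1-3 together, here is a minimal sketch of the scoring and mapping based on my reading of the formulas above (illustrative, not the exact training code):

import math

MIN_NORM, MAX_NORM = 1.39, 1.69  # normalization bounds taken from the dataset's score distribution

def masking_probability(selfies_sentence):
    tokens = selfies_sentence.split()
    # 1. Raw complexity score: log of the summed token-length ratios
    raw_score = math.log(sum(len(tok) / len(tokens) for tok in tokens))
    # 2. Normalize to [0, 1] with the predefined bounds
    norm_score = max(0.0, min(1.0, (raw_score - MIN_NORM) / (MAX_NORM - MIN_NORM)))
    # 3. Quadratic mapping to a masking probability between 15% and 45%
    return 0.15 + 0.3 * norm_score ** 2

print(masking_probability("[C] [C] [=Branch1] [C] [=O] [O] [C]"))                # ~0.153 (Sentence A)
print(masking_probability("[C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"))  # ~0.203 (Sentence B)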

4. Multi-Strategy Masking

Three different masking strategies are employed for each SELFIES string:

  • Main Strategy:
    • 80% chance to mask the token
    • 10% chance to keep the original token
    • 10% chance to replace with a random token
  • Alternative Strategy 1:
    • 70% chance to mask the token
    • 15% chance to keep the original token
    • 15% chance to replace with a random token
  • Alternative Strategy 2:
    • 60% chance to mask the token
    • 20% chance to keep the original token
    • 20% chance to replace with a random token

5. Data Augmentation

  • Each SELFIES string is processed three times, once with each masking strategy.
  • This hopefully triples the effective dataset size and introduces variability in the masking patterns.

6. Masking Process

  • Tokens are randomly selected for masking based on the calculated masking probability.
  • Special tokens ([CLS] and [SEP]) are never masked.
  • The number of tokens to be masked is determined by the masking probability and the length of the SELFIES string.
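
A hedged sketch of how a single masking pass might apply one of these strategies to a tokenized SELFIES string is shown below; the real pipeline works on token IDs and handles the special tokens, so treat this as an illustration only:

import random

STRATEGIES = {
    "main":  (0.80, 0.10),  # (P(mask), P(keep)); the remainder is P(random replacement)
    "alt_1": (0.70, 0.15),
    "alt_2": (0.60, 0.20),
}

def apply_masking(tokens, mask_prob, vocab, strategy="main", mask_token="[MASK]"):
    p_mask, p_keep = STRATEGIES[strategy]
    n_to_mask = max(1, round(mask_prob * len(tokens)))   # number of positions to corrupt
    positions = random.sample(range(len(tokens)), n_to_mask)
    masked, labels = list(tokens), [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]                # original token becomes the MLM label
        roll = random.random()
        if roll < p_mask:
            masked[pos] = mask_token             # e.g. 80%: replace with [MASK]
        elif roll < p_mask + p_keep:
            pass                                 # e.g. 10%: keep the original token
        else:
            masked[pos] = random.choice(vocab)   # e.g. 10%: replace with a random token
    return masked, labels

tokens = "[C] [C] [=Branch1] [C] [=O] [O] [C]".split()
masked, labels = apply_masking(tokens, mask_prob=0.15, vocab=["[C]", "[O]", "[N]"])
print(masked)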

This methodology aims to create a diverse and challenging dataset for masked language modeling of molecular structures, adapting the masking intensity to the complexity of each molecule and employing multiple masking strategies to improve model robustness and generalization. Also, besides masking differently based on complexity scores, the on-the-fly data generation might ensure that the data are masked differently across runs and batches, though this needs further confirmation.

Training Hyperparameters

  • Batch size = 128
  • Num of Epoch:
    • 1 epoch for all chunks
    • another epoch on selected chunks (which still contained some samples from chunks excluded due to overfitting tendencies)
  • Total steps on all chunks = 70,619
  • Training time on each chunk = 03h:24m / ~205 mins

I am using Ranger21 optimizer with these settings:

Core optimizer = madgrad
Learning rate of 2e-05

num_epochs of training = ** 1 epochs **

using AdaBelief for variance computation
Warm-up: linear warmup, over 964 iterations (0.22)

Lookahead active, merging every 5 steps, with blend factor of 0.5
Norm Loss active, factor = 0.0001
Stable weight decay of 0.01
Gradient Centralization = On

Adaptive Gradient Clipping = True
    clipping value of 0.01
    steps for clipping = 0.001

I turned off the warm-down, since in prior experiments it led to loss instability in my case. For more information about Ranger21, you could check out this repository.
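
For reference, a minimal sketch of how such a configuration might be instantiated; the keyword argument names here are assumptions based on the Ranger21 repository and may differ between versions, so check the repository rather than treating this as my exact training code:

from ranger21 import Ranger21

# model and train_loader are assumed to be defined elsewhere
optimizer = Ranger21(
    model.parameters(),
    lr=2e-5,                                  # learning rate used above
    num_epochs=1,
    num_batches_per_epoch=len(train_loader),  # Ranger21 uses this to schedule warm-up
    use_madgrad=True,                         # MADGRAD as the core optimizer
    warmdown_active=False,                    # warm-down disabled (it destabilized losses for me)
    weight_decay=0.01,                        # stable weight decay
)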

Evaluation

  • Dataset: main-eval
  • Number of test examples: 810,108

Varied Masking Test

1st Epoch

| Chunk | Avg Loss | Perplexity | MLM Accuracy |
|---|---|---|---|
| I-IV | 0.4547 | 1.5758 | 0.851 |
| V-VIII | 0.4224 | 1.5257 | 0.864 |
| IX-XIII | 0.3893 | 1.4759 | 0.876 |

2nd Epoch

| Chunk | Avg Loss | Perplexity | MLM Accuracy |
|---|---|---|---|
| I-II | 0.3659 | 1.4418 | 0.8793 |
| VII | 0.3386 | 1.4029 | 0.8883 |

Uniform 15% Masking Test (80%:10%:10%)

1st Epoch

| Chunk | Avg Loss | Perplexity | MLM Accuracy |
|---|---|---|---|
| XII | 0.3349 | 1.3978 | 0.8929 |

2nd Epoch

| Chunk | Avg Loss | Perplexity | MLM Accuracy |
|---|---|---|---|
| - | 0.2834 | 1.3276 | 0.9055 |

Downstream Task(s)

Sentence Transformer

Semantic Similarity

For more information, check out the fine-tuned model here.

| Metric | Value |
|---|---|
| pearson_cosine | 0.9605 |
| spearman_cosine | 0.9520 |
| pearson_manhattan | 0.8788 |
| spearman_manhattan | 0.8587 |
| pearson_euclidean | 0.8802 |
| spearman_euclidean | 0.8612 |
| pearson_dot | 0.8414 |
| spearman_dot | 0.8421 |
| pearson_max | 0.9605 |
| spearman_max | 0.9520 |

Molecular Property Prediction Tasks

ChemFIE was evaluated on both regression and classification tasks from the MoleculeNet benchmarks (Wu et al., 2018). For all tasks, data was split using an 80:10:10 ratio for train:validation:test sets. To ensure robust results, each task was trained 5 times with different random seeds. The reported results are the average performance across these 5 runs, with standard deviation provided.
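
As an illustration of this protocol (not the exact code), each of the five runs can split the data with a different random seed, e.g. with scikit-learn:

from sklearn.model_selection import train_test_split

def split_80_10_10(examples, labels, seed):
    # 80% train, then split the remaining 20% evenly into validation and test
    X_train, X_rest, y_train, y_rest = train_test_split(
        examples, labels, test_size=0.2, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

seeds = [0, 1, 2, 3, 4]  # placeholder seeds for the five runs; metrics are averaged across them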

Comparison data is sourced from Soares et al. (2024), the authors of the recently released SMI-TED.

Regression Tasks
  1. FreeSolv: Hydration free energy of small molecules in water
  2. ESOL: Water solubility (log solubility in mols per litre) for common organic small molecules
  3. Lipophilicity: Octanol/water distribution coefficient (logD at pH 7.4)

Fine-tuning parameters:

  • FreeSolv: Batch size = 64, Epochs = 40
  • ESOL: Batch size = 64, Epochs = 32
  • Lipophilicity: Batch size = 128, Epochs = 37

All regression tasks used a learning rate of 1.5e-5 with the Ranger21 (core optimizer using Madgrad) optimizer. Root Mean Square Error (RMSE) was used as the primary evaluation metric.
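
A hedged sketch of the regression fine-tuning setup: a single-output regression head on top of the pre-trained encoder, optimized with Ranger21 at the hyperparameters listed above. train_loader (batches with input_ids, attention_mask, and float labels) is assumed to be prepared separately; this is not the exact fine-tuning script.

from transformers import AutoModelForSequenceClassification
from ranger21 import Ranger21

model = AutoModelForSequenceClassification.from_pretrained(
    "gbyuvd/chemselfies-base-bertmlm",
    num_labels=1,                  # one continuous target -> regression head with MSE loss
    problem_type="regression",
)

num_epochs = 32                    # e.g. ESOL: batch size 64, 32 epochs
optimizer = Ranger21(model.parameters(), lr=1.5e-5,
                     num_epochs=num_epochs, num_batches_per_epoch=len(train_loader))

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:     # each batch: input_ids, attention_mask, labels (float)
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()

RMSE is then simply the square root of the mean squared error on the held-out test set.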

| Method | FreeSolv | ESOL | Lipophilicity |
|---|---|---|---|
| D-MPNN | 2.18 ± 0.91 | 0.98 ± 0.26 | 0.65 ± 0.05 |
| N-Gram | 2.688 ± 0.085 | 1.074 ± 0.107 | 0.812 ± 0.028 |
| PretrainGNN | 2.764 ± 0.002 | 1.100 ± 0.006 | 0.739 ± 0.003 |
| GROVER_{Large} | 2.272 ± 0.051 | 0.895 ± 0.017 | 0.823 ± 0.010 |
| ChemBERTa-2 | - | 0.89 | 0.8 |
| SPMM | 1.907 ± 0.058 | 0.818 ± 0.008 | 0.692 ± 0.008 |
| MolCLR_{GIN} | 2.20 ± 0.20 | 1.11 ± 0.01 | 0.65 ± 0.08 |
| GPT-GNN | 2.83 ± 0.12 | 1.22 ± 0.02 | 0.74 ± 0.00 |
| MoLFormer | 2.342 ± 0.052 | 0.880 ± 0.028 | 0.700 ± 0.012 |
| SMI-TED289M (Fine-tuned) | 1.2233 ± 0.0029 | 0.6112 ± 0.0096 | 0.5522 ± 0.0194 |
| ChemFIE | 1.0832 ± 0.2292 | 0.5779 ± 0.0522 | 0.7104 ± 0.0376 |

The results show that ChemFIE achieved the lowest RMSE among the listed methods on FreeSolv and ESOL, while remaining competitive on Lipophilicity.

Classification Tasks
  1. BBBP: Blood-Brain Barrier Penetration
  2. HIV: Inhibition of HIV replication
  3. BACE: β-secretase 1 (BACE-1) inhibition
  4. ClinTox: Clinical trial toxicity (Work in Progress)

Fine-tuning parameters:

  • BBBP: Batch size = 64, Epochs = 3, Learning rate = 1.5e-5, random split
  • HIV: Batch size = 128, Epochs = 7, Learning rate = 5e-6, Warm-up = 50%, stratified split
  • BACE: Batch size = 32, Epochs = 9, Learning rate = 5e-6, Warm-up = 50%, stratified split

All classification tasks used the Ranger21 (Madgrad) optimizer. The primary evaluation metric is ROC-AUC (Area Under the Receiver Operating Characteristic Curve).
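
For evaluation, ROC-AUC can be computed from the positive-class probabilities with scikit-learn; a sketch assuming a two-class classification head and that model, test_input_ids, test_attention_mask, and test_labels are already prepared:

import torch
from sklearn.metrics import roc_auc_score

model.eval()
with torch.no_grad():
    logits = model(input_ids=test_input_ids, attention_mask=test_attention_mask).logits
pos_probs = torch.softmax(logits, dim=-1)[:, 1]      # probability of the positive class
print("ROC-AUC:", roc_auc_score(test_labels, pos_probs.numpy()))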

| Method | BBBP | HIV | BACE | ClinTox |
|---|---|---|---|---|
| GraphMVP | 72.4 ± 1.6 | 77.0 ± 1.2 | 81.2 ± 0.9 | 79.1 ± 2.8 |
| GEM | 72.4 ± 0.4 | 80.6 ± 0.9 | 85.6 ± 1.1 | 90.1 ± 1.3 |
| GROVER_{Large} | 69.5 ± 0.1 | 68.2 ± 1.1 | 81.0 ± 1.4 | 76.2 ± 3.7 |
| ChemBERTa | 64.3 | 62.2 | - | 90.6 |
| ChemBERTa-2 | 71.94 | - | 85.1 | 90.7 |
| Galatica 30B | 59.6 | 75.9 | 72.7 | 82.2 |
| Galatica 120B | 66.1 | 74.5 | 61.7 | 82.6 |
| Uni-Mol | 72.9 ± 0.6 | 80.8 ± 0.3 | 85.7 ± 0.2 | 91.9 ± 1.8 |
| MolFM | 72.9 ± 0.1 | 78.8 ± 1.1 | 83.9 ± 1.1 | 79.7 ± 1.6 |
| MoLFormer | 73.6 ± 0.8 | 80.5 ± 1.65 | 86.3 ± 0.6 | 91.2 ± 1.4 |
| SMI-TED289M (Fine-tuned) | 92.26 ± 0.57 | 76.85 ± 0.89 | 88.24 ± 0.50 | 94.27 ± 1.83 |
| ChemFIE | 94.98 ± 1.29 | 82.00 ± 0.88 | 86.05 ± 1.36 | WIP |

Work is currently in progress (WIP) for the remaining classification task (ClinTox). Results for this task will be updated as they become available.

Additional Validations
  • Multi-class single label classification: Results can be found in ChemFIE-DTP.
  • Multi-label classification: To be implemented in future work

Interpretability

Using acetylcholine as an example, with its quaternary amine nitrogen ([N+1]) masked for Captum's visualization.

Attention Head Visualization

Using BertViz:

from transformers import AutoTokenizer, AutoModel, utils
from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = "gbyuvd/chemselfies-base-bertmlm"  
input_text = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"
model = AutoModel.from_pretrained(model_name, output_attentions=True)  # Configure model to return attention values
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens, display_mode="light")  # Display model view

These are the first four layers of self-attention:

[Figure: model view of self-attention across the first four layers]

For a more detailed visualization, we can try splitting acetylcholine into two sentences (as in NSP):

from bertviz import head_view

sentence_a = "[C] [C] [=Branch1] [C] [=O] [O] [C]"
sentence_b = "[C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list) 
head_view(attention, tokens, sentence_b_start)

[Figure: head view of attention between the two segments]

Neuron Views

Using the PyTorch version of this model; you need to download it and place it in a local directory along with the tokenizer:

from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model_type = 'bert'
model_version = './dir'
do_lower_case = False

sentence_a = "[C] [C] [=Branch1] [C] [=O] [O] [C]"
sentence_b = "[C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=2, head=0, html_action='return')

[Figure: neuron view of a selected layer and attention head]

Attributions in Determining Masked Tokens

Using Captum

[Figure: Captum attribution scores for predicting the masked [N+1] token]

Bias and Limitations

  • The model is trained on a relatively small subset of the total available chemical space (compared to ZINC20 and PubChem), which may introduce biases based on the training data composition.
  • Only 13.2% of the model's total vocabulary is covered by tokens in its training data. This may result in suboptimal performance for tokens not seen in COCONUTDB and/or ChemBL34, particularly:
    • Heavy metal-containing compounds
    • Isotope-containing compounds
    • Other rare or specialized chemical entities
  • Although the model hasn't been trained on many rarer tokens due to the limited dataset, its vocabulary is comprehensive enough to include these tokens. This means you can potentially fine-tune the model on datasets containing these rarer molecules, using the existing vocabulary setup.

Technical Specifications

Model Architecture and Objective

  • Layers: 8
  • Attention Heads: 4
  • Hidden Size: 320
  • Intermediate Size: 1280 (4H)
  • Attention Type: SDPA

Compute Infrastructure

Hardware

  • Platform: Paperspace's Gradient
  • Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)

Software

  • Python: 3.9.13
  • Transformers: 4.42.4
  • PyTorch: 2.3.1+cu121
  • Accelerate: 0.32.0
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1
  • Ranger21: 0.0.1
  • Selfies: 2.1.2
  • RDKit: 2024.3.3

Citation

If you find this project useful in your research and wish to cite it, please use the following BibTeX entry:

@software{chemfie_basebertmlm,
  author = {GP Bayu},
  title = {{ChemFIE Base}: Pretraining A Lightweight BERT-like model on Molecular SELFIES},
  url = {https://huggingface.co/gbyuvd/chemselfies-base-bertmlm},
  version = {1.0},
  year = {2024},
}

References

ChemBL34

@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}

COCONUTDB

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

SELFIES

@article{krenn2020selfies,
  title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
  author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
  journal={Machine Learning: Science and Technology},
  volume={1},
  number={4},
  pages={045024},
  year={2020},
  doi={10.1088/2632-2153/aba947}
}

Ranger21

@article{wright2021ranger21,
      title={Ranger21: a synergistic deep learning optimizer}, 
      author={Wright, Less and Demeure, Nestor},
      year={2021},
      journal={arXiv preprint arXiv:2106.13731},
}

MoleculeNet

@article{C7SC02664A,
      title  ={MoleculeNet: a benchmark for molecular machine learning},
      author ={Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay},
      journal  ={Chem. Sci.},
      year  ={2018},
      volume  ={9},
      issue  ={2},
      pages  ={513-530},
      publisher  ={The Royal Society of Chemistry},
      doi  ={10.1039/C7SC02664A},
      url  ={http://dx.doi.org/10.1039/C7SC02664A}
}

SMI-TED

@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language}, 
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267}, 
}

Contact & Support My Work

G Bayu (gbyuvd@proton.me)

This project has been quite a journey for me. I've dedicated many hours to it, and I would like to improve myself, this model, and future projects. However, financial and computational constraints can be challenging.

If you find my work valuable and would like to support my journey, please consider supporting me here. Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and explore more.

Thank you for checking out this model, I am more than happy to receive any feedback, so that I can improve myself and the future model/projects I will be working on.
