metadata
license: apache-2.0
tags:
  - sentence-transformers
  - modchembert
  - cheminformatics
  - smiles
  - molecular-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:19381001
  - loss:Matryoshka2dLoss
  - loss:MatryoshkaLoss
  - loss:TanimotoSentLoss
base_model: Derify/ModChemBERT-IR-BASE
widget:
  - source_sentence: COC(=O)c1sc(-c2ccc(C)cc2)c2c1NC(=O)C2(c1ccccc1)c1ccccc1
    sentences:
      - COC(=O)c1sc(Nc2ccc(Br)cn2)c2c1NC(=O)C2(c1ccccc1)c1ccccc1
      - CC[NH+]1CCOC(C(NN)c2ccccc2Br)C1
      - CC([NH2+]C(C)c1ccccc1)C(=O)P(C)C(C)(C)C
  - source_sentence: O=C(C=Cc1ccccc1)CC(=O)c1ccccc1O
    sentences:
      - COCCN(NCc1c(C)n(C(C)=O)c2ccc(OC)cc12)c1nccs1
      - CCN(CCC(N)=O)C(=O)c1ccc(=O)[nH]n1
      - N=CCC(=Cc1ccccc1)C(=O)COc1ccccc1O
  - source_sentence: COc1cccc(-c2sc3ccccc3c2C#N)c1
    sentences:
      - COCC(C)(C)c1cnnn1CCCI
      - N#Cc1c(-c2cccc(CN)c2)sc2ccccc12
      - COc1ccccc1NC(=O)c1cc(NCc2ccco2)cc[nH+]1
  - source_sentence: Nc1nc(-c2ccccc2)c2nc(N)c(N)nc2n1
    sentences:
      - >-
        CC(C)CC1NC(=O)C(Cc2ccccc2)NC(=O)c2ccc(cc2)CN(C(=O)CC2CCOCC2)CCCCNC(=O)C(C)NC1=O
      - O=Nc1cccc(OCCC(F)F)c1
      - CCCCNCc1nc(N)nc2nc(N)c(N)nc12
  - source_sentence: OCCCc1cc(F)cc(F)c1
    sentences:
      - CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1
      - Cc1[nH]c2c(C(N)=O)ccc(C(=O)N3CCCCC3)c2c1C
      - Fc1cc(F)cc(-n2cc[o+]n2)c1
datasets:
  - Derify/pubchem_10m_genmol_similarity
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - spearman
co2_eq_emissions:
  emissions: 6350.153020081601
  energy_consumed: 30.935740629629628
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: AMD Ryzen 7 3700X 8-Core Processor
  ram_total_size: 62.69887161254883
  hours_used: 116.388
  hardware_used: 2 x NVIDIA GeForce RTX 3090
model-index:
  - name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: pubchem 10m genmol similarity (validation)
          type: pubchem_10m_genmol_similarity_validation
        metrics:
          - type: spearman
            value: 0.989142152637452
            name: Spearman
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: pubchem 10m genmol similarity (test)
          type: pubchem_10m_genmol_similarity_test
        metrics:
          - type: spearman
            value: 0.9891625268496924
            name: Spearman

ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer

This is a Chem-MRL (sentence-transformers) model fine-tuned from Derify/ModChemBERT-IR-BASE on the pubchem_10m_genmol_similarity dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

Model Details

Model Description

  • Model Type: Chem-MRL (Sentence Transformer)
  • Base Model: Derify/ModChemBERT-IR-BASE
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions (Matryoshka dimensions: 1024, 512, 256, 128, 64, 32, 16, 8)
  • Similarity Function: Tanimoto
  • Training Dataset: pubchem_10m_genmol_similarity

Model Sources

  • Repository (Chem-MRL library): https://github.com/emapco/chem-mrl

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModChemBertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
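
For reference, the module stack above corresponds roughly to the following manual construction with the Sentence Transformers models API. This is only an illustrative sketch (it builds the stack from the base checkpoint, not the released weights); to use this model, load "Derify/ChemMRL" directly as shown in the Usage section below.

from sentence_transformers import SentenceTransformer, models

# Illustrative sketch: rebuild the Transformer -> Pooling -> Normalize stack by hand.
word_embedding = models.Transformer(
    "Derify/ModChemBERT-IR-BASE",
    max_seq_length=512,
    model_args={"trust_remote_code": True},
    tokenizer_args={"trust_remote_code": True},
)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 1024 for this backbone
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding, pooling, models.Normalize()])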

Usage

Direct Usage (Chem-MRL)

First install the Chem-MRL library:

pip install -U "chem-mrl>=0.7.3"
pip install -U "transformers>=4.56.1,<5.0.0"

Then you can load this model and run inference.

from chem_mrl import ChemMRL

# Download from the 🤗 Hub
model = ChemMRL(
    "Derify/ChemMRL",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3876, 0.0078],
#         [0.3876, 1.0000, 0.0028],
#         [0.0078, 0.0028, 1.0000]])
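
The scores above differ from the cosine scores in the next section because Chem-MRL scores embedding pairs with a real-valued Tanimoto similarity, T(a, b) = a·b / (||a||² + ||b||² - a·b). Below is a minimal PyTorch sketch of that formula; the helper name is ours, not part of the chem-mrl API.

import torch

def tanimoto_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise real-valued Tanimoto similarity between two batches of embeddings."""
    dot = a @ b.T                                # (n, m) pairwise dot products
    a_sq = (a * a).sum(dim=1, keepdim=True)      # (n, 1) squared norms
    b_sq = (b * b).sum(dim=1, keepdim=True).T    # (1, m) squared norms
    return dot / (a_sq + b_sq - dot)

# `embeddings` is the array returned by model.backbone.encode(...) above
embeddings_t = torch.as_tensor(embeddings, dtype=torch.float32)
print(tanimoto_similarity(embeddings_t, embeddings_t))

For unit-norm embeddings (this model ends with a Normalize module) the formula reduces to cos / (2 - cos), so the Tanimoto and cosine similarity matrices are monotonic transforms of each other.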

Direct Usage (Sentence Transformers)

Click to see the direct usage in Sentence Transformers

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer(
    "Derify/ChemMRL",
    # SentenceTransformer doesn't support tanimoto similarity natively so we set a different similarity function here
    similarity_fn_name="cosine",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5587, 0.0155],
#         [0.5587, 1.0000, 0.0055],
#         [0.0155, 0.0055, 1.0000]])
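
Because the model was trained with Matryoshka losses at dimensions 1024 down to 8 (see Training Details), embeddings can be truncated to a leading slice for cheaper storage and search. A minimal sketch using Sentence Transformers' truncate_dim; re-normalizing after truncation keeps cosine scores well-behaved.

from sentence_transformers import SentenceTransformer

# Request 256-dimensional Matryoshka embeddings directly
model = SentenceTransformer(
    "Derify/ChemMRL",
    truncate_dim=256,  # any of the trained dims: 1024, 512, 256, 128, 64, 32, 16, 8
    similarity_fn_name="cosine",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)

embeddings = model.encode(
    ["OCCCc1cc(F)cc(F)c1", "Fc1cc(F)cc(-n2cc[o+]n2)c1"],
    normalize_embeddings=True,  # re-normalize the truncated vectors
)
print(embeddings.shape)
# (2, 256)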

Evaluation

Metrics

Semantic Similarity

  • Datasets: pubchem_10m_genmol_similarity (validation and test splits)
  • Evaluated with chem_mrl.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator with these parameters:
    {
        "precision": "float32"
    }
    
Split Metric Value
validation spearman 0.98914
test spearman 0.98916
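
The reported scores come from chem_mrl's EmbeddingSimilarityEvaluator. Below is a rough sketch of re-running a comparable evaluation with the stock Sentence Transformers evaluator; the split name and subset size are assumptions made for the sketch.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

model = SentenceTransformer("Derify/ChemMRL", trust_remote_code=True)

# Split name and subset size are assumptions for this sketch
pairs = load_dataset("Derify/pubchem_10m_genmol_similarity", split="test").select(range(10_000))
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=pairs["smiles_a"],
    sentences2=pairs["smiles_b"],
    scores=pairs["label"],
    main_similarity=SimilarityFunction.COSINE,
    name="pubchem_10m_genmol_similarity_test",
)
print(evaluator(model))

Since the real-valued Tanimoto score is a monotone transform of cosine similarity for unit-norm embeddings, the Spearman correlation is unaffected by the choice between the two.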

Training Details

Training Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 19,381,001 training samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
    • smiles_a (string): min 17 tokens, mean 42.36 tokens, max 122 tokens
    • smiles_b (string): min 11 tokens, mean 40.93 tokens, max 122 tokens
    • label (float): min 0.02, mean 0.56, max 1.0
  • Samples:
    • smiles_a: COc1ccc(NC(=O)C2CC[NH+](C(C)C(=O)Nc3ccc(C(=O)Nc4ccc(F)c(F)c4)cc3C)CC2)cc1NC(=O)C1CCCCC1
      smiles_b: Cc1cc(C(=O)Nc2ccc(F)c(F)c2)ccc1NC(=O)C(C)[NH+]1CCC(C(=O)Nc2cccc(NC(=O)C3CCCCC3)c2)CC1
      label: 0.8495575189590454
    • smiles_a: OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1
      smiles_b: OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1
      label: 0.6615384817123413
    • smiles_a: CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1
      smiles_b: CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1
      label: 0.7123287916183472
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.0,
        "kl_temperature": 0.0,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
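
The configuration above maps onto the Sentence Transformers loss API roughly as follows. Note that TanimotoSentLoss ships with the chem-mrl package rather than sentence-transformers; CoSENTLoss is used as a stand-in below, so treat this as a sketch rather than the exact training code.

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, Matryoshka2dLoss

model = SentenceTransformer("Derify/ModChemBERT-IR-BASE", trust_remote_code=True)

# CoSENTLoss stands in for chem-mrl's TanimotoSentLoss (a CoSENT-style loss built
# around Tanimoto similarity); the remaining parameters mirror the config above.
base_loss = CoSENTLoss(model)
loss = Matryoshka2dLoss(
    model,
    base_loss,
    matryoshka_dims=[1024, 512, 256, 128, 64, 32, 16, 8],
    matryoshka_weights=[1, 1, 1, 1, 1, 1, 1, 1],
    n_layers_per_step=-1,
    n_dims_per_step=-1,
    last_layer_weight=2.0,
    prior_layers_weight=1.0,
    kl_div_weight=0.0,
    kl_temperature=0.0,
)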
    

Evaluation Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 1,080,394 evaluation samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
    • smiles_a (string): min 16 tokens, mean 42.05 tokens, max 101 tokens
    • smiles_b (string): min 11 tokens, mean 40.23 tokens, max 104 tokens
    • label (float): min 0.0, mean 0.57, max 1.0
  • Samples:
    • smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)C1CCCC1
      label: 0.8600000143051147
    • smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)C1CCOCC1
      label: 0.7962962985038757
    • smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)CC(F)F
      label: 0.5517241358757019
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.0,
        "kl_temperature": 0.0,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • max_grad_norm: None
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_steps: 100943
  • data_seed: 42
  • bf16: True
  • bf16_full_eval: True
  • tf32: True
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • gradient_checkpointing: True
  • eval_on_start: True
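
For reference, these non-default values correspond roughly to the following SentenceTransformerTrainingArguments sketch (output_dir is a placeholder; everything else echoes the list above):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="chem-mrl-output",  # placeholder
    num_train_epochs=3,
    eval_strategy="steps",
    per_device_train_batch_size=192,
    per_device_eval_batch_size=512,
    learning_rate=8e-6,
    weight_decay=1e-5,
    max_grad_norm=None,
    lr_scheduler_type="warmup_stable_decay",
    lr_scheduler_kwargs={"num_decay_steps": 100943, "warmup_type": "linear", "decay_type": "1-sqrt"},
    warmup_steps=100943,
    data_seed=42,
    bf16=True,
    bf16_full_eval=True,
    tf32=True,
    optim="stable_adamw",
    optim_args="decouple_lr=True,max_lr=8.0e-6",
    gradient_checkpointing=True,
    eval_on_start=True,
)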

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: None
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_ratio: 0.0
  • warmup_steps: 100943
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: 42
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: True
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Click to expand
Epoch    Step    Training Loss    Validation Loss    Validation Spearman
0 0 - 297.6136 0.7261
0.0000 1 244.6862 - -
0.2477 25000 161.5037 - -
0.2500 25235 - 195.4624 0.9067
0.4978 50250 155.7822 - -
0.5000 50470 - 189.4068 0.9655
0.7479 75500 152.7915 - -
0.7500 75705 - 186.3661 0.9780
0.9981 100750 151.0411 - -
1.0000 100940 - 184.6362 0.9829
1.2482 126000 149.8544 - -
1.2500 126175 - 183.5648 0.9855
1.4984 151250 149.2916 - -
1.5000 151410 - 182.8947 0.9868
1.7485 176500 148.7942 - -
1.7499 176645 - 182.3662 0.9879
1.9987 201750 148.3459 - -
1.9999 201880 - 181.9855 0.9885
2.2488 227000 148.0316 - -
2.2499 227115 - 181.7683 0.9889
2.4989 252250 147.8658 - -
2.4999 252350 - 181.6711 0.9890
2.7491 277500 147.9642 - -
2.7499 277585 - 181.6077 0.9891
2.9992 302750 147.8874 - -
2.9999 302820 - 181.6066 0.9891
3.0000 302829 - - 0.98914

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 30.936 kWh
  • Carbon Emitted: 6.350 kg of CO2
  • Hours Used: 116.388 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 2 x NVIDIA GeForce RTX 3090
  • CPU Model: AMD Ryzen 7 3700X 8-Core Processor
  • RAM Size: 62.70 GB

Framework Versions

  • Python: 3.13.7
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 4.3.0
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Matryoshka2dLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings},
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}

TanimotoSentLoss

@online{cortes-2025-tanimotosentloss,
    title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
    author={Emmanuel Cortes},
    year={2025},
    month={Jan},
    url={https://github.com/emapco/chem-mrl},
}

Model Card Authors

@eacortes

Model Card Contact

Manny Cortes (manny@derifyai.com)