Molecule Generator: LSTM on SMILES

Repository: huggingface.co/GioFilo93/Molecule-Generation_LSTM-Based
Task: De novo molecule generation (SMILES)
Objective: High validity, uniqueness, and novelty, with a competitive Fréchet ChemNet Distance (FCD)
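
The first three objectives can be computed with a few set operations. The sketch below follows the common benchmark convention (validity over all samples, uniqueness over valid samples, novelty over unique valid samples); the card does not state which exact definitions were used, so treat these as an assumption. The function name `generation_metrics` and the `is_valid` callback are hypothetical. In practice `is_valid` would typically wrap RDKit (for example, checking that `Chem.MolFromSmiles(s)` is not `None`), and SMILES would be canonicalized before comparison. FCD needs a pretrained ChemNet and is not covered here.

```python
def generation_metrics(generated, training_smiles, is_valid):
    """Return validity, uniqueness and novelty for a list of SMILES strings.

    Assumption: strings are compared as-is. Canonicalize them first
    (e.g. with RDKit) if different spellings of one molecule should match.
    """
    if not generated:
        raise ValueError("generated must contain at least one SMILES string")

    valid = [s for s in generated if is_valid(s)]   # parseable samples
    unique = set(valid)                             # de-duplicated valid samples
    novel = unique - set(training_smiles)           # not seen during training

    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    # Toy validator (balanced parentheses only), a stand-in for a real parser.
    toy_is_valid = lambda s: s.count("(") == s.count(")")
    samples = ["CCO", "CCO", "C(C", "c1ccccc1"]
    print(generation_metrics(samples, ["CCO"], toy_is_valid))
```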


Model Index

  • Model type: Char-level LSTM language model on SMILES
  • Library: PyTorch
  • Languages: SMILES (chemical string notation)
  • License: (specify, e.g., Apache-2.0)
  • Tags: chemistry drug-discovery generative-model smiles lstm rdkit

Model description

A lightweight character-level SMILES generator trained with teacher forcing.

  • Tokenizer: Raw char SMILES with special tokens (PAD, BOS, EOS).
  • Backbone: 2-layer LSTM, hidden size 512, dropout 0.3.
  • Head: Linear → Softmax over the SMILES vocabulary.
  • Decoding: Multinomial sampling with temperature; optional top-k or top-p nucleus sampling.
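
To make the tokenizer and teacher-forcing setup concrete, here is a minimal sketch of a raw character vocabulary with PAD, BOS and EOS tokens, plus the input/target shift used in teacher forcing. `ToyCharVocab` and `teacher_forcing_pair` are hypothetical names for illustration; the repository's actual `CharVocab` may differ in token spelling, ID order and API. Note that raw character splitting breaks two-letter atoms such as `Cl` into `C` and `l`, which is expected for this kind of tokenizer.

```python
class ToyCharVocab:
    """Illustrative character-level vocabulary (not the repository's CharVocab)."""

    PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"

    def __init__(self, smiles_corpus):
        # Special tokens take the first three IDs; characters follow in sorted order.
        chars = sorted({ch for smi in smiles_corpus for ch in smi})
        self.itos = [self.PAD, self.BOS, self.EOS] + chars
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.pad_id, self.bos_id, self.eos_id = 0, 1, 2

    def __len__(self):
        return len(self.itos)

    def encode(self, smiles):
        # One ID per character; raises KeyError on characters not in the corpus.
        return [self.stoi[ch] for ch in smiles]

    def decode(self, ids):
        # Drop special tokens and join the remaining characters.
        specials = {self.pad_id, self.bos_id, self.eos_id}
        return "".join(self.itos[i] for i in ids if i not in specials)


def teacher_forcing_pair(vocab, smiles):
    """Inputs start with BOS; targets are the same sequence shifted left, ending in EOS."""
    ids = vocab.encode(smiles)
    return [vocab.bos_id] + ids, ids + [vocab.eos_id]


if __name__ == "__main__":
    vocab = ToyCharVocab(["CCO", "c1ccccc1"])
    inputs, targets = teacher_forcing_pair(vocab, "CCO")
    print(inputs, targets)
```

At each position the model sees `inputs[t]` (the ground-truth previous character) and is trained to predict `targets[t]`, typically with cross-entropy loss that ignores PAD positions.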

Why this model:
It is a fast baseline that is easy to extend with conditional generation or RL fine-tuning, and it is well suited to ideation and dataset augmentation.


Intended uses & limitations

Intended uses

  • Generate unconstrained SMILES for chemical ideation.
  • Create augmentation corpora for downstream property models.
  • Provide a baseline for comparing advanced models (transformers, RL, diffusion).

Limitations

  • No explicit guarantees on synthesizability, safety, or ADMET.
  • Not conditioned on properties or targets by default.
  • Valid SMILES ≠ chemically feasible or safe molecules.

โš ๏ธ Do not deploy outputs directly to wet-lab without expert review.


How to use

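You can download the model weights directly from the Hub and load them as shown below. SmilesLSTM and CharVocab are imported from local model and tokenizer modules, which must be on your Python path: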

import torch
from huggingface_hub import hf_hub_download
from model import SmilesLSTM
from tokenizer import CharVocab

# Download model file (.pth) from Hugging Face
model_path = hf_hub_download(
    repo_id="GioFilo93/Molecule-Generation_LSTM-Based",
    filename="lstm_molecule-generation.pth"
)

# Load tokenizer (must be the same vocab used in training)
# Example: vocab.json or hardcoded CharVocab
vocab = CharVocab.load("vocab.json")

# Init model
model = SmilesLSTM(
    vocab_size=len(vocab), emb_dim=128,
    hidden_size=512, num_layers=2, dropout=0.3
)
state = torch.load(model_path, map_location="cpu")
model.load_state_dict(state)
model.eval()

# Sampling
@torch.no_grad()
def sample(n=5, max_len=120, temperature=0.9):
    out = []
    for _ in range(n):
        token = torch.tensor([vocab.bos_id])
        seq, h = [], None
        for _ in range(max_len):
            logits, h = model.step(token, h)
            probs = torch.softmax(logits.squeeze(0) / temperature, dim=-1)
            token = torch.multinomial(probs, 1)
            if token.item() == vocab.eos_id:
                break
            seq.append(token.item())
        out.append(vocab.decode(seq))
    return out

print(sample(n=5, temperature=0.8))
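
The sampling loop above applies temperature only. The optional top-k and top-p (nucleus) filtering mentioned under Decoding could look roughly like the sketch below. It is written in plain Python on a list of logits so that it runs without PyTorch; all function names are hypothetical, and this is not taken from the repository's code. In the loop above, the same filtering would be applied to `probs` before the multinomial draw.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, top_k=None, top_p=None):
    """Zero out tokens outside the top-k and/or nucleus set, then renormalize.

    Assumes top_k >= 1 and 0 < top_p <= 1 when given. If both are set,
    top-k is applied first and the nucleus is taken within that subset.
    """
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]
    if top_p is not None:
        cumulative, nucleus = 0.0, []
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix reaching mass top_p
                break
        keep = nucleus
    kept = set(keep)
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token ID from the filtered distribution."""
    probs = filter_probs(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    print(filter_probs([0.5, 0.3, 0.15, 0.05], top_k=2))
    print(sample_token([0.1, 2.0, -1.0], temperature=0.8, top_p=0.9))
```

Lower `top_k` or `top_p` values cut off the low-probability tail, which tends to raise validity at the cost of diversity.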