Molecule Generator — LSTM on SMILES

Repository: huggingface.co/GioFilo93/Molecule-Generation_LSTM-Based
Task: De novo molecule generation (SMILES)
Objective: High validity, uniqueness, and novelty, with a competitive Fréchet ChemNet Distance (FCD)

Model Index

Model type: Char-level LSTM language model on SMILES
Library: PyTorch
Languages: SMILES (chemical string notation)
License: (specify, e.g., Apache-2.0)
Tags: chemistry, drug-discovery, generative-model, smiles, lstm, rdkit

Model description

A lightweight character-level SMILES generator trained with teacher forcing.
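Teacher forcing means the network is fed the ground-truth previous character at every step and is asked to predict the next one. The sketch below shows how one training pair could be built from a SMILES string. The token spellings `<pad>`, `<bos>`, `<eos>` and the helper names `build_vocab` and `teacher_forcing_pair` are illustrative assumptions, not the repository's tokenizer.

```python
from typing import Dict, List, Tuple

# Hypothetical spellings for the three special tokens
SPECIALS = ["<pad>", "<bos>", "<eos>"]


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map special tokens and every character seen in the corpus to an id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}


def teacher_forcing_pair(
    smiles: str, vocab: Dict[str, int], max_len: int
) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets), where targets are inputs shifted left by one."""
    ids = [vocab[ch] for ch in smiles]  # raw characters: "Cl" becomes 'C', 'l'
    inputs = [vocab["<bos>"]] + ids     # the model sees BOS, then the true prefix
    targets = ids + [vocab["<eos>"]]    # and must predict the next char, then EOS
    if len(inputs) > max_len:
        raise ValueError("SMILES is longer than max_len - 1 characters")
    pad = [vocab["<pad>"]] * (max_len - len(inputs))
    return inputs + pad, targets + pad


vocab = build_vocab(["CCO", "c1ccccc1"])
inputs, targets = teacher_forcing_pair("CCO", vocab, max_len=6)
print(inputs)   # [1, 4, 4, 5, 0, 0]
print(targets)  # [4, 4, 5, 2, 0, 0]
```

During training, padded positions would typically be excluded from the loss, for example by passing the PAD id as the ignored index of a cross-entropy loss.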

Tokenizer: Character-level SMILES vocabulary with special tokens (PAD, BOS, EOS).
Backbone: 2-layer LSTM, hidden size 512, dropout 0.3.
Head: Linear → Softmax over the SMILES vocabulary.
Decoding: Multinomial sampling with temperature; optional top-k or top-p nucleus sampling.
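The decoding options can be illustrated without any framework. The following is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. The function names are hypothetical and this is not the repository's sampling code; on tensors, the same masking would be applied to the probabilities before the multinomial draw.

```python
import math
import random
from typing import List, Optional, Sequence, Set


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs: Sequence[float], keep: Set[int]) -> List[float]:
    """Zero every index outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def filter_probs(
    probs: Sequence[float],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> List[float]:
    """Apply top-k, then top-p, renormalizing after each step."""
    probs = list(probs)
    if top_k is not None:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        probs = _renormalize(probs, set(order[: max(1, top_k)]))
    if top_p is not None:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        keep, cumulative = set(), 0.0
        for i in order:  # smallest prefix whose mass reaches top_p
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = _renormalize(probs, keep)
    return probs


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random) -> int:
    """Draw one token id from the filtered distribution."""
    probs = filter_probs(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Setting `top_k=1` reduces this to greedy decoding. Larger `top_k` or `top_p` values trade validity for diversity, so they are usually tuned together with the temperature.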

Why this model:
A fast baseline that is easy to extend with conditional generation or RL fine-tuning, and useful for ideation and dataset augmentation.

Intended uses & limitations

Intended uses

Generate unconstrained SMILES for chemical ideation.
Create augmentation corpora for downstream property models.
Provide a baseline for comparing advanced models (transformers, RL, diffusion).

Limitations

No explicit guarantees on synthesizability, safety, or ADMET.
Not conditioned on properties or targets by default.
Valid SMILES ≠ chemically feasible or safe molecules.

⚠️ Do not take generated molecules into the wet lab without expert review.

How to use

You can download and load the model weights directly from the Hub:

import torch
from huggingface_hub import hf_hub_download
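# SmilesLSTM and CharVocab are assumed to be this project's own classes,
# so model.py and tokenizer.py must be on the import path
from model import SmilesLSTM
from tokenizer import CharVocab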

# Download model file (.pth) from Hugging Face
model_path = hf_hub_download(
    repo_id="GioFilo93/Molecule-Generation_LSTM-Based",
    filename="lstm_molecule-generation.pth"
)

# Load tokenizer (must be the same vocab used in training)
# Example: vocab.json or hardcoded CharVocab
vocab = CharVocab.load("vocab.json")

# Init model
model = SmilesLSTM(
    vocab_size=len(vocab), emb_dim=128,
    hidden_size=512, num_layers=2, dropout=0.3
)
state = torch.load(model_path, map_location="cpu")
model.load_state_dict(state)
model.eval()

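# Sampling
@torch.no_grad()
def sample(n=5, max_len=120, temperature=0.9):
    """Draw n SMILES strings by sampling one character at a time."""
    out = []
    for _ in range(n):
        # Start every sequence from BOS with an empty recurrent state
        token = torch.tensor([vocab.bos_id])
        seq, h = [], None
        for _ in range(max_len):
            # model.step is expected to return next-token logits and the updated state
            logits, h = model.step(token, h)
            # Temperature < 1 sharpens the distribution, > 1 flattens it
            probs = torch.softmax(logits.squeeze(0) / temperature, dim=-1)
            token = torch.multinomial(probs, 1)
            if token.item() == vocab.eos_id:
                break  # EOS ends the molecule and is not emitted
            seq.append(token.item())
        out.append(vocab.decode(seq))
    return out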

print(sample(n=5, temperature=0.8))
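To check samples against the stated objective, validity, uniqueness, and novelty can be computed from canonical SMILES. The sketch below follows one common convention (uniqueness over valid molecules, novelty over unique valid molecules); other benchmarks define the denominators differently. The names `generation_metrics` and `rdkit_canonicalize` are hypothetical, and FCD is not covered because it requires a pretrained ChemNet model.

```python
from typing import Callable, Dict, Iterable, Optional

Canonicalizer = Callable[[str], Optional[str]]


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # lazy import: only needed when this helper is used

    if not smiles:
        return None  # an empty string is not a molecule
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def generation_metrics(
    generated: Iterable[str],
    training_smiles: Iterable[str],
    canonicalize: Canonicalizer,
) -> Dict[str, float]:
    """Validity, uniqueness, and novelty of a batch of generated SMILES."""
    generated = list(generated)
    # Canonicalize so that different spellings of one molecule compare equal
    valid = [c for c in map(canonicalize, generated) if c is not None]
    unique = set(valid)
    train = {c for c in map(canonicalize, training_smiles) if c is not None}
    novel = unique - train
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

With RDKit installed, a call such as `generation_metrics(sample(n=1000), train_smiles, rdkit_canonicalize)` would score a batch, where `train_smiles` stands for the list of training SMILES and is not provided here.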
