MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a conditional variational autoencoder over SELFIES representations of molecules, with about 42 million parameters (41,966,682), trained on 7,116,053 molecules curated from five public chemistry databases (Molport, ChEMBL, and ZINC for broad chemical coverage, plus electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space that you can sample, traverse, and optimize, and because it decodes SELFIES, essentially 100% of generated strings are valid molecules (measured validity 1.000). It is purpose-built for de-novo battery-electrolyte design: it generates candidate solvents and additives across chemistries (Li / Na / K / Mg / Zn / …) and ranks them with a paired electrolyte property model grounded in real electrolyte data.

  • Code / library: https://github.com/NealKapadia/molforge

  • Weights (this repo): checkpoints/best.pt

  • Architecture: embedding 512 → bidirectional GRU encoder (1024 × 2 layers) → latent 256 → GRU decoder (1024 × 2 layers), conditioned on 11 RDKit descriptors, with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps the decoder from ignoring the latent code, so the latent space stays informative. SELFIES robust alphabet, 79 tokens, max length 120.

  • Training data — 7,116,053 molecules from five public databases (filtered to 3–60 heavy atoms and an organic element set, then de-duplicated):

    Database               Molecules   Role
    Molport "All Stock"    6,088,143   core corpus of purchasable molecules
    ChEMBL-37 (sample)       800,000   bioactive chemical diversity
    ZINC                     227,902   additional lead-like diversity
    Total                  7,116,053   generative training set

    OEDB and CALiSol-23 additionally provide the electrolyte solvents and 18,918 electrolyte formulations (conductivity, coordination, viscosity) that train the separate property model. The generator was trained with the default SELFIES constraints (S=6 / P=5 allowed) so that sulfonyl and phosphate electrolyte motifs survive an encode/decode round trip.

  • Selected checkpoint: best.pt, chosen as the checkpoint with the highest val_token_acc + 0.25·valid_rate.
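
The checkpoint-selection rule in the last bullet can be sketched in a few lines. This is a minimal illustration, not the training script's code: the `EpochMetrics` record, the file names, and the metric values are hypothetical, and only the score formula comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class EpochMetrics:
    """Validation metrics for one saved checkpoint (hypothetical record layout)."""
    path: str
    val_token_acc: float  # fraction of validation tokens reconstructed correctly
    valid_rate: float     # fraction of sampled molecules that are valid


def selection_score(m: EpochMetrics, valid_weight: float = 0.25) -> float:
    # Score stated in the model card: val_token_acc + 0.25 * valid_rate
    return m.val_token_acc + valid_weight * m.valid_rate


def pick_best(history):
    # Return the record with the highest combined score.
    return max(history, key=selection_score)


# Made-up numbers, for illustration only.
history = [
    EpochMetrics("epoch_01.pt", 0.950, 1.000),
    EpochMetrics("epoch_02.pt", 0.990, 0.960),
    EpochMetrics("epoch_03.pt", 0.985, 1.000),
]
best = pick_best(history)
print(best.path, round(selection_score(best), 4))
```

Weighting validity at 0.25 means reconstruction accuracy dominates the choice, while a checkpoint that starts emitting invalid samples is still penalized.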

Conditioning properties (fixed order)

MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount
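
Because the order is fixed, a conditioning vector has to be assembled position by position. The sketch below assumes z-score normalization with per-descriptor mean and standard deviation, as suggested by processed/descriptor_stats.json; the `{"mean": ..., "std": ...}` layout and the function name are assumptions, not the library's confirmed format.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition_vector(values, stats, order=PROPERTY_ORDER):
    """Z-score each descriptor and return the values in the fixed order.

    values: dict mapping descriptor name -> raw value
    stats:  dict mapping descriptor name -> {"mean": float, "std": float}
            (assumed layout, see lead-in)
    """
    vec = []
    for name in order:
        mean = stats[name]["mean"]
        std = stats[name]["std"]
        # A constant descriptor (std == 0) carries no signal; map it to 0.
        vec.append((values[name] - mean) / std if std > 0 else 0.0)
    return vec


# Toy statistics and values, chosen only to make the arithmetic easy to follow.
toy_stats = {name: {"mean": 10.0, "std": 2.0} for name in PROPERTY_ORDER}
toy_values = {name: 14.0 for name in PROPERTY_ORDER}
cond = build_condition_vector(toy_values, toy_stats)
print(len(cond), cond[0])
```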

Evaluation (best.pt, 5,000 samples @ temperature 0.9)

Metric                        Value
Validity                      1.000
Uniqueness                    0.998
Novelty (vs. training set)    0.995
Internal diversity            0.894
Reconstruction (exact)        0.945
Reconstruction (token acc)    0.998
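
For readers unfamiliar with these columns, here is a minimal sketch of how such metrics are commonly computed. The conventions are assumptions, since the card does not state which definitions it uses: uniqueness is taken over valid samples, novelty over unique samples, and internal diversity as one minus the mean pairwise Tanimoto similarity. Fingerprints are represented as plain sets of on-bit indices; in practice they would come from a cheminformatics toolkit, and the strings would be canonicalized first.

```python
from itertools import combinations


def generation_metrics(samples, training_set, is_valid):
    """Validity, uniqueness, and novelty for a list of generated strings."""
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    n = len(samples)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy inputs: "bad" stands in for a string that fails validation.
toy_samples = ["CCO", "CCO", "CCN", "bad", "c1ccccc1"]
toy_metrics = generation_metrics(toy_samples, {"CCO"}, lambda s: s != "bad")
toy_diversity = internal_diversity([{1, 2}, {2, 3}, {4}])
print(toy_metrics, toy_diversity)
```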

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962, NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
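
As a reminder of what these numbers measure, R² is one minus the ratio of the residual sum of squares to the total sum of squares on the held-out set. The helper below is a generic sketch of that standard definition, not the project's evaluation code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# A perfect predictor scores 1; always predicting the mean scores 0.
targets = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(targets, targets)
baseline = r2_score(targets, [2.5, 2.5, 2.5, 2.5])
print(perfect, baseline)
```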

On the standard generative benchmark columns (validity / uniqueness / novelty / diversity) this model is competitive with — and on several columns exceeds — the autoregressive ElectrolyteGPT (Kim et al., JACS Au, 2026, 6, 2288–2302). The structural advantage is the latent space: smooth interpolation and gradient-based property optimization, which a left-to-right token model does not offer.

How MolForge differs from existing models

  • A latent space, not left-to-right text generation. Autoregressive models (ElectrolyteGPT, MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space that you can interpolate and optimize with gradients (e.g. "increase molecular weight by 10 while keeping everything else fixed"), which a left-to-right token model does not offer.
  • Validity by construction. Decoding SELFIES yields essentially 100% valid molecules (measured 1.000), versus SMILES models that emit invalid strings.
  • A full inverse-design system, not just a generator. The generator is paired with a predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked, scored candidate list.
  • Electrolyte-formulation awareness. Conductivity, coordination, and viscosity are system properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and CALiSol-23 data — most molecule generators ignore this.
  • Multi-database breadth. Trained across five public databases, not a single catalog.
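
The two latent-space operations named in the first bullet, interpolation and gradient-based optimization, can be illustrated with a toy example. This sketch uses plain Python lists and a hypothetical linear property head with hand-written gradients. The real model has a 256-dimensional latent and a learned head that would be differentiated with an autograd framework, so treat the function names and numbers here as assumptions.

```python
def interpolate(z0, z1, steps):
    """Linear interpolation between two latent vectors, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z0, z1)])
    return path


def optimize_toward(z, weights, bias, target, lr=0.1, n_steps=50):
    """Gradient descent on (w.z + b - target)^2 with respect to z.

    The head is a toy linear model; its gradient is 2 * err * w.
    """
    z = list(z)
    for _ in range(n_steps):
        pred = sum(w * x for w, x in zip(weights, z)) + bias
        err = pred - target
        z = [x - lr * 2 * err * w for x, w in zip(z, weights)]
    return z


# Toy 4-dimensional latent; the last weight is zero, so that coordinate
# does not influence the property and is left untouched.
w = [0.5, -0.5, 1.0, 0.0]
z_opt = optimize_toward([0.0, 0.0, 0.0, 0.0], w, bias=0.0, target=2.0)
pred_opt = sum(wi * zi for wi, zi in zip(w, z_opt))
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
print(pred_opt, path)
```

Each point on an interpolation path, or the optimized vector, would then be passed to the decoder to obtain a molecule. A purely autoregressive model has no comparable vector to move along.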

Files

checkpoints/best.pt              # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json             # SELFIES token vocabulary
processed/descriptor_stats.json  # descriptor normalization (mean/std)
processed/meta.json              # vocab size, max length, property order, constraints

This is exactly the layout the molforge library expects under MOLVAE_ART_DIR.
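
Before pointing the library at a download, it can help to confirm that the directory matches this layout. The check below is a small sketch using only the standard library. Treating every file except the electrolyte model as required is an assumption based on the "optional" note in the listing, and `missing_artifacts` is a hypothetical helper, not part of molforge.

```python
import tempfile
from pathlib import Path

REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(art_dir):
    """Return the required files that are absent under art_dir."""
    root = Path(art_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Demo on a throwaway directory that deliberately lacks processed/meta.json.
with tempfile.TemporaryDirectory() as tmp:
    for rel in REQUIRED[:-1]:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

In real use, the argument would be the directory named by MOLVAE_ART_DIR or the path returned by the download step.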

Usage

# Shell: install the library
pip install "git+https://github.com/NealKapadia/molforge.git"

# Python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 valid, novel SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted
z = mf.encode("OCCN(CCO)CCO"); mf.decode(z)       # latent round-trip
mf.properties("CCO")                              # RDKit descriptors

Alternatively, instead of passing artifacts_dir=, set the MOLVAE_ART_DIR environment variable: export MOLVAE_ART_DIR=/path/to/download (Windows PowerShell: $env:MOLVAE_ART_DIR="...").

Limitations & intended use

  • Research / educational use for molecular design and screening — not a substitute for experimental validation, synthesis feasibility, or safety assessment.
  • Soft conditioning: spec targets nudge generation toward a value; they are not exact. For hard constraints, over-generate and filter by RDKit-computed properties.
  • The generator covers a broad space of small-to-medium organic molecules; very small electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for tight electrolyte focus, specialize via fine-tuning plus the electrolyte property model.
  • The electrolyte property model has labeled data for Li / Na / K only; the generator proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs additional labeled data.
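
The over-generate-and-filter advice for hard constraints can be sketched as a small loop. The function takes the generator and the property calculator as callables so that it runs on its own. With MolForge, one would presumably pass `lambda n, spec: mf.generate(n, spec=spec)` and `mf.properties`, assuming those return a list of SMILES strings and a dict keyed by descriptor name (the return types are not confirmed here). The stub functions and tolerance values are purely illustrative.

```python
import itertools


def generate_within(generate, properties, spec, tolerances,
                    n_wanted, batch=64, max_batches=20):
    """Over-generate with soft conditioning, then keep only candidates whose
    computed properties fall within the given tolerance of each target."""
    kept, seen = [], set()
    for _ in range(max_batches):
        for smi in generate(batch, spec):
            if smi in seen:
                continue  # skip duplicates across batches
            seen.add(smi)
            props = properties(smi)
            if all(abs(props[k] - spec[k]) <= tolerances[k] for k in spec):
                kept.append(smi)
                if len(kept) == n_wanted:
                    return kept
    return kept  # may be shorter than n_wanted if the budget runs out


# Stand-ins for the generator and descriptor calculator (hypothetical).
_counter = itertools.count()


def fake_generate(n, spec):
    return [f"MOL{next(_counter)}" for _ in range(n)]


def fake_properties(smi):
    return {"MolWt": 240.0 + int(smi[3:])}


hits = generate_within(fake_generate, fake_properties,
                       spec={"MolWt": 250}, tolerances={"MolWt": 2},
                       n_wanted=3, batch=8)
print(hits)
```

Capping the number of batches keeps the loop from running indefinitely when the target region is sparsely populated.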

Training data & license

Trained on structures from the Molport "All Stock" catalog, which is licensed CC BY-NC 4.0 (Attribution–NonCommercial). Because these weights are a derivative of that data, they are released under the same CC BY-NC 4.0 license:

  • Attribution: you must credit Molport as the source of the training data.
  • NonCommercial: these weights may not be used for commercial purposes.

The MolForge source code (https://github.com/NealKapadia/molforge) is released under the same CC BY-NC 4.0 license.

Citation

If you use MolForge, please cite this repository and the SELFIES paper (Krenn et al., Mach. Learn.: Sci. Technol. 2020).
