Molexar-10M Omni

Molexar-10M Omni is the universal multi-condition model of Molexar, a unified multimodal molecular foundation model for drug design. It is initialized from fairydance/molexar-10m-base and trained with supervised fine-tuning (SFT) to generate Fragment-SELFIES molecules under four kinds of conditions: scalar molecular properties, pharmacophore fingerprints, protein sequences, and protein pockets.

This model corresponds to the Universal Multi-Condition Model described in the Molexar paper.

Model Details

| Field | Value |
| --- | --- |
| Model family | Molexar molecular causal language model |
| Architecture | Gemma2-style decoder with value-token embedding replacement for conditions |
| Base model | fairydance/molexar-10m-base |
| LM component parameters | 10,534,912 |
| Total model parameters | 14,756,261 |
| Layers | 16 |
| Hidden size | 256 |
| Intermediate size | 640 |
| Attention heads | 4 query heads, 1 key-value head |
| Vocabulary size | 127 |
| Context length | 256 tokens |
| Sliding window | 128 tokens |
| Molecular language | Fragment-SELFIES |
| Model files | config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json, training_args.bin |

Parameter counts are unique nn.Parameter counts with tied token-embedding/LM-head weights counted once. The LM component includes the token embeddings, Gemma2-style decoder, final normalization, and tied output head; the total additionally includes condition encoders and the pocket GVP encoder.
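As a sanity check on the counting convention above, the LM component figure can be reproduced from the listed hyperparameters. This is a minimal sketch, not code from the Molexar repository. It assumes details that this card does not state: a head dimension of 64 (hidden size divided by the number of query heads), bias-free linear layers, a gated MLP with three projections, and four RMSNorm weight vectors per decoder layer, as in the reference Gemma2 design.

```python
# Hyperparameters taken from the Model Details section of this card.
VOCAB = 127
HIDDEN = 256
INTERMEDIATE = 640
LAYERS = 16
Q_HEADS = 4
KV_HEADS = 1
TOTAL_PARAMS = 14_756_261

# Assumptions (not stated in this card): head_dim = HIDDEN // Q_HEADS,
# no bias terms, gated MLP with three projections, four RMSNorms per layer.
HEAD_DIM = HIDDEN // Q_HEADS
NORMS_PER_LAYER = 4


def lm_parameter_count() -> int:
    """Count unique LM parameters with the tied embedding/LM head counted once."""
    embedding = VOCAB * HIDDEN
    attention = (
        HIDDEN * Q_HEADS * HEAD_DIM          # q_proj
        + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # k_proj and v_proj
        + Q_HEADS * HEAD_DIM * HIDDEN        # o_proj
    )
    mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up, and down projections
    norms = NORMS_PER_LAYER * HIDDEN
    final_norm = HIDDEN
    return embedding + LAYERS * (attention + mlp + norms) + final_norm


LM_PARAMS = lm_parameter_count()
print(LM_PARAMS)                 # 10534912
print(TOTAL_PARAMS - LM_PARAMS)  # 4221349
```

Under these assumptions the result matches the reported 10,534,912, and the remaining 4,221,349 parameters are the ones attributed to the condition encoders and the pocket GVP encoder.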

Molexar uses a shared sequence template for pretraining, SFT, and inference:

<BOS><COND> conditions </COND><SEP><MOL> molecule </MOL><EOS>

The condition block contains ordered key-token/value-token pairs. During conditional generation, selected <VALUE> token embeddings are replaced in place by encoded condition vectors. This keeps all generation modes on the same autoregressive decoding path and remains compatible with key-value-cache generation.
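The in-place replacement described above can be illustrated with a toy example. The sketch below is not the Molexar implementation: the key-token spelling, the helper names `build_prompt` and `inject_conditions`, and the 4-dimensional toy embeddings are all hypothetical. It only shows why the sequence length stays fixed, which is the property that keeps the standard autoregressive path and the key-value cache usable.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def build_prompt(condition_keys: Sequence[str]) -> List[str]:
    """Lay out the prompt prefix; every key token is followed by one <VALUE> slot."""
    tokens = ["<BOS>", "<COND>"]
    for key in condition_keys:
        tokens += [key, "<VALUE>"]  # key-token spelling is hypothetical
    return tokens + ["</COND>", "<SEP>", "<MOL>"]


def inject_conditions(
    tokens: Sequence[str],
    token_embeddings: Sequence[Vector],
    condition_vectors: Dict[str, Vector],
) -> List[Vector]:
    """Swap the embedding at each <VALUE> slot for the encoded condition vector.

    The slot is identified by the key token immediately before it, so the
    number of positions never changes.
    """
    if len(tokens) != len(token_embeddings):
        raise ValueError("one embedding per token is required")
    injected = [list(vec) for vec in token_embeddings]
    for position, token in enumerate(tokens):
        if token == "<VALUE>":
            key = tokens[position - 1]
            injected[position] = list(condition_vectors[key])
    return injected


TOY_HIDDEN = 4  # toy width for readability; the real hidden size is 256
prompt = build_prompt(["mol_wt", "mol_logp"])
embeddings = [[0.0] * TOY_HIDDEN for _ in prompt]
encoded = {"mol_wt": [1.0] * TOY_HIDDEN, "mol_logp": [2.0] * TOY_HIDDEN}
mixed = inject_conditions(prompt, embeddings, encoded)
print(len(mixed) == len(prompt))  # True
```

Only the two `<VALUE>` positions differ from the ordinary token embeddings; every other position, and the sequence length, is unchanged.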

Supported Conditions

| Key | Meaning | Encoding / Range |
| --- | --- | --- |
| mol_hac | Heavy atom count | one-hot, 2 to 50 |
| mol_hbdc | Hydrogen-bond donor count | one-hot, 0 to 10 |
| mol_hbac | Hydrogen-bond acceptor count | one-hot, 0 to 22 |
| mol_rotbc | Rotatable bond count | one-hot, 0 to 20 |
| mol_wt | Molecular weight (Da) | RBF, 30 to 750, 128 steps |
| mol_logp | LogP | RBF, -6 to 12, 96 steps |
| mol_tpsa | Topological polar surface area (Å²) | RBF, 0 to 200, 96 steps |
| mol_qed | QED | RBF, 0.3 to 1.0, 64 steps |
| mol_sas | Synthetic accessibility score | RBF, 1.0 to 5.0, 64 steps |
| mol_pharma_fp | 2D pharmacophore fingerprint | direct vector, 1032 dimensions |
| prot_seq_esm_emb | Protein sequence embedding | direct vector, 1152 dimensions |
| prot_poc_gvp_emb | Protein pocket geometry embedding | GVP/pocket vector, 256 dimensions |
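To make the one-hot and RBF entries above concrete, here is a minimal sketch of both encodings. It is illustrative only: the card gives the ranges and step counts but not the exact formulas, so the inclusive one-hot range, the evenly spaced RBF centers (one per step), and the Gaussian width equal to the center spacing are assumptions, and `one_hot` and `rbf_expand` are hypothetical helper names.

```python
import math
from typing import List


def one_hot(value: int, low: int, high: int) -> List[float]:
    """One-hot encode an integer count over the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    vec = [0.0] * (high - low + 1)
    vec[value - low] = 1.0
    return vec


def rbf_expand(value: float, low: float, high: float, steps: int) -> List[float]:
    """Expand a scalar on Gaussian radial basis functions.

    Assumption: `steps` centers evenly spaced from low to high (inclusive),
    each with a width equal to the spacing between neighboring centers.
    """
    spacing = (high - low) / (steps - 1)
    return [
        math.exp(-(((value - (low + i * spacing)) / spacing) ** 2))
        for i in range(steps)
    ]


hbd = one_hot(2, low=0, high=10)                       # mol_hbdc = 2
wt = rbf_expand(450.0, low=30.0, high=750.0, steps=128)  # mol_wt = 450 Da
print(len(hbd), hbd.index(1.0))  # 11 2
print(len(wt), wt.index(max(wt)))  # 128 74
```

A value that falls between two centers activates several neighboring basis functions, which gives the model a smooth representation of continuous properties instead of a hard bin.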

In the paper, protein-sequence conditioning uses mean-pooled final embeddings from ESMC-600M. Pocket conditioning processes hydrogen-free pocket PDB structures with a 25 Å radius, a maximum of 425 atoms, and a directed 8-nearest-neighbor atom graph.
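The directed 8-nearest-neighbor graph mentioned above can be sketched in a few lines. This is a brute-force illustration, not the Molexar preprocessing code: the edge direction (neighbor to center), the tie-breaking by atom index, and the function name `knn_edges` are assumptions, and the selection of atoms within the radius and atom limit is left outside the sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def knn_edges(
    coords: Sequence[Point], k: int = 8, max_atoms: int = 425
) -> List[Tuple[int, int]]:
    """Return directed edges (source, target) from each atom's k nearest neighbors.

    Assumption: edges point from the neighbor to the center atom, and ties in
    distance are broken by the lower atom index.
    """
    if len(coords) > max_atoms:
        raise ValueError("select at most max_atoms pocket atoms before building the graph")
    edges: List[Tuple[int, int]] = []
    for i, center in enumerate(coords):
        ranked = sorted(
            (math.dist(center, other), j)
            for j, other in enumerate(coords)
            if j != i
        )
        edges.extend((j, i) for _, j in ranked[:k])
    return edges


# Ten toy atoms on a line, 1 Å apart, with k reduced to 3 for readability.
toy_atoms = [(float(i), 0.0, 0.0) for i in range(10)]
toy_edges = knn_edges(toy_atoms, k=3)
print(len(toy_edges))  # 30
print(sorted(src for src, dst in toy_edges if dst == 0))  # [1, 2, 3]
```

Because each atom picks its own neighbors, the graph is not symmetric in general: atom 3 is a neighbor of atom 0 in the toy example, but atom 0 is not among the three nearest neighbors of atom 3.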

Installation

Install Molexar and Fragment-SELFIES before loading the model:

git clone https://github.com/fairydance/Molexar.git
git clone https://github.com/fairydance/Fragment-SELFIES.git

cd Molexar

conda create -n molexar python=3.13
conda activate molexar

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers accelerate datasets evaluate biopython loguru
conda install -c conda-forge rdkit scipy seaborn

python -m pip install -e ../Fragment-SELFIES
python -m pip install -e . --no-deps

python -c "import fragment_selfies; import molexar; print('Molexar environment ready')"

The commands above set up the core environment; install any further runtime dependencies listed in the Molexar repository documentation. Fragment-SELFIES is required to convert generated Fragment-SELFIES strings to SMILES. Protein-sequence conditioning additionally requires the auxiliary ESM embedding environment described in the Molexar repository.

Download

hf download fairydance/molexar-10m-omni --local-dir molexar-10m-omni

Usage

Property-conditioned generation:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --mol_wt 450 \
  --mol_logp 3.5 \
  --mol_hbdc 2 \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical \
  --output_file property_samples.jsonl \
  --output_format jsonl

Pharmacophore-fingerprint conditioning from a reference SMILES:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --condition_key mol_pharma_fp \
  --reference_smiles 'Cc1nnc(N2CCNCC2)s1' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Protein-sequence conditioning:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --protein_sequence 'MKTIIALSYIFCLVFAKDRTEG' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Protein-pocket conditioning:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --pocket_pdb /path/to/pocket.pdb \
  --pocket_radius 25 \
  --max_atoms 425 \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Omni also supports fragment-constrained generation under active conditions: combine the condition flags with --generation_task and either --start_smiles or --start_string. Supported generation tasks are de_novo, motif_extension, scaffold_decoration, linker_design, scaffold_morphing, and superstructure.
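When many runs are scripted, it can help to assemble the command line programmatically. The sketch below only composes flags that appear in this card and does not call the script itself. The helper `build_inference_command` is hypothetical, the benzene start SMILES is an arbitrary illustration, and keeping `--mode conditional` for fragment-constrained runs is an assumption; check the options of the inference script before relying on it.

```python
import shlex
from typing import Dict, List, Optional

# Generation tasks listed in this card.
GENERATION_TASKS = {
    "de_novo",
    "motif_extension",
    "scaffold_decoration",
    "linker_design",
    "scaffold_morphing",
    "superstructure",
}


def build_inference_command(
    model_path: str,
    conditions: Dict[str, object],
    generation_task: Optional[str] = None,
    start_smiles: Optional[str] = None,
    start_string: Optional[str] = None,
    num_samples: int = 10,
) -> List[str]:
    """Compose an argument list for scripts/run_inference.py (hypothetical helper)."""
    if generation_task is not None and generation_task not in GENERATION_TASKS:
        raise ValueError(f"unsupported generation task: {generation_task}")
    if start_smiles is not None and start_string is not None:
        raise ValueError("pass either start_smiles or start_string, not both")
    cmd = ["python", "scripts/run_inference.py", "--mode", "conditional",
           "--model_path", model_path]
    for flag, value in conditions.items():
        cmd += [f"--{flag}", str(value)]
    if generation_task is not None:
        cmd += ["--generation_task", generation_task]
    if start_smiles is not None:
        cmd += ["--start_smiles", start_smiles]
    if start_string is not None:
        cmd += ["--start_string", start_string]
    cmd += ["--num_samples", str(num_samples), "--convert_to_smiles", "--canonical"]
    return cmd


command = build_inference_command(
    "/path/to/molexar-10m-omni",
    {"mol_wt": 450, "mol_logp": 3.5},
    generation_task="scaffold_decoration",
    start_smiles="c1ccccc1",  # illustrative start fragment, not from the card
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the root of the Molexar checkout.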

Training

Molexar-10M Omni was initialized from Molexar-10M Base and trained with universal multi-condition SFT. The SFT objective masks the prompt prefix up to and including <MOL>, so the loss is applied only to the molecule tokens and the closing tokens that follow, conditioned on the prompt and the injected condition values.
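The prefix masking can be sketched as a label-building step. This is a minimal illustration under stated assumptions rather than the training code: the token ids in `TOY_VOCAB` are made up, `build_sft_labels` is a hypothetical helper, and the use of -100 follows the common PyTorch convention (the default `ignore_index` of `torch.nn.CrossEntropyLoss`) instead of anything confirmed by this card.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Toy vocabulary with made-up ids; the released tokenizer will differ.
TOY_VOCAB = {
    "<BOS>": 0, "<COND>": 1, "mol_wt": 9, "<VALUE>": 2, "</COND>": 3,
    "<SEP>": 4, "<MOL>": 5, "frag_a": 20, "frag_b": 21, "</MOL>": 6, "<EOS>": 7,
}


def build_sft_labels(token_ids: List[int], mol_token_id: int) -> List[int]:
    """Copy token ids to labels, ignoring everything up to and including <MOL>."""
    mol_position = token_ids.index(mol_token_id)
    return [
        IGNORE_INDEX if position <= mol_position else token_id
        for position, token_id in enumerate(token_ids)
    ]


sequence = ["<BOS>", "<COND>", "mol_wt", "<VALUE>", "</COND>", "<SEP>", "<MOL>",
            "frag_a", "frag_b", "</MOL>", "<EOS>"]
input_ids = [TOY_VOCAB[token] for token in sequence]
labels = build_sft_labels(input_ids, TOY_VOCAB["<MOL>"])
print(labels)
# [-100, -100, -100, -100, -100, -100, -100, 20, 21, 6, 7]
```

Only the molecule tokens and the closing tokens keep their ids as targets, so the prompt and the injected condition slots shape the prediction without contributing to the loss.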

Training script provenance:

examples/train/bjx_h800_sft_universal_multi_unleaky.sh

The SFT data combines molecule-context and target-context samples. Molecule-context samples use the UniChem-derived Fragment-SELFIES corpus with nine scalar properties and a 2D pharmacophore fingerprint. Target-context samples use protein-ligand pairs from SAIR and the PLINDER training set, with protein-sequence ESM embeddings and processed pocket structures. The Molexar paper reports removing target-context training pairs whose protein sequence had more than 30% identity to any CrossDocked2020 test protein; after filtering, the target-context pool contains 573,463 SAIR pair records and 21,770 PLINDER training-set pair records.

Main training settings from the release script and paper:

| Setting | Value |
| --- | --- |
| Objective | Universal multi-condition supervised fine-tuning |
| Sequence length | 256 |
| Epochs | 5 |
| Batch size | 1000 |
| Learning rate | 2e-4 |
| Warmup steps | 2000 |
| Molecule:target sample ratio | 4:1 |
| Molecule-side active conditions | 1, 2, or 3 conditions with probabilities 0.6, 0.3, 0.1 |
| Pharmacophore oversampling probability | 0.5 |
| Mixed precision | bfloat16 |
| Distributed training | Full-shard FSDP on 8 H800 GPUs |
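The sampling settings above can be read as a simple mixture. The sketch below is one possible interpretation, not the release script: it assumes the 4:1 molecule:target ratio is realized by an independent random draw per sample, and it leaves out the pharmacophore oversampling probability because its exact semantics are not described in this card. The function names are hypothetical.

```python
import random
from collections import Counter

MOLECULE_FRACTION = 4 / (4 + 1)       # 4:1 molecule:target ratio
ACTIVE_COUNTS = [1, 2, 3]
ACTIVE_WEIGHTS = [0.6, 0.3, 0.1]


def sample_context(rng: random.Random) -> str:
    """Pick the sample type; assumes an independent draw per sample."""
    return "molecule" if rng.random() < MOLECULE_FRACTION else "target"


def sample_active_count(rng: random.Random) -> int:
    """Pick how many molecule-side conditions are active for one sample."""
    return rng.choices(ACTIVE_COUNTS, weights=ACTIVE_WEIGHTS, k=1)[0]


rng = random.Random(0)
N_DRAWS = 20_000
context_counts = Counter(sample_context(rng) for _ in range(N_DRAWS))
active_counts = Counter(sample_active_count(rng) for _ in range(N_DRAWS))
print(context_counts["molecule"] / N_DRAWS)  # close to 0.8
print({k: active_counts[k] / N_DRAWS for k in ACTIVE_COUNTS})  # close to 0.6, 0.3, 0.1
```

Under this reading, most training samples carry a single active molecule-side condition, while dual and triple combinations are seen less often.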

Evaluation Highlights

The Molexar paper reports that the SFT model follows single-, dual-, and triple-property instructions and supports pharmacophore, protein-sequence, and pocket-geometry conditioning.

CrossDocked2020 target-conditioned generation highlights:

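| Conditioning mode | Validity | Uniqueness | Diversity | QED | SA | Lipinski | Vina (kcal/mol) | High-affinity ratio (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence | 1.00 | 0.98 | 0.83 | 0.65 | 0.82 | 4.74 | -7.25 | 43.1 |
| Pocket | 1.00 | 0.97 | 0.84 | 0.65 | 0.83 | 4.82 | -7.42 | 53.0 |
| Pharmacophore | 1.00 | 0.91 | 0.76 | 0.59 | 0.71 | 4.69 | -6.79 | 38.4 |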

On MolGenBench, the paper reports high chemical-filter pass rates, strong active-molecule and scaffold recovery in de novo generation across protein targets, and favorable hit-to-lead potency when conditioning jointly on pocket and reference-ligand pharmacophore.

Intended Use

This model is intended for research use in molecular generation workflows, including:

  • Property-controlled molecule generation.
  • Pharmacophore-guided molecule generation.
  • Protein-sequence-conditioned target-aware generation.
  • Protein-pocket-conditioned target-aware generation.
  • Multi-condition molecular library ideation.
  • Fragment-constrained generation with optional active conditions.

Generated molecules should be treated as computational hypotheses. They require independent chemical-safety filtering, synthetic feasibility assessment, intellectual-property and dual-use review where relevant, expert medicinal-chemistry assessment, and experimental validation before downstream use.

Limitations

  • The model was trained on filtered drug-like chemistry; rare, contradictory, or out-of-distribution condition combinations may be followed less reliably.
  • Docking, pharmacophore, property, or sequence/pocket scores are not evidence of biological activity, safety, or clinical utility.
  • Protein-sequence and pocket conditioning depend on preprocessing quality, including ESM embeddings and pocket structure preparation.
  • Fragment-SELFIES decoding improves validity but does not guarantee synthetic accessibility, biological activity, safety, or developability.
  • The released tokenizer does not include the iodine token [I]; use bromine substitution in start constraints when necessary, as documented by the Molexar inference script.
  • Control over stereochemistry and explicit 3D output is outside the scope of this model.

License

This model is released under the MIT License.

Citation

If you use this model, please cite Molexar and Fragment-SELFIES:

@misc{lin2026molexar,
  title = {Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design},
  author = {Lin, Haoyu and Liao, Yiyan and Pan, Jinmei and Ling, Xinliao and Lai, Luhua and Pei, Jianfeng},
  year = {2026},
  url = {https://molexar.com}
}
