Molexar-10M Omni

Molexar-10M Omni is the universal multi-condition model for Molexar, a unified multimodal molecular foundation model for drug design. It is initialized from fairydance/molexar-10m-base and trained with supervised fine-tuning (SFT) to generate Fragment-SELFIES molecules conditioned on scalar molecular properties, pharmacophore fingerprints, protein sequences, and protein pockets.

This model corresponds to the Universal Multi-Condition Model described in the Molexar paper.

Project resources:

Molexar code: https://github.com/fairydance/Molexar
Fragment-SELFIES code: https://github.com/fairydance/Fragment-SELFIES
Official website: https://molexar.com

Model Details

| Field | Value |
| --- | --- |
| Model family | Molexar molecular causal language model |
| Architecture | Gemma2-style decoder with value-token embedding replacement for conditions |
| Base model | `fairydance/molexar-10m-base` |
| LM component parameters | 10,534,912 |
| Total model parameters | 14,756,261 |
| Layers | 16 |
| Hidden size | 256 |
| Intermediate size | 640 |
| Attention heads | 4 query heads, 1 key-value head |
| Vocabulary size | 127 |
| Context length | 256 tokens |
| Sliding window | 128 tokens |
| Molecular language | Fragment-SELFIES |
| Model files | `config.json`, `pytorch_model.bin`, `tokenizer.json`, `tokenizer_config.json`, `training_args.bin` |

Parameter counts are unique nn.Parameter counts with tied token-embedding/LM-head weights counted once. The LM component includes the token embeddings, Gemma2-style decoder, final normalization, and tied output head; the total additionally includes condition encoders and the pocket GVP encoder.
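
The LM component count can be cross-checked from the table with plain arithmetic. The sketch below is only a consistency check, not the project's counting code: it assumes bias-free Gemma2-style blocks, a head dimension of 64 (hidden size divided by query heads), and four RMSNorm weight vectors per block, none of which are stated in this card. Under those assumptions the figures match the reported 10,534,912, leaving 4,221,349 parameters for the condition encoders and the pocket GVP encoder.

```python
# Values taken from the Model Details table.
VOCAB, HIDDEN, INTERMEDIATE = 127, 256, 640
LAYERS, Q_HEADS, KV_HEADS = 16, 4, 1
HEAD_DIM = 64  # assumption: HIDDEN // Q_HEADS, not stated in the card


def lm_parameter_count() -> int:
    """Count LM weights assuming bias-free Gemma2-style decoder blocks."""
    embed = VOCAB * HIDDEN  # shared with the tied LM head, so counted once
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # key and value projections
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    norms = 4 * HIDDEN  # assumption: four RMSNorm weight vectors per block
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm


LM_PARAMS = lm_parameter_count()
TOTAL_PARAMS = 14_756_261  # reported total from the table
CONDITION_PARAMS = TOTAL_PARAMS - LM_PARAMS
print(LM_PARAMS, CONDITION_PARAMS)  # 10534912 4221349
```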

Molexar uses a shared sequence template for pretraining, SFT, and inference:

<BOS><COND> conditions </COND><SEP><MOL> molecule </MOL><EOS>

The condition block contains ordered key-token/value-token pairs. During conditional generation, selected <VALUE> token embeddings are replaced in place by encoded condition vectors. This keeps all generation modes on the same autoregressive decoding path and remains compatible with key-value-cache generation.
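
The in-place replacement can be illustrated with a toy example. This is a minimal sketch rather than the Molexar implementation: the key-token spellings (`<mol_wt>`, `<mol_logp>`), the two-dimensional embeddings, and the `inject_conditions` helper are all hypothetical, and the real model uses 256-dimensional vectors produced by its condition encoders. The point is that the sequence length and token positions do not change, which is why ordinary autoregressive decoding still applies.

```python
from typing import Dict, List

Vector = List[float]


def inject_conditions(
    tokens: List[str],
    embedding_table: Dict[str, Vector],
    encoded: Dict[str, Vector],
) -> List[Vector]:
    """Embed a prompt, then overwrite selected <VALUE> rows in place."""
    rows = [list(embedding_table[tok]) for tok in tokens]
    for pos in range(1, len(tokens)):
        key = tokens[pos - 1]  # each value token follows its key token
        if tokens[pos] == "<VALUE>" and key in encoded:
            rows[pos] = list(encoded[key])  # same position, same length
    return rows


# Hypothetical key-token spellings; the released vocabulary may differ.
prompt = ["<BOS>", "<COND>", "<mol_wt>", "<VALUE>", "<mol_logp>", "<VALUE>",
          "</COND>", "<SEP>", "<MOL>"]
table = {tok: [0.0, 0.0] for tok in prompt}
table["<VALUE>"] = [9.0, 9.0]  # stand-in for the learned placeholder embedding
encoded_conditions = {"<mol_wt>": [0.45, 0.10]}  # only mol_wt is active here

embedded = inject_conditions(prompt, table, encoded_conditions)
print(embedded[3], embedded[5])
```

In this toy run the `<VALUE>` slot after `<mol_wt>` carries the encoded condition, while the slot after `<mol_logp>` keeps its placeholder embedding.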

Supported Conditions

| Key | Meaning | Encoding / Range |
| --- | --- | --- |
| `mol_hac` | Heavy atom count | one-hot, 2 to 50 |
| `mol_hbdc` | Hydrogen-bond donor count | one-hot, 0 to 10 |
| `mol_hbac` | Hydrogen-bond acceptor count | one-hot, 0 to 22 |
| `mol_rotbc` | Rotatable bond count | one-hot, 0 to 20 |
| `mol_wt` | Molecular weight (Da) | RBF, 30 to 750, 128 steps |
| `mol_logp` | LogP | RBF, -6 to 12, 96 steps |
| `mol_tpsa` | Topological polar surface area | RBF, 0 to 200, 96 steps |
| `mol_qed` | QED | RBF, 0.3 to 1.0, 64 steps |
| `mol_sas` | Synthetic accessibility score | RBF, 1.0 to 5.0, 64 steps |
| `mol_pharma_fp` | 2D pharmacophore fingerprint | direct vector, 1032 dimensions |
| `prot_seq_esm_emb` | Protein sequence embedding | direct vector, 1152 dimensions |
| `prot_poc_gvp_emb` | Protein pocket geometry embedding | GVP/pocket vector, 256 dimensions |
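
The one-hot and RBF encodings listed above can be sketched as follows. This is an illustration under stated assumptions, not the released encoder: it assumes the integer ranges are inclusive, that "steps" means the number of evenly spaced Gaussian centers including both endpoints, and that the kernel width equals the center spacing. The helper names `one_hot` and `rbf` are hypothetical.

```python
import math
from typing import List


def one_hot(value: int, low: int, high: int) -> List[float]:
    """One-hot encode an integer over the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    vec = [0.0] * (high - low + 1)
    vec[value - low] = 1.0
    return vec


def rbf(value: float, low: float, high: float, steps: int) -> List[float]:
    """Gaussian RBF expansion over `steps` evenly spaced centers."""
    spacing = (high - low) / (steps - 1)
    centers = [low + i * spacing for i in range(steps)]
    # Assumption: the kernel width equals the spacing between centers.
    return [math.exp(-(((value - c) / spacing) ** 2)) for c in centers]


hbdc_vec = one_hot(2, 0, 10)            # mol_hbdc = 2, 11 dimensions
wt_vec = rbf(450.0, 30.0, 750.0, 128)   # mol_wt = 450 Da, 128 dimensions
peak = max(range(len(wt_vec)), key=wt_vec.__getitem__)
print(len(hbdc_vec), len(wt_vec), peak)
```

An RBF expansion turns a continuous value into a smooth activation pattern over neighboring centers, so nearby property values produce similar vectors, whereas one-hot encoding treats each integer count as a separate category.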

In the paper, protein-sequence conditioning uses mean-pooled ESMC-600M final embeddings. Pocket conditioning takes pocket PDB structures with hydrogens removed, restricted to a 25 Å radius and at most 425 atoms, and represents them as a directed 8-nearest-neighbor atom graph.
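
The pocket preprocessing described above can be sketched in pure Python. This is a simplified illustration, not the Molexar pipeline: the choice of pocket center, the rule for truncating to 425 atoms (nearest to the center first), and the edge direction (neighbor to atom) are all assumptions, and the `pocket_graph` helper is hypothetical. Hydrogens are assumed to be removed before the coordinates are passed in.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pocket_graph(
    coords: Sequence[Point],
    center: Point,
    radius: float = 25.0,
    max_atoms: int = 425,
    k: int = 8,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Select pocket atoms and build directed k-nearest-neighbor edges."""
    in_range = [i for i, p in enumerate(coords) if math.dist(p, center) <= radius]
    # Assumption: when too many atoms remain, keep those nearest the center.
    kept = sorted(in_range, key=lambda i: math.dist(coords[i], center))[:max_atoms]
    edges: List[Tuple[int, int]] = []
    for dst in kept:
        others = [src for src in kept if src != dst]
        nearest = sorted(others, key=lambda s: math.dist(coords[s], coords[dst]))[:k]
        edges.extend((src, dst) for src in nearest)  # edge from neighbor to atom
    return kept, edges


# Toy input: a 3 x 3 x 3 grid of atoms plus one atom far outside the radius.
grid = [(float(x), float(y), float(z))
        for x in range(3) for y in range(3) for z in range(3)]
atoms = grid + [(100.0, 0.0, 0.0)]
kept_atoms, knn_edges = pocket_graph(atoms, center=(1.0, 1.0, 1.0))
print(len(kept_atoms), len(knn_edges))
```

The graph is directed because nearest-neighbor relations are not symmetric: atom A can be among the eight nearest neighbors of atom B without the reverse being true.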

Installation

Install Molexar and Fragment-SELFIES before loading the model:

git clone https://github.com/fairydance/Molexar.git
git clone https://github.com/fairydance/Fragment-SELFIES.git

cd Molexar

conda create -n molexar python=3.13
conda activate molexar

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers accelerate datasets evaluate biopython loguru
conda install -c conda-forge rdkit scipy seaborn

python -m pip install -e ../Fragment-SELFIES
python -m pip install -e . --no-deps

python -c "import fragment_selfies; import molexar; print('Molexar environment ready')"

Install the runtime dependencies listed in the Molexar repository documentation. Fragment-SELFIES is required to convert generated Fragment-SELFIES strings to SMILES. Protein-sequence conditioning also requires the auxiliary ESM embedding environment described by the Molexar repository.

Download

hf download fairydance/molexar-10m-omni --local-dir molexar-10m-omni

Usage

Property-conditioned generation:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --mol_wt 450 \
  --mol_logp 3.5 \
  --mol_hbdc 2 \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical \
  --output_file property_samples.jsonl \
  --output_format jsonl

Pharmacophore-fingerprint conditioning from a reference SMILES:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --condition_key mol_pharma_fp \
  --reference_smiles 'Cc1nnc(N2CCNCC2)s1' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Protein-sequence conditioning:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --protein_sequence 'MKTIIALSYIFCLVFAKDRTEG' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Protein-pocket conditioning:

python scripts/run_inference.py --mode conditional \
  --model_path /path/to/molexar-10m-omni \
  --pocket_pdb /path/to/pocket.pdb \
  --pocket_radius 25 \
  --max_atoms 425 \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Omni also supports fragment-constrained generation with active conditions by combining condition flags with --generation_task and --start_smiles or --start_string. Supported generation tasks are de_novo, motif_extension, scaffold_decoration, linker_design, scaffold_morphing, and superstructure.

Training

Molexar-10M Omni was initialized from Molexar-10M Base and trained with universal multi-condition SFT. The SFT objective masks the prompt prefix up to and including <MOL>, so the loss is applied only to the molecule tokens and the closing tokens, conditioned on the prompt and the injected condition values.
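
The loss masking can be sketched with plain lists. This is a minimal illustration, not the training code: the token IDs and the `sft_labels` helper are hypothetical, and the sketch uses -100 as the ignored label because that is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def sft_labels(token_ids: List[int], mol_token_id: int) -> List[int]:
    """Mask every position up to and including <MOL>; keep the rest as labels."""
    first_target = token_ids.index(mol_token_id) + 1
    return [IGNORE_INDEX] * first_target + list(token_ids[first_target:])


# Hypothetical IDs for: <BOS> <COND> key <VALUE> </COND> <SEP> <MOL> frag frag </MOL> <EOS>
MOL_ID = 6
sequence = [1, 2, 10, 3, 4, 5, MOL_ID, 20, 21, 7, 8]
labels = sft_labels(sequence, MOL_ID)
print(labels)
```

Only the two fragment tokens, `</MOL>`, and `<EOS>` contribute to the loss in this example; the condition block is seen by the model but never predicted.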

Training script provenance:

examples/train/bjx_h800_sft_universal_multi_unleaky.sh

The SFT data combines molecule-context and target-context samples. Molecule-context samples use the UniChem-derived Fragment-SELFIES corpus with nine scalar properties and a 2D pharmacophore fingerprint. Target-context samples use protein-ligand pairs from SAIR and the PLINDER training set, with protein-sequence ESM embeddings and processed pocket structures. The Molexar paper reports removing target-context training pairs whose protein sequence had more than 30% identity to any CrossDocked2020 test protein; after filtering, the target-context pool contains 573,463 SAIR pair records and 21,770 PLINDER training-set pair records.

Main training settings from the release script and paper:

| Setting | Value |
| --- | --- |
| Objective | Universal multi-condition supervised fine-tuning |
| Sequence length | 256 |
| Epochs | 5 |
| Batch size | 1000 |
| Learning rate | 2e-4 |
| Warmup steps | 2000 |
| Molecule:target sample ratio | 4:1 |
| Molecule-side active conditions | 1, 2, or 3 conditions with probabilities 0.6, 0.3, 0.1 |
| Pharmacophore oversampling probability | 0.5 |
| Mixed precision | bfloat16 |
| Distributed training | Full-shard FSDP on 8 H800 GPUs |
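
The two sampling settings in the table can be illustrated with a short simulation. This is a rough sketch under loud assumptions, not the release script: it treats the 4:1 ratio as a per-sample probability of 0.8, draws the active molecule-side keys uniformly without replacement, and leaves target-context conditions unspecified. The pharmacophore oversampling probability is left out because this card does not say how it is applied. The `sample_spec` helper is hypothetical.

```python
import random
from typing import Dict, List

MOLECULE_KEYS: List[str] = [
    "mol_hac", "mol_hbdc", "mol_hbac", "mol_rotbc", "mol_wt",
    "mol_logp", "mol_tpsa", "mol_qed", "mol_sas", "mol_pharma_fp",
]


def sample_spec(rng: random.Random) -> Dict[str, object]:
    """Pick a context type and, for molecule context, the active keys."""
    if rng.random() >= 4 / 5:  # assumption: 4:1 means P(molecule) = 0.8
        return {"context": "target", "keys": []}
    n_active = rng.choices([1, 2, 3], weights=[0.6, 0.3, 0.1])[0]
    # Assumption: active keys are drawn uniformly without replacement.
    return {"context": "molecule", "keys": rng.sample(MOLECULE_KEYS, n_active)}


rng = random.Random(0)
specs = [sample_spec(rng) for _ in range(20_000)]
molecule_specs = [s for s in specs if s["context"] == "molecule"]
molecule_fraction = len(molecule_specs) / len(specs)
single_fraction = sum(len(s["keys"]) == 1 for s in molecule_specs) / len(molecule_specs)
print(round(molecule_fraction, 2), round(single_fraction, 2))
```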

Evaluation Highlights

The Molexar paper reports that the SFT model follows single-, dual-, and triple-property instructions and supports pharmacophore, protein-sequence, and pocket-geometry conditioning.

CrossDocked2020 target-conditioned generation highlights:

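| Conditioning mode | Validity | Uniqueness | Diversity | QED | SA | Lipinski | Vina | High-affinity ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence | 1.00 | 0.98 | 0.83 | 0.65 | 0.82 | 4.74 | -7.25 | 43.1 |
| Pocket | 1.00 | 0.97 | 0.84 | 0.65 | 0.83 | 4.82 | -7.42 | 53.0 |
| Pharmacophore | 1.00 | 0.91 | 0.76 | 0.59 | 0.71 | 4.69 | -6.79 | 38.4 |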

On MolGenBench, the paper reports high chemical-filter pass rates, strong active-molecule and scaffold recovery in de novo generation across protein targets, and favorable hit-to-lead potency when conditioning jointly on pocket and reference-ligand pharmacophore.

Intended Use

This model is intended for research use in molecular generation workflows, including:

Property-controlled molecule generation.
Pharmacophore-guided molecule generation.
Protein-sequence-conditioned target-aware generation.
Protein-pocket-conditioned target-aware generation.
Multi-condition molecular library ideation.
Fragment-constrained generation with optional active conditions.

Generated molecules should be treated as computational hypotheses. They require independent chemical-safety filtering, synthetic feasibility assessment, intellectual-property and dual-use review where relevant, expert medicinal-chemistry assessment, and experimental validation before downstream use.

Limitations

The model was trained on filtered drug-like chemistry; rare, contradictory, or out-of-distribution condition combinations may be followed less reliably.
Docking, pharmacophore, property, or sequence/pocket scores are not evidence of biological activity, safety, or clinical utility.
Protein-sequence and pocket conditioning depend on preprocessing quality, including ESM embeddings and pocket structure preparation.
Fragment-SELFIES decoding improves validity but does not guarantee synthetic accessibility, biological activity, safety, or developability.
The released tokenizer does not include the iodine token [I]; use bromine substitution in start constraints when necessary, as documented by the Molexar inference script.
Stereochemical and explicit 3D output control are outside the scope of this model.

License

This model is released under the MIT License.

Citation

If you use this model, please cite Molexar and Fragment-SELFIES:

@misc{lin2026molexar,
  title = {Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design},
  author = {Lin, Haoyu and Liao, Yiyan and Pan, Jinmei and Ling, Xinliao and Lai, Luhua and Pei, Jianfeng},
  year = {2026},
  url = {https://molexar.com}
}
