Molexar-10M Base

Molexar-10M Base is the unconditional base model for Molexar, a unified multimodal molecular foundation model for drug design. It is trained as an autoregressive molecular language model over Fragment-SELFIES, a BRICS-fragment molecular language with validity-preserving decoding and fragment-continuation support.

This model corresponds to the Unconditional Base Model described in the Molexar paper. It is intended for unconditional molecule generation and fragment-constrained continuation. For property-, pharmacophore-, protein-sequence-, or pocket-conditioned generation, use fairydance/molexar-10m-omni instead.

Project resources:

Molexar code: https://github.com/fairydance/Molexar
Fragment-SELFIES code: https://github.com/fairydance/Fragment-SELFIES
Official website: https://molexar.com

Model Details

Field	Value
Model family	Molexar molecular causal language model
Architecture	Gemma2-style decoder with RoPE, grouped-query attention, sliding-window/full-attention layers, and logit softcapping
LM component parameters	10,534,912
Total model parameters	14,756,261
Layers	16
Hidden size	256
Intermediate size	640
Attention heads	4 query heads, 1 key-value head
Vocabulary size	127
Context length	256 tokens
Sliding window	128 tokens
Molecular language	Fragment-SELFIES
Model files	`config.json`, `pytorch_model.bin`, `tokenizer.json`, `tokenizer_config.json`, `training_args.bin`

Parameter counts are unique nn.Parameter counts with tied token-embedding/LM-head weights counted once. The LM component includes the token embeddings, Gemma2-style decoder, final normalization, and tied output head; the total additionally includes condition encoders and the pocket GVP encoder.
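The LM component count can be cross-checked against the table with a short calculation. The sketch below relies on details that this card does not state: a head dimension of 64 (hidden size divided by query heads), bias-free linear projections, a gated MLP with three projection matrices, and four RMSNorm weight vectors per layer as in the Gemma2 reference design. Under those assumptions the arithmetic matches the reported figure; treat it as a consistency check rather than a description of the actual implementation.

```python
# Back-of-the-envelope parameter count for the LM component.
# Values marked "from the table" come from this model card; everything
# marked "assumption" is NOT stated in the card.

HIDDEN = 256          # from the table
INTERMEDIATE = 640    # from the table
LAYERS = 16           # from the table
Q_HEADS = 4           # from the table
KV_HEADS = 1          # from the table
VOCAB = 127           # from the table
HEAD_DIM = HIDDEN // Q_HEADS   # assumption: 64
NORMS_PER_LAYER = 4            # assumption: Gemma2-style pre/post norms

REPORTED_LM = 10_534_912       # from the table
REPORTED_TOTAL = 14_756_261    # from the table


def lm_parameter_count() -> int:
    """Count LM weights under the assumptions listed above."""
    # Attention: bias-free q/k/v/o projections (assumption).
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj

    # Gated MLP: gate, up, and down projections (assumption).
    mlp = 3 * HIDDEN * INTERMEDIATE

    norms = NORMS_PER_LAYER * HIDDEN
    per_layer = attention + mlp + norms

    embeddings = VOCAB * HIDDEN   # tied with the LM head, counted once
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


if __name__ == "__main__":
    print(lm_parameter_count())            # 10534912
    print(REPORTED_TOTAL - REPORTED_LM)    # 4221349
```

The difference between the two reported totals, 4,221,349 parameters, is the share that the card attributes to the condition encoders and the pocket GVP encoder.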

Molexar uses a shared sequence template for pretraining, supervised fine-tuning (SFT), and inference:

<BOS><COND> conditions </COND><SEP><MOL> molecule </MOL><EOS>

During base pretraining, the condition block is present but no condition value is active or injected. This keeps the base model compatible with the same prompt structure used by the universal multi-condition model.
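To make the template concrete, here is a minimal sketch of how such a sequence could be assembled as a string. The special-token spellings are copied from the template above; the helper names, the absence of whitespace between tokens, the idea that a generation prompt stops right after `<MOL>`, and the placeholder molecule string are all illustrative assumptions, not the Molexar tokenizer's actual behavior.

```python
# Sketch of the shared sequence template. Token spellings come from this
# card; helper names and whitespace handling are hypothetical.

BOS, EOS, SEP = "<BOS>", "<EOS>", "<SEP>"
COND_OPEN, COND_CLOSE = "<COND>", "</COND>"
MOL_OPEN, MOL_CLOSE = "<MOL>", "</MOL>"


def build_sequence(molecule: str, conditions: str = "") -> str:
    """Assemble one full sequence (hypothetical helper)."""
    return (
        f"{BOS}{COND_OPEN}{conditions}{COND_CLOSE}"
        f"{SEP}{MOL_OPEN}{molecule}{MOL_CLOSE}{EOS}"
    )


def build_prompt(conditions: str = "", molecule_prefix: str = "") -> str:
    """Assemble a generation prompt that ends inside the molecule block.

    Assumption: generation continues from <MOL>, optionally after a
    fragment prefix, until the model emits </MOL><EOS>.
    """
    return (
        f"{BOS}{COND_OPEN}{conditions}{COND_CLOSE}"
        f"{SEP}{MOL_OPEN}{molecule_prefix}"
    )


if __name__ == "__main__":
    # Base model: the condition block is present but left empty.
    print(build_prompt())
    # "MOLECULE_TOKENS" is a placeholder, not a real Fragment-SELFIES string.
    print(build_sequence("MOLECULE_TOKENS"))
```

Keeping the empty condition block in the base prompt is what lets the same structure be reused unchanged when conditions are later filled in.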

Installation

Install Molexar and Fragment-SELFIES before loading the model:

git clone https://github.com/fairydance/Molexar.git
git clone https://github.com/fairydance/Fragment-SELFIES.git

cd Molexar

conda create -n molexar python=3.13
conda activate molexar

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers accelerate datasets evaluate biopython loguru
conda install -c conda-forge rdkit scipy seaborn

python -m pip install -e ../Fragment-SELFIES
python -m pip install -e . --no-deps

python -c "import fragment_selfies; import molexar; print('Molexar environment ready')"

See the Molexar repository documentation for the authoritative list of runtime dependencies. Fragment-SELFIES is required to convert generated Fragment-SELFIES strings to SMILES.

Download

hf download fairydance/molexar-10m-base --local-dir molexar-10m-base
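The same download can be scripted from Python. This is a sketch using `snapshot_download` from `huggingface_hub`, the library that provides the `hf` command; the wrapper function name and its default directory are illustrative choices, and the call needs network access.

```python
# Python equivalent of the `hf download` command above (sketch).

REPO_ID = "fairydance/molexar-10m-base"   # repository name from this card


def download_model(local_dir: str = "molexar-10m-base") -> str:
    """Download the model files and return the local directory path.

    `download_model` is a hypothetical convenience wrapper. The import is
    done lazily so that defining the function does not require the
    huggingface_hub package to be installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_model()` should leave the files listed under Model files in `molexar-10m-base/`, which can then be passed to `--model_path`.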

Usage

Run unconditional generation from the Molexar repository:

python scripts/run_inference.py --mode base \
  --model_path /path/to/molexar-10m-base \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical \
  --output_file base_samples.jsonl \
  --output_format jsonl

Fragment-constrained generation is also supported. For example, motif extension from a SMILES fragment:

python scripts/run_inference.py --mode base \
  --model_path /path/to/molexar-10m-base \
  --generation_task motif_extension \
  --start_smiles '[*]C1(CC#N)CN(S(=O)(=O)CC)C1' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Supported generation tasks are de_novo, motif_extension, scaffold_decoration, linker_design, scaffold_morphing, and superstructure.
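When the inference script is driven from Python, it can help to validate the task name before launching a run. The sketch below only assembles flags that appear in the two commands above; the function name is hypothetical, and how tasks other than motif_extension receive their inputs is not covered by this card, so check the script's own help before relying on it.

```python
import shlex

# Task names as listed in this card.
SUPPORTED_TASKS = (
    "de_novo",
    "motif_extension",
    "scaffold_decoration",
    "linker_design",
    "scaffold_morphing",
    "superstructure",
)


def build_inference_command(model_path, task="de_novo",
                            start_smiles=None, num_samples=10):
    """Return the argument list for scripts/run_inference.py (sketch)."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported generation task: {task!r}")

    cmd = [
        "python", "scripts/run_inference.py", "--mode", "base",
        "--model_path", str(model_path),
        "--num_samples", str(num_samples),
        "--convert_to_smiles", "--canonical",
    ]
    # Mirror the card's examples: the unconditional command omits
    # --generation_task, the motif-extension command sets it explicitly.
    if task != "de_novo":
        cmd += ["--generation_task", task]
    if start_smiles is not None:
        cmd += ["--start_smiles", start_smiles]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        "/path/to/molexar-10m-base",
        task="motif_extension",
        start_smiles="[*]C1(CC#N)CN(S(=O)(=O)CC)C1",
    )
    # shlex.join quotes the SMILES string safely for display in a shell.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from inside the Molexar repository directory, which avoids shell-quoting problems with SMILES characters such as parentheses and brackets.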

Training

The base model was pretrained on randomized Fragment-SELFIES strings using causal language modeling. The training corpus is derived from UniChem, with SAIR and PLINDER training-set ligand coverage included, after canonicalization, deduplication, and drug-likeness/safety filtering. In the Molexar paper, this corpus contains 135,763,524 Fragment-SELFIES records, corresponding to 33,940,881 molecule-condition rows across four randomized folds (4 × 33,940,881 = 135,763,524).

Training script provenance:

examples/train/bjx_h800_pretrain_base.sh

Main training settings from the release script and paper:

Setting	Value
Objective	Causal language modeling
Sequence length	256
Epochs	2
Batch size	1000
Learning rate	2e-4
Warmup steps	2000
Mixed precision	bfloat16
Distributed training	Full-shard FSDP on 8 H800 GPUs

Evaluation Highlights

From the Molexar paper, unconditional generation on 10,000 samples:

Model	Validity	Uniqueness	Diversity	Quality
Molexar-10M Base	1.0000	0.9997	0.8824	0.8326

Quality is defined as the fraction of generated molecules with a quantitative estimate of drug-likeness (QED) >= 0.6 and a synthetic accessibility score (SAS) <= 4. According to the paper, the model also reached 100% validity across the reported fragment-constrained tasks, with strong drug-like quality for motif extension, linker design, scaffold morphing, scaffold decoration, and superstructure generation.
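The quality criterion above is easy to apply to your own samples once QED and SAS values are available. The sketch below takes precomputed (QED, SAS) pairs so that it stays dependency-free; in practice QED is commonly computed with RDKit's `rdkit.Chem.QED.qed` and SAS with the `sascorer` module from RDKit Contrib, though whether the paper used exactly those implementations is an assumption. The function name and the example scores are illustrative.

```python
from typing import Iterable, Tuple

QED_MIN = 0.6   # threshold from this card
SAS_MAX = 4.0   # threshold from this card


def quality_fraction(scores: Iterable[Tuple[float, float]]) -> float:
    """Fraction of molecules with QED >= 0.6 and SAS <= 4.

    `scores` holds one (qed, sas) pair per generated molecule.
    Returns 0.0 for an empty input.
    """
    pairs = list(scores)
    if not pairs:
        return 0.0
    passed = sum(
        1 for qed, sas in pairs if qed >= QED_MIN and sas <= SAS_MAX
    )
    return passed / len(pairs)


if __name__ == "__main__":
    # Made-up scores for four molecules; the first and last pass.
    example = [(0.72, 2.9), (0.55, 2.1), (0.81, 4.6), (0.60, 4.0)]
    print(quality_fraction(example))  # 0.5
```

Both thresholds are inclusive, matching the `>=` and `<=` in the definition.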

Intended Use

This model is intended for research use in molecular generation workflows, including:

Unconditional Fragment-SELFIES molecule generation.
Fragment continuation and fragment-constrained ideation.
Baseline initialization for further Molexar fine-tuning.
Computational chemistry research on molecular language models.

Generated molecules should be treated as computational hypotheses. They require independent chemical-safety filtering, synthetic feasibility assessment, intellectual-property and dual-use review where relevant, expert medicinal-chemistry assessment, and experimental validation before downstream use.

Limitations

The base model is not trained to obey active property, pharmacophore, protein-sequence, or pocket conditions.
Molexar generates molecular strings, not experimentally validated compounds.
Fragment-SELFIES decoding improves validity but does not guarantee synthetic accessibility, biological activity, safety, or developability.
The released tokenizer does not include the iodine token [I]; use bromine substitution in start constraints when necessary, as documented by the Molexar inference script.
Stereochemical and explicit 3D output control are outside the scope of this model.
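For the missing iodine token, a start constraint can be preprocessed before it is passed to the inference script. The sketch below is a plain string-level substitution and is not the inference script's own implementation; the function name is hypothetical. It swaps the element symbol only, so an isotope label such as `[125I]` would keep its mass number and should be handled by hand, and the substituted fragment is a different molecule from the original.

```python
import re

# Match an iodine symbol: a capital "I" that is not followed by a
# lowercase letter, so bracket atoms such as [In] and [Ir] are untouched.
_IODINE = re.compile(r"I(?![a-z])")


def replace_iodine_with_bromine(smiles: str) -> str:
    """Replace iodine atoms in a SMILES string with bromine (sketch)."""
    return _IODINE.sub("Br", smiles)


if __name__ == "__main__":
    print(replace_iodine_with_bromine("[*]c1ccc(I)cc1"))  # [*]c1ccc(Br)cc1
```

An atom-level edit with a cheminformatics toolkit such as RDKit would be more robust than a regular expression when the input may contain unusual bracket atoms.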

License

This model is released under the MIT License.

Citation

If you use this model, please cite Molexar and Fragment-SELFIES:

@misc{lin2026molexar,
  title = {Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design},
  author = {Lin, Haoyu and Liao, Yiyan and Pan, Jinmei and Ling, Xinliao and Lai, Luhua and Pei, Jianfeng},
  year = {2026},
  url = {https://molexar.com}
}
