Molexar-10M Base

Molexar-10M Base is the unconditional base model for Molexar, a unified multimodal molecular foundation model for drug design. It is trained as an autoregressive molecular language model over Fragment-SELFIES, a BRICS-fragment molecular language with validity-preserving decoding and fragment-continuation support.

This model corresponds to the Unconditional Base Model described in the Molexar paper. It is intended for unconditional molecule generation and fragment-constrained continuation. For property, pharmacophore, protein-sequence, or pocket-conditioned generation, use fairydance/molexar-10m-omni.

Project resources:

Model Details

| Field | Value |
| --- | --- |
| Model family | Molexar molecular causal language model |
| Architecture | Gemma2-style decoder with RoPE, grouped-query attention, sliding-window/full-attention layers, and logit softcapping |
| LM component parameters | 10,534,912 |
| Total model parameters | 14,756,261 |
| Layers | 16 |
| Hidden size | 256 |
| Intermediate size | 640 |
| Attention heads | 4 query heads, 1 key-value head |
| Vocabulary size | 127 |
| Context length | 256 tokens |
| Sliding window | 128 tokens |
| Molecular language | Fragment-SELFIES |
| Model files | config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json, training_args.bin |

Parameter counts are unique nn.Parameter counts with tied token-embedding/LM-head weights counted once. The LM component includes the token embeddings, Gemma2-style decoder, final normalization, and tied output head; the total additionally includes condition encoders and the pocket GVP encoder.
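
As a sanity check, the LM component count can be reproduced from the listed dimensions. The sketch below is not taken from the Molexar code; it assumes a head dimension of 64, bias-free linear layers, a gated MLP with three projections, and four RMSNorm weight vectors per decoder layer (the usual Gemma 2 layout). None of these are stated in this card.

```python
# Sketch: rebuild the LM-component parameter count from the listed dimensions.
# ASSUMPTIONS (not stated in the model card): HEAD_DIM = 64, no bias terms,
# a gated MLP with gate/up/down projections, four RMSNorms per decoder layer.

HIDDEN = 256
INTERMEDIATE = 640
LAYERS = 16
VOCAB = 127
Q_HEADS = 4
KV_HEADS = 1
HEAD_DIM = 64          # assumption
NORMS_PER_LAYER = 4    # assumption

LM_REPORTED = 10_534_912
TOTAL_REPORTED = 14_756_261


def lm_component_params() -> int:
    """Count unique parameters with the tied embedding/LM head counted once."""
    embed = VOCAB * HIDDEN
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM   # key and value projections
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE              # gate, up, down
    norms = NORMS_PER_LAYER * HIDDEN
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm


print("LM component:", lm_component_params())
print("Matches reported value:", lm_component_params() == LM_REPORTED)
print("Condition + pocket encoders:", TOTAL_REPORTED - LM_REPORTED)
```

Under these assumptions the count agrees with the reported 10,534,912, and the difference from the reported total leaves 4,221,349 parameters for the condition encoders and the pocket GVP encoder.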

Molexar uses a shared sequence template for pretraining, SFT, and inference:

<BOS><COND> conditions </COND><SEP><MOL> molecule </MOL><EOS>

During base pretraining, the condition block is present but no condition value is active or injected. This keeps the base model compatible with the same prompt structure used by the universal multi-condition model.
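
The template can be illustrated with a small string-building sketch. The helper names are hypothetical, and the code assumes the special tokens are spelled exactly as in the template and concatenated without spaces; the released tokenizer and inference script may assemble sequences differently.

```python
# Sketch: assemble the shared sequence template as plain strings.
# ASSUMPTIONS: token spellings are copied from the template above; joining
# without whitespace and the helper names are illustrative only.

BOS, EOS, SEP = "<BOS>", "<EOS>", "<SEP>"
COND_OPEN, COND_CLOSE = "<COND>", "</COND>"
MOL_OPEN, MOL_CLOSE = "<MOL>", "</MOL>"


def build_prompt(conditions: str = "") -> str:
    """Prefix up to the opening molecule tag; base-model use leaves conditions empty."""
    return f"{BOS}{COND_OPEN}{conditions}{COND_CLOSE}{SEP}{MOL_OPEN}"


def build_sequence(molecule: str, conditions: str = "") -> str:
    """Full training-style sequence: prompt, molecule tokens, closing tags."""
    return f"{build_prompt(conditions)}{molecule}{MOL_CLOSE}{EOS}"


# Base-model prompt: the condition block is present but empty.
print(build_prompt())
# "molecule" is a placeholder, as in the template.
print(build_sequence("molecule"))
```

For fragment-constrained continuation, the fragment tokens would follow the opening molecule tag in place of an empty start, which is why the same prefix serves both the base and the multi-condition model.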

Installation

Install Molexar and Fragment-SELFIES before loading the model:

git clone https://github.com/fairydance/Molexar.git
git clone https://github.com/fairydance/Fragment-SELFIES.git

cd Molexar

conda create -n molexar python=3.13
conda activate molexar

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers accelerate datasets evaluate biopython loguru
conda install -c conda-forge rdkit scipy seaborn

python -m pip install -e ../Fragment-SELFIES
python -m pip install -e . --no-deps

python -c "import fragment_selfies; import molexar; print('Molexar environment ready')"

Install the runtime dependencies listed in the Molexar repository documentation. Fragment-SELFIES is required to convert generated Fragment-SELFIES strings to SMILES.

Download

hf download fairydance/molexar-10m-base --local-dir molexar-10m-base

Usage

Run unconditional generation from the Molexar repository:

python scripts/run_inference.py --mode base \
  --model_path /path/to/molexar-10m-base \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical \
  --output_file base_samples.jsonl \
  --output_format jsonl
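
The command above writes base_samples.jsonl. A minimal sketch for loading such a file follows; it assumes only the standard JSON Lines layout (one JSON object per line) and makes no assumption about the field names, which are defined by the Molexar inference script.

```python
# Sketch: load a JSON Lines file into a list of dicts.
# ASSUMPTION: one JSON object per non-empty line; field names are not assumed.
import json
from pathlib import Path


def read_jsonl(path):
    """Return every non-empty line of `path` parsed as JSON."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    samples_path = Path("base_samples.jsonl")
    if samples_path.exists():
        samples = read_jsonl(samples_path)
        print(f"{len(samples)} records")
        if samples:
            # Inspect the keys rather than guessing the schema.
            print("fields:", sorted(samples[0]))
```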

Fragment-constrained generation is also supported. For example, motif extension from a SMILES fragment:

python scripts/run_inference.py --mode base \
  --model_path /path/to/molexar-10m-base \
  --generation_task motif_extension \
  --start_smiles '[*]C1(CC#N)CN(S(=O)(=O)CC)C1' \
  --num_samples 10 \
  --convert_to_smiles \
  --canonical

Supported generation tasks are de_novo, motif_extension, scaffold_decoration, linker_design, scaffold_morphing, and superstructure.

Training

The base model was pretrained on randomized Fragment-SELFIES strings using causal language modeling. The training corpus is derived from UniChem and also covers the ligands in the SAIR and PLINDER training sets. Molecules were canonicalized, deduplicated, and filtered for drug-likeness and safety. In the Molexar paper, this corpus contains 135,763,524 Fragment-SELFIES records, corresponding to 33,940,881 molecule-condition rows across four randomized folds (33,940,881 × 4 = 135,763,524).

Training script provenance:

examples/train/bjx_h800_pretrain_base.sh

Main training settings from the release script and paper:

| Setting | Value |
| --- | --- |
| Objective | Causal language modeling |
| Sequence length | 256 |
| Epochs | 2 |
| Batch size | 1000 |
| Learning rate | 2e-4 |
| Warmup steps | 2000 |
| Mixed precision | bfloat16 |
| Distributed training | Full-shard FSDP on 8 H800 GPUs |

Evaluation Highlights

From the Molexar paper, unconditional generation on 10,000 samples:

| Model | Validity | Uniqueness | Diversity | Quality |
| --- | --- | --- | --- | --- |
| Molexar-10M Base | 1.0000 | 0.9997 | 0.8824 | 0.8326 |

Quality is defined as the fraction of generated molecules with QED >= 0.6 and SAS <= 4. The model also reached 100% validity across the reported fragment-constrained tasks, with strong drug-like quality for motif extension, linker design, scaffold morphing, scaffold decoration, and superstructure generation.
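
The Quality definition can be written as a short function. This is a sketch, not the paper's evaluation code: it takes precomputed (QED, SAS) pairs, and the example scores are made up for illustration. How the paper treats invalid or duplicate molecules in the denominator is not stated here, so the sketch simply divides by the number of pairs it receives.

```python
# Sketch: fraction of molecules with QED >= 0.6 and SAS <= 4.
# ASSUMPTIONS: scores are precomputed; the denominator is the number of
# (QED, SAS) pairs passed in; the example values below are hypothetical.
from typing import Iterable, Tuple


def quality_fraction(
    scores: Iterable[Tuple[float, float]],
    qed_min: float = 0.6,
    sas_max: float = 4.0,
) -> float:
    """Return the share of (qed, sas) pairs meeting both thresholds."""
    pairs = list(scores)
    if not pairs:
        return 0.0
    passing = sum(1 for qed, sas in pairs if qed >= qed_min and sas <= sas_max)
    return passing / len(pairs)


# Hypothetical scores: two of the four pairs satisfy both thresholds.
example = [(0.72, 2.9), (0.55, 3.1), (0.81, 4.6), (0.60, 4.0)]
print(quality_fraction(example))
```

Both thresholds are inclusive, so a molecule with QED exactly 0.6 and SAS exactly 4 counts as passing.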

Intended Use

This model is intended for research use in molecular generation workflows, including:

  • Unconditional Fragment-SELFIES molecule generation.
  • Fragment continuation and fragment-constrained ideation.
  • Baseline initialization for further Molexar fine-tuning.
  • Computational chemistry research on molecular language models.

Generated molecules should be treated as computational hypotheses. They require independent chemical-safety filtering, synthetic feasibility assessment, intellectual-property and dual-use review where relevant, expert medicinal-chemistry assessment, and experimental validation before downstream use.

Limitations

  • The base model is not trained to obey active property, pharmacophore, protein-sequence, or pocket conditions.
  • Molexar generates molecular strings, not experimentally validated compounds.
  • Fragment-SELFIES decoding improves validity but does not guarantee synthetic accessibility, biological activity, safety, or developability.
  • The released tokenizer does not include the iodine token [I]; use bromine substitution in start constraints when necessary, as documented by the Molexar inference script.
  • Stereochemical and explicit 3D output control are outside the scope of this model.
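
The bromine substitution mentioned above can be sketched at the string level. The helper below is hypothetical and is not the Molexar inference script's implementation; it assumes the start constraint is a SMILES string and replaces iodine atoms while leaving the bracket atoms [In] and [Ir] untouched. A toolkit-based replacement (for example with RDKit) would be more robust for unusual inputs.

```python
# Sketch: replace iodine with bromine in a SMILES start constraint.
# ASSUMPTION: hypothetical helper, string-level only; not the Molexar script.
import re

# Match either a whole bracket atom or a bare iodine atom outside brackets.
_TOKEN = re.compile(r"\[[^\]]*\]|I")
# Inside brackets, "I" is iodine unless it starts "In" (indium) or "Ir" (iridium).
_BRACKET_IODINE = re.compile(r"I(?![nr])")


def iodine_to_bromine(smiles: str) -> str:
    """Return `smiles` with every iodine atom rewritten as bromine."""

    def swap(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token == "I":
            # Outside brackets, a capital I can only be iodine.
            return "Br"
        return _BRACKET_IODINE.sub("Br", token)

    return _TOKEN.sub(swap, smiles)


print(iodine_to_bromine("[*]CCI"))      # attachment point kept, iodine swapped
print(iodine_to_bromine("Ic1ccccc1"))   # iodine next to an aromatic atom
```

Handling bracket atoms separately matters because a bare I followed by an aromatic atom such as c or n is still iodine, whereas I followed by n or r inside brackets is a different element.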

License

This model is released under the MIT License.

Citation

If you use this model, please cite Molexar and Fragment-SELFIES:

@misc{lin2026molexar,
  title = {Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design},
  author = {Lin, Haoyu and Liao, Yiyan and Pan, Jinmei and Ling, Xinliao and Lai, Luhua and Pei, Jianfeng},
  year = {2026},
  url = {https://molexar.com}
}

Code and resources:
