Molexar-10M Base
Molexar-10M Base is the unconditional base model for Molexar, a unified multimodal molecular foundation model for drug design. It is trained as an autoregressive molecular language model over Fragment-SELFIES, a BRICS-fragment molecular language with validity-preserving decoding and fragment-continuation support.
This model corresponds to the Unconditional Base Model described in the Molexar paper. It is intended for unconditional molecule generation and fragment-constrained continuation. For property, pharmacophore, protein-sequence, or pocket-conditioned generation, use fairydance/molexar-10m-omni.
Project resources:
- Molexar code: https://github.com/fairydance/Molexar
- Fragment-SELFIES code: https://github.com/fairydance/Fragment-SELFIES
- Official website: https://molexar.com
Model Details
| Field | Value |
|---|---|
| Model family | Molexar molecular causal language model |
| Architecture | Gemma2-style decoder with RoPE, grouped-query attention, sliding-window/full-attention layers, and logit softcapping |
| LM component parameters | 10,534,912 |
| Total model parameters | 14,756,261 |
| Layers | 16 |
| Hidden size | 256 |
| Intermediate size | 640 |
| Attention heads | 4 query heads, 1 key-value head |
| Vocabulary size | 127 |
| Context length | 256 tokens |
| Sliding window | 128 tokens |
| Molecular language | Fragment-SELFIES |
| Model files | config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json, training_args.bin |
Parameter counts are unique nn.Parameter counts with tied token-embedding/LM-head weights counted once. The LM component includes the token embeddings, Gemma2-style decoder, final normalization, and tied output head; the total additionally includes condition encoders and the pocket GVP encoder.
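As a sanity check on the table, the LM component count can be reproduced from the listed dimensions. The sketch below assumes a per-head dimension of 64 (hidden size 256 divided by 4 query heads), bias-free linear projections, a gated MLP with three weight matrices, and four RMSNorm weight vectors per layer as in Gemma2. None of these details are stated on this card, so treat them as assumptions rather than a description of the released code.

```python
# Reported values from the Model Details table.
HIDDEN = 256
INTERMEDIATE = 640
LAYERS = 16
Q_HEADS = 4
KV_HEADS = 1
VOCAB = 127
REPORTED_LM = 10_534_912
REPORTED_TOTAL = 14_756_261

# Assumption (not stated on the card): head_dim = hidden size / query heads.
HEAD_DIM = HIDDEN // Q_HEADS


def lm_parameter_count() -> int:
    """Estimate the LM parameter count under the assumptions named above."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * HIDDEN * INTERMEDIATE  # assumed gate, up, and down projections
    norms = 4 * HIDDEN  # assumed four RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = VOCAB * HIDDEN  # tied with the output head, counted once
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


lm_params = lm_parameter_count()
non_lm_params = REPORTED_TOTAL - REPORTED_LM  # condition and pocket encoders

print(f"Estimated LM parameters: {lm_params:,}")
print(f"Reported LM parameters:  {REPORTED_LM:,}")
print(f"Non-LM parameters:       {non_lm_params:,}")
```

Under these assumptions the estimate matches the reported LM count, and the difference between the two reported totals is the share held by the condition encoders and the pocket GVP encoder.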
Molexar uses a shared sequence template for pretraining, SFT, and inference:
<BOS><COND> conditions </COND><SEP><MOL> molecule </MOL><EOS>
During base pretraining, the condition block is present but no condition value is active or injected. This keeps the base model compatible with the same prompt structure used by the universal multi-condition model.
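The template can be illustrated with plain string assembly. This is only a sketch: the helper names `build_prompt` and `build_sequence` are hypothetical, the special tokens are written as literal strings, and it assumes the spaces in the template above are for readability. The real tokenizer and inference script may assemble the sequence differently.

```python
# Special tokens as they appear in the template on this card.
BOS, EOS, SEP = "<BOS>", "<EOS>", "<SEP>"
COND_OPEN, COND_CLOSE = "<COND>", "</COND>"
MOL_OPEN, MOL_CLOSE = "<MOL>", "</MOL>"


def build_prompt(conditions: str = "") -> str:
    """Return the prefix up to the opening molecule tag.

    For the base model the condition block stays empty.
    """
    return f"{BOS}{COND_OPEN}{conditions}{COND_CLOSE}{SEP}{MOL_OPEN}"


def build_sequence(molecule: str, conditions: str = "") -> str:
    """Return a full sequence in the shared template."""
    return f"{build_prompt(conditions)}{molecule}{MOL_CLOSE}{EOS}"


# The base-model prompt: condition block present, nothing inside it.
print(build_prompt())
```

Keeping the empty condition block in the base prompt is what lets the same structure carry active conditions in the multi-condition model.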
Installation
Install Molexar and Fragment-SELFIES before loading the model:
git clone https://github.com/fairydance/Molexar.git
git clone https://github.com/fairydance/Fragment-SELFIES.git
cd Molexar
conda create -n molexar python=3.13
conda activate molexar
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers accelerate datasets evaluate biopython loguru
conda install -c conda-forge rdkit scipy seaborn
python -m pip install -e ../Fragment-SELFIES
python -m pip install -e . --no-deps
python -c "import fragment_selfies; import molexar; print('Molexar environment ready')"
Install the runtime dependencies listed in the Molexar repository documentation. Fragment-SELFIES is required to convert generated Fragment-SELFIES strings to SMILES.
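The one-line import check above can be extended into a small script that reports every missing module at once. This is a minimal sketch using only the standard library; the module list is an assumption based on the install commands on this card, and `missing_modules` is a hypothetical helper, not part of Molexar.

```python
from importlib.util import find_spec

# Top-level import names assumed from the install commands on this card.
REQUIRED_MODULES = ["torch", "transformers", "rdkit", "fragment_selfies", "molexar"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Molexar environment ready")
```

Unlike a plain `import`, `find_spec` only locates each module without executing it, so the check stays fast and does not fail on the first missing package.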
Download
hf download fairydance/molexar-10m-base --local-dir molexar-10m-base
Usage
Run unconditional generation from the Molexar repository:
python scripts/run_inference.py --mode base \
--model_path /path/to/molexar-10m-base \
--num_samples 10 \
--convert_to_smiles \
--canonical \
--output_file base_samples.jsonl \
--output_format jsonl
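The output file can then be inspected from Python. The field names written by `run_inference.py` are not documented on this card, so the sketch below does not assume any. It loads the records and counts which keys appear, which is a reasonable first step before writing downstream code. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_counts(records):
    """Count how often each key appears across dictionary records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


if __name__ == "__main__":
    output_path = Path("base_samples.jsonl")  # file name from the command above
    if output_path.exists():
        samples = load_jsonl(output_path)
        print(f"{len(samples)} records")
        print(dict(field_counts(samples)))
```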
Fragment-constrained generation is also supported. For example, motif extension from a SMILES fragment:
python scripts/run_inference.py --mode base \
--model_path /path/to/molexar-10m-base \
--generation_task motif_extension \
--start_smiles '[*]C1(CC#N)CN(S(=O)(=O)CC)C1' \
--num_samples 10 \
--convert_to_smiles \
--canonical
Supported generation tasks are de_novo, motif_extension, scaffold_decoration, linker_design, scaffold_morphing, and superstructure.
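When scripting many runs, it can help to validate the task name before launching the inference script. The sketch below builds an argument list using only the flags shown in the commands above. The helper `build_inference_command` is hypothetical, and the rule that every task other than `de_novo` needs `--start_smiles` is an assumption; check the Molexar inference script for the inputs each task actually expects.

```python
import shlex

# Task names as listed on this card.
GENERATION_TASKS = (
    "de_novo",
    "motif_extension",
    "scaffold_decoration",
    "linker_design",
    "scaffold_morphing",
    "superstructure",
)


def build_inference_command(model_path, task="de_novo", start_smiles=None, num_samples=10):
    """Build an argument list for scripts/run_inference.py in base mode."""
    if task not in GENERATION_TASKS:
        raise ValueError(f"Unsupported generation task: {task!r}")
    # Assumption: fragment-constrained tasks need a starting fragment.
    if task != "de_novo" and not start_smiles:
        raise ValueError(f"Task {task!r} needs start_smiles")

    command = [
        "python", "scripts/run_inference.py",
        "--mode", "base",
        "--model_path", str(model_path),
        "--num_samples", str(num_samples),
        "--convert_to_smiles",
        "--canonical",
    ]
    if task != "de_novo":
        command += ["--generation_task", task, "--start_smiles", start_smiles]
    return command


example = build_inference_command(
    "/path/to/molexar-10m-base",
    task="motif_extension",
    start_smiles="[*]C1(CC#N)CN(S(=O)(=O)CC)C1",
)
print(shlex.join(example))  # shell-quoted form, safe to paste into a terminal
```

Passing the list to `subprocess.run(example, check=True)` from the Molexar repository root avoids shell-quoting problems with characters such as `[*]` and parentheses in SMILES strings.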
Training
The base model was pretrained on randomized Fragment-SELFIES strings using causal language modeling. The training corpus is derived from UniChem, with SAIR and PLINDER training-set ligand coverage included, after canonicalization, deduplication, and drug-likeness/safety filtering. In the Molexar paper, this corpus contains 135,763,524 Fragment-SELFIES records corresponding to 33,940,881 molecule-condition rows with four randomized folds.
Training script provenance:
examples/train/bjx_h800_pretrain_base.sh
Main training settings from the release script and paper:
| Setting | Value |
|---|---|
| Objective | Causal language modeling |
| Sequence length | 256 |
| Epochs | 2 |
| Batch size | 1000 |
| Learning rate | 2e-4 |
| Warmup steps | 2000 |
| Mixed precision | bfloat16 |
| Distributed training | Full-shard FSDP on 8 H800 GPUs |
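The corpus and schedule numbers above can be cross-checked with a little arithmetic. The record count follows directly from the figures on this card. The step estimate is a rough sketch that assumes the batch size of 1000 is the global batch across all GPUs, that every record is seen once per epoch, and that no gradient accumulation is used; none of that is confirmed here.

```python
import math

# Figures reported on this card.
MOLECULE_ROWS = 33_940_881
FOLDS = 4
REPORTED_RECORDS = 135_763_524
EPOCHS = 2
BATCH_SIZE = 1000  # assumed to be the global batch size
WARMUP_STEPS = 2000

# Four randomized folds per molecule-condition row give the record count.
records_from_folds = MOLECULE_ROWS * FOLDS

# Rough optimizer-step estimate under the assumptions stated above.
steps_per_epoch = math.ceil(REPORTED_RECORDS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps

print(f"Records from folds: {records_from_folds:,}")
print(f"Estimated steps per epoch: {steps_per_epoch:,}")
print(f"Estimated total steps: {total_steps:,}")
print(f"Warmup share of training: {warmup_fraction:.2%}")
```

If the batch size is per device instead, the step counts shrink by the number of GPUs and the warmup share grows accordingly.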
Evaluation Highlights
From the Molexar paper, unconditional generation on 10,000 samples:
| Model | Validity | Uniqueness | Diversity | Quality |
|---|---|---|---|---|
| Molexar-10M Base | 1.0000 | 0.9997 | 0.8824 | 0.8326 |
Quality is defined as the fraction of generated molecules with QED >= 0.6 and SAS <= 4. The model also reached 100% validity across the reported fragment-constrained tasks, with strong drug-like quality for motif extension, linker design, scaffold morphing, scaffold decoration, and superstructure generation.
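The Quality definition above translates directly into code. The sketch below takes precomputed (QED, SAS) pairs and returns the passing fraction over all generated molecules; `quality_fraction` is a hypothetical helper. Computing the scores is left out: one common option is RDKit's `rdkit.Chem.QED.qed` for QED and the `sascorer` script from RDKit Contrib for SAS, though the exact tooling used in the paper is not stated on this card.

```python
def quality_fraction(scores, qed_min=0.6, sas_max=4.0):
    """Fraction of molecules with QED >= qed_min and SAS <= sas_max.

    `scores` is a sequence of (qed, sas) pairs, one per generated molecule.
    """
    if not scores:
        return 0.0
    passing = sum(1 for qed, sas in scores if qed >= qed_min and sas <= sas_max)
    return passing / len(scores)


# Illustrative values only, not results from the model.
example_scores = [(0.7, 3.0), (0.5, 2.0), (0.8, 4.5), (0.6, 4.0)]
print(quality_fraction(example_scores))
```

Both thresholds are inclusive, so a molecule with QED exactly 0.6 and SAS exactly 4 counts as passing.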
Intended Use
This model is intended for research use in molecular generation workflows, including:
- Unconditional Fragment-SELFIES molecule generation.
- Fragment continuation and fragment-constrained ideation.
- Baseline initialization for further Molexar fine-tuning.
- Computational chemistry research on molecular language models.
Generated molecules should be treated as computational hypotheses. They require independent chemical-safety filtering, synthetic feasibility assessment, intellectual-property and dual-use review where relevant, expert medicinal-chemistry assessment, and experimental validation before downstream use.
Limitations
- The base model is not trained to obey active property, pharmacophore, protein-sequence, or pocket conditions.
- Molexar generates molecular strings, not experimentally validated compounds.
- Fragment-SELFIES decoding improves validity but does not guarantee synthetic accessibility, biological activity, safety, or developability.
- The released tokenizer does not include the iodine token [I]; use bromine substitution in start constraints when necessary, as documented by the Molexar inference script.
- Stereochemical and explicit 3D output control are outside the scope of this model.
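For the iodine limitation above, a start fragment can be preprocessed before it is passed as a start constraint. The sketch below is a hypothetical helper, not the logic of the Molexar inference script. It swaps iodine for bromine in a SMILES string while leaving the bracket atoms `[In]` and `[Ir]` untouched. It works on the string only and does not validate the chemistry.

```python
def replace_iodine_with_bromine(smiles: str) -> str:
    """Replace iodine atoms with bromine in a SMILES string.

    Outside brackets an uppercase I is always iodine. Inside brackets,
    I followed by n or r is indium or iridium and is left unchanged.
    """
    pieces = []
    in_bracket = False
    for index, char in enumerate(smiles):
        if char == "[":
            in_bracket = True
        elif char == "]":
            in_bracket = False

        if char == "I":
            next_char = smiles[index + 1] if index + 1 < len(smiles) else ""
            if in_bracket and next_char in ("n", "r"):
                pieces.append(char)  # [In] or [Ir], not iodine
            else:
                pieces.append("Br")
        else:
            pieces.append(char)
    return "".join(pieces)


print(replace_iodine_with_bromine("[*]c1ccc(I)cc1"))
```

Any molecule generated from the substituted fragment contains bromine at that position, so the substitution should be recorded when the results are interpreted.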
License
This model is released under the MIT License.
Citation
If you use this model, please cite Molexar and Fragment-SELFIES:
@misc{lin2026molexar,
title = {Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design},
author = {Lin, Haoyu and Liao, Yiyan and Pan, Jinmei and Ling, Xinliao and Lai, Luhua and Pei, Jianfeng},
year = {2026},
url = {https://molexar.com}
}
Code and resources are listed under Project resources earlier in this card.