# Model documentation & parameters

## Parameters

### Algorithm Version
Which model checkpoint to use (trained on different datasets).

### Task
Whether the multitask model should be used for property prediction or conditional generation (default).

### Input
The input sequence. In the default setting (where `Task` is *Generate* and `Sampling Wrapper` is *True*) this can be a seed SMILES (for the molecule models) or an amino-acid sequence (for the protein models). The model will locally adapt the seed sequence by masking `Fraction to mask` of the tokens. If the `Task` is *Predict*, the sequences are given as SELFIES for the molecule models. Moreover, the tokens that should be predicted (`[MASK]` in the input) have to be given explicitly. Populate the examples to understand this better.
NOTE: When setting `Task` to *Generate* and `Sampling Wrapper` to *False*, the user has maximal control over the generative process and can explicitly decide which tokens should be masked.

### Number of samples
How many samples should be generated (between 1 and 50). If `Task` is *Predict*, this has to be set to 1.

### Search
Decoding search method. Use *Sample* if `Task` is *Generate*. If `Task` is *Predict*, use *Greedy*.

### Tolerance
Precision tolerance; only used if `Task` is *Generate*. This is a single float between 0 and 100 for the tolerated deviation between the desired/primed property and the predicted property of the generated molecule. Given in percent with respect to the property range encountered during training.
NOTE: The tolerance is *only* used for post-hoc filtering of the generated samples.

### Sampling Wrapper
Only used if `Task` is *Generate*. If set to *False*, the user has to provide a full RT sequence as `Input` and has to **explicitly** decide which tokens are masked (see the example below). This gives full control but is tedious. Instead, if `Sampling Wrapper` is set to *True*, the RT stochastically determines which parts of the sequence are masked.
**NOTE**: All arguments below only apply if `Sampling Wrapper` is *True*.

#### Fraction to mask
Specifies the ratio of tokens that can be changed by the model. Argument only applies if `Task` is *Generate* and `Sampling Wrapper` is *True*.

#### Property goal
Specifies the desired target properties for the generation. Needs to be given in the format `<property>:value`. If the model supports multiple properties, separate them with a comma (`,`). Argument only applies if `Task` is *Generate* and `Sampling Wrapper` is *True*.

#### Tokens to mask
Optionally specifies which tokens (atoms, bonds, etc.) can be masked. Please separate multiple tokens with a comma (`,`). If not specified, all tokens can be masked. Argument only applies if `Task` is *Generate* and `Sampling Wrapper` is *True*.

#### Substructures to mask
Optionally specifies a list of substructures that should *definitely* be masked (i.e., excluded from stochastic masking). Given in SMILES format. If multiple are provided, separate them with a comma (`,`). Argument only applies if `Task` is *Generate* and `Sampling Wrapper` is *True*.
*NOTE*: Most models operate on SELFIES, and the matching of the substructures occurs in SELFIES simply on a string level.

#### Substructures to keep
Optionally specifies a list of substructures that should definitely be present in the target sample (i.e., excluded from stochastic masking). Given in SMILES format. Argument only applies if `Task` is *Generate* and `Sampling Wrapper` is *True*.
*NOTE*: This keeps tokens even if they are included in `tokens_to_mask`.
*NOTE*: Most models operate on SELFIES, and the matching of the substructures occurs in SELFIES simply on a string level.
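The same parameters can also be driven programmatically. Below is a minimal sketch using the [gt4sd-core](https://github.com/GT4SD/gt4sd-core) Python API, assuming the `RegressionTransformer` algorithm with the `RegressionTransformerMolecules` configuration and the `<esol>` property token of the solubility model; exact argument names and property tokens may differ between versions, so treat this as illustrative rather than canonical.

```python
from gt4sd.algorithms.conditional_generation.regression_transformer import (
    RegressionTransformer,
    RegressionTransformerMolecules,
)

# `Task` = Generate, `Sampling Wrapper` = True: the target is a plain seed
# SMILES and the wrapper stochastically decides which tokens to mask.
configuration = RegressionTransformerMolecules(
    algorithm_version="solubility",  # `Algorithm Version`
    search="sample",                 # `Search`: use *Sample* for generation
    temperature=1.4,                 # sampling temperature
    tolerance=5,                     # `Tolerance`: 5% of the training property range
    sampling_wrapper={
        "fraction_to_mask": 0.2,            # `Fraction to mask`
        "property_goal": {"<esol>": -3.5},  # `Property goal` (token name assumed)
    },
)

# Seed molecule to be locally adapted (ibuprofen, as an example).
model = RegressionTransformer(
    configuration=configuration,
    target="CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
)
samples = list(model.sample(10))  # `Number of samples`
```

With `Sampling Wrapper` set to *False*, `target` would instead be a full RT sequence (roughly: property token with value, a `|` separator, and the partially `[MASK]`ed molecular sequence), giving explicit control over the masked positions.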
# Model card -- Regression Transformer

**Model Details**: The [Regression Transformer](https://www.nature.com/articles/s42256-023-00639-z) is a multitask Transformer that reformulates regression as a conditional sequence modeling task. This yields a dichotomous language model that seamlessly integrates property prediction with property-driven conditional generation.

**Developers**: Jannis Born and Matteo Manica from IBM Research.

**Distributors**: Original authors' code wrapped and distributed by the GT4SD team (2023) from IBM Research.

**Model date**: Preprint released in 2022; published in 2023 in *Nature Machine Intelligence*.

**Algorithm version**: Models trained and distributed by the original authors.

- **Molecules: QED**: Model trained on 1.6M molecules (SELFIES) from ChEMBL and their QED scores.
- **Molecules: Solubility**: QED model finetuned on the ESOL dataset from [Delaney et al. (2004), *J. Chem. Inf. Comput. Sci.*](https://pubs.acs.org/doi/10.1021/ci034243x) to predict water solubility. Model trained on augmented SELFIES.
- **Molecules: Cosmo_acdl**: Model finetuned on 56k molecules with two properties (*pKa_ACDL* and *pKa_COSMO*). Model used augmented SELFIES.
- **Molecules: Pfas**: Model finetuned on ~1k PFAS (perfluoroalkyl and polyfluoroalkyl substances) molecules with 9 properties, including some experimentally measured ones (biodegradability, LD50, etc.) and some synthetic ones (SCScore, molecular weight). Model trained on augmented SELFIES.
- **Molecules: Logp_and_synthesizability**: Model trained on 2.9M molecules (SELFIES) from PubChem with **two** synthetic properties: the logP (partition coefficient) and the [SCScore by Coley et al. (2018), *J. Chem. Inf. Model.*](https://pubs.acs.org/doi/full/10.1021/acs.jcim.7b00622).
- **Molecules: Crippen_logp**: Model trained on 2.9M molecules (SMILES) from PubChem, but *only* on logP (partition coefficient).
- **Molecules: Reactions: USPTO**: Model trained on 2.8M [chemical reactions](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
- **Molecules: Polymers: ROP Catalyst**: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. Model used three properties: conversion, PDI, and molecular weight. Model trained with augmented SELFIES, optimized only to generate catalysts, given a monomer and the property constraints. Try the above UI example and see [Park et al. (2022, ChemRxiv)](https://chemrxiv.org/engage/chemrxiv/article-details/62b60865e84dd185e60214af) for details.
- **Molecules: Polymers: Block copolymer**: Model finetuned on ~1k block copolymers with a novel string representation developed for polymers. Model used two properties: dispersity and MnGPC. This is the first generative model for block copolymers. Try the above UI example and see [Park et al. (2022, ChemRxiv)](https://chemrxiv.org/engage/chemrxiv/article-details/62b60865e84dd185e60214af) for details.
- **Proteins: Stability**: Model pretrained on 2.6M peptides from UniProt with the Boman index as property. Finetuned on the [**Stability**](https://www.science.org/doi/full/10.1126/science.aan0693) dataset from the [TAPE benchmark](https://proceedings.neurips.cc/paper/2019/hash/37f65c068b7723cd7809ee2d31d7861c-Abstract.html), which has ~65k samples.
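All of the checkpoints above expose the same interface, so property prediction (`Task` = *Predict*) differs from the generation sketch earlier only in the decoding settings and the input format. A hedged sketch, assuming the QED checkpoint is named `qed`, uses the `<qed>` property token, and follows the RT sequence syntax described under **Parameters**:

```python
from gt4sd.algorithms.conditional_generation.regression_transformer import (
    RegressionTransformer,
    RegressionTransformerMolecules,
)

# `Task` = Predict: greedy decoding and a single sample. The target is an
# explicit RT sequence: masked property tokens, a `|` separator, and the
# molecule as SELFIES (populate the UI examples for the canonical syntax).
configuration = RegressionTransformerMolecules(
    algorithm_version="qed",  # assumed version name of the QED checkpoint
    search="greedy",          # `Search`: use *Greedy* for prediction
)
model = RegressionTransformer(
    configuration=configuration,
    target="<qed>[MASK][MASK][MASK][MASK][MASK]|[C][C][O]",
)
prediction = list(model.sample(1))  # `Number of samples` must be 1
```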
**Model type**: A Transformer-based language model that is trained on alphanumeric sequences to simultaneously perform sequence regression and conditional sequence generation.

**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**: All models are trained with an alternating training scheme that alternates between optimizing the cross-entropy loss on the property tokens ("regression") and the self-consistency objective on the molecular tokens. See the [Regression Transformer](https://arxiv.org/abs/2202.01338) paper for details.

**Paper or other resource for more information**: The [Regression Transformer](https://arxiv.org/abs/2202.01338) paper. See the [source code](https://github.com/IBM/regression-transformer) for details.

**License**: MIT

**Where to send questions or comments about the model**: Open an issue on the [GT4SD repository](https://github.com/GT4SD/gt4sd-core).

**Intended Use. Use cases that were envisioned during development**: Chemical research, in particular drug discovery.

**Primary intended uses/users**: Researchers and computational chemists using the model for model comparison or research exploration purposes.

**Out-of-scope use cases**: Production-level inference; producing molecules with harmful properties.

**Factors**: Not applicable.

**Metrics**: High predictive power for the properties of that specific algorithm version.

**Datasets**: Different ones, as described under **Algorithm version**.

**Ethical Considerations**: No specific considerations, as no private/personal data is involved. Please consult with the authors in case of questions.

**Caveats and Recommendations**: Please consult with the authors in case of questions.

Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596)

## Citation

```bib
@article{born2023regression,
  title={Regression Transformer enables concurrent sequence regression and generation for molecular language modelling},
  author={Born, Jannis and Manica, Matteo},
  journal={Nature Machine Intelligence},
  volume={5},
  number={4},
  pages={432--444},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```