title: MHNfs
emoji: 🔬
short_description: Activity prediction for low-data scenarios
colorFrom: gray
colorTo: gray
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: true

Activity Predictions with MHNfs for low-data scenarios

⚙️ Under the hood

The predictive model (MHNfs) used in this application was specifically designed and trained for low-data scenarios. It predicts whether a molecule is active or inactive. The predicted activity is a continuous value between 0 and 1 that can be read much like a probability: the closer the value is to 1, the more confident the model is that the molecule is active, and the closer it is to 0, the more confident it is that the molecule is inactive.
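As a minimal sketch of how such scores might be used downstream, the snippet below ranks molecules by predicted activity and attaches a label based on a cutoff. The function name `rank_and_label`, the 0.5 threshold, and the example scores are illustrative assumptions, not part of MHNfs; pick a cutoff that suits your own screening campaign.

```python
def rank_and_label(smiles, scores, threshold=0.5):
    """Sort molecules by predicted activity (highest first) and attach a label.

    `threshold` is an assumed cutoff, not a value prescribed by MHNfs.
    """
    if len(smiles) != len(scores):
        raise ValueError("smiles and scores must have the same length")
    ranked = sorted(zip(smiles, scores), key=lambda pair: pair[1], reverse=True)
    return [
        (smi, score, "active" if score >= threshold else "inactive")
        for smi, score in ranked
    ]


# Made-up scores for illustration only
example = rank_and_label(["CCO", "c1ccccc1", "CC(=O)O"], [0.12, 0.91, 0.48])
for smi, score, label in example:
    print(f"{smi}\t{score:.2f}\t{label}")
```

Ranking first and labeling second keeps the continuous score visible, which is usually more informative for prioritizing compounds than the binary label alone.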

The model was trained on the FS-Mol dataset, which includes 5120 tasks (roughly 5000 tasks were used for training, the rest for evaluation). The training tasks are listed here: https://github.com/microsoft/FS-Mol/tree/main/datasets/targets.

🎯 About few-shot learning and the model MHNfs

Few-shot learning is a sub-field of machine learning that aims to provide predictive models for scenarios in which only a small amount of data is available.

MHNfs is a few-shot learning model designed specifically for drug discovery applications. It uses the provided knowledge, i.e. the known active and inactive molecules, as context for predicting the activity of new query molecules. More precisely, the provided active and inactive molecules are associated with a large set of general molecules, called context molecules, to enrich the provided information and to remove spurious correlations arising from the decoration of molecules. This is analogous to a Large Language Model that uses not only the information in the current prompt as context but also has access to much more information, e.g. a prompting history.
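To make the idea of "associating" a molecule with context molecules more concrete, here is a conceptual sketch using attention-style (softmax-weighted) retrieval over toy embedding vectors. It is not the authors' implementation: the function names, the `beta` scaling parameter, and the 2-D vectors are all hypothetical, and the real model works with learned representations rather than hand-written lists.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def enrich(molecule, context, beta=1.0):
    """Add a similarity-weighted average of context embeddings to `molecule`.

    Illustrative only: all vectors here are toy values, and `beta` is an
    assumed scaling factor for the similarity scores.
    """
    weights = softmax([beta * dot(molecule, c) for c in context])
    retrieved = [
        sum(w * c[i] for w, c in zip(weights, context))
        for i in range(len(molecule))
    ]
    return [m + r for m, r in zip(molecule, retrieved)]


# Toy 2-D embeddings (hypothetical values)
context_set = [[1.0, 0.0], [0.0, 1.0]]
support_active = [2.0, 0.0]
enriched = enrich(support_active, context_set)
print(enriched)
```

In this toy setting the support molecule is pulled towards the context vector it resembles most, which is the intuition behind enriching a small support set with information from many unlabeled molecules.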

💻 Run the prediction pipeline locally for larger screening chunks

Get started:

# Copied from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# Clone repo
git clone https://huggingface.co/spaces/tschouis/mhnfs

# Alternatively, if you want to clone without large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/spaces/tschouis/mhnfs

Install requirements:

pip install -r requirements.txt 

Note that this command was tested inside a conda environment with Python 3.7.

Run the prediction pipeline:

For your screening, load the model, i.e. the Activity Predictor, into your Python file or notebook and run it:

from src.prediction_pipeline import ActivityPredictor

# Define inputs
query_smiles = ["C1CCCCC1", "C1CCCCC1", "C1CCCCC1", "C1CCCCC1"] # Replace with your data
support_actives_smiles = ["C1CCCCC1", "C1CCCCC1"] # Replace with your data
support_inactives_smiles = ["C1CCCCC1", "C1CCCCC1"] # Replace with your data
    
# Make predictions
predictor = ActivityPredictor()  # Assumes the constructor needs no arguments
predictions = predictor.predict(query_smiles, support_actives_smiles, support_inactives_smiles)
  • Provide molecules in SMILES notation.
  • Make sure that the inputs to the Activity Predictor are comma-separated lists, flattened NumPy arrays, or pandas DataFrames. In the latter case, there should be a "smiles" column (both upper- and lower-case "SMILES" are accepted). All other columns are ignored.
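For larger screening sets, one option is to feed the query molecules to the predictor in chunks. The helper below is a sketch under stated assumptions: `predict_in_chunks`, `chunk_size`, and `dummy_predict` are hypothetical names, and it assumes the prediction function returns one score per query molecule. The stand-in predictor only exists so the snippet runs without the model; in real use you would pass `predictor.predict` instead.

```python
def predict_in_chunks(predict_fn, query_smiles, actives, inactives, chunk_size=1000):
    """Call `predict_fn` on successive slices of `query_smiles` and collect results.

    `predict_fn` is expected to take (queries, actives, inactives), like
    `predictor.predict`; one returned score per query is an assumption.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    results = []
    for start in range(0, len(query_smiles), chunk_size):
        chunk = query_smiles[start:start + chunk_size]
        results.extend(predict_fn(chunk, actives, inactives))
    return results


# Stand-in predictor so the sketch runs without the model weights;
# replace with `predictor.predict` in real use.
def dummy_predict(queries, actives, inactives):
    return [0.5 for _ in queries]


scores = predict_in_chunks(dummy_predict, ["C1CCCCC1"] * 5, ["CCO"], ["CCN"], chunk_size=2)
print(len(scores))
```

The same support set is reused for every chunk, so the results should not depend on how the queries are split, provided the predictor treats query molecules independently.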

Run the app locally with Streamlit:

# Navigate into root directory of this project
cd .../whatever_your_dir_name_is/ # Replace with your path

# Run streamlit app
python -m streamlit run app.py

📚 Cite us

@inproceedings{
    schimunek2023contextenriched,
    title={Context-enriched molecule representations improve few-shot drug discovery},
    author={Johannes Schimunek and Philipp Seidl and Lukas Friedrich and Daniel Kuhn and Friedrich Rippmann and Sepp Hochreiter and Günter Klambauer},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=XrMWUuEevr}
}