---
title: MHNfs
emoji: 🔬
short_description: Activity prediction for low-data scenarios
colorFrom: gray
colorTo: gray
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: true
---

# Activity Predictions with MHNfs for low-data scenarios

## ⚙️ Under the hood
The predictive model (MHNfs) used in this application was specifically designed and trained for low-data scenarios. It predicts whether a molecule is active or inactive. The predicted activity is a continuous value between 0 and 1. Similar to a probability, a value closer to 1 means the model is more confident that the molecule is active, and a value closer to 0 means it is more confident that the molecule is inactive.
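
As a rough illustration of how such scores might be used downstream, the sketch below ranks molecules by predicted activity and applies a cut-off. The scores are made-up values and the 0.5 threshold is an assumption for illustration, not a value prescribed by MHNfs; pick a threshold that fits your screening.

```python
import numpy as np

# Hypothetical model outputs for four query molecules (illustration only)
scores = np.array([0.91, 0.12, 0.55, 0.47])

# Assumed decision threshold; 0.5 is a common default, not prescribed by MHNfs
threshold = 0.5
predicted_active = scores >= threshold

# Rank molecules from most to least likely active
ranking = np.argsort(-scores)

print(predicted_active.tolist())  # [True, False, True, False]
print(ranking.tolist())           # [0, 2, 3, 1]
```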

The model was trained on the FS-Mol dataset, which includes 5120 tasks (roughly 5000 were used for training, the rest for evaluation). The training tasks are listed here: https://github.com/microsoft/FS-Mol/tree/main/datasets/targets.

## 🎯 About few-shot learning and the model MHNfs
Few-shot learning is a sub-field of machine learning that aims to provide predictive models for scenarios in which only little data is available.

MHNfs is a few-shot learning model designed specifically for drug discovery applications. It uses the provided knowledge, i.e. the known active and inactive molecules, as context for predicting the activity of the newly requested molecules. More precisely, the provided active and inactive molecules are associated with a large set of general molecules, called context molecules, to enrich the provided information and to remove spurious correlations arising from the decoration of molecules. This is analogous to a Large Language Model that uses not only the information in the current prompt as context but also has access to far more information, e.g. a prompting history.
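
One way to picture this association step is as soft, attention-style retrieval over the context set: each provided molecule is re-expressed as a similarity-weighted mix of context molecules. The snippet below is a simplified sketch under that assumption, not the MHNfs implementation. Random vectors stand in for learned molecule embeddings, and the names `enrich_with_context` and `beta` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enrich_with_context(molecules, context, beta=1.0):
    """Re-express each molecule embedding as a weighted mix of context embeddings."""
    similarities = beta * molecules @ context.T  # (n_molecules, n_context)
    weights = softmax(similarities, axis=1)      # each row sums to 1
    return weights @ context                     # (n_molecules, dim)

rng = np.random.default_rng(0)
context = rng.normal(size=(100, 8))  # stand-in for a large set of context molecules
support = rng.normal(size=(4, 8))    # stand-in for known actives and inactives

enriched = enrich_with_context(support, context)
print(enriched.shape)  # (4, 8)
```
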
## 💻 Run the prediction pipeline locally for larger screening chunks

### Get started:
```bash
# Copied from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# Clone repo
git clone https://huggingface.co/spaces/tschouis/mhnfs

# Alternatively, if you want to clone without large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/spaces/tschouis/mhnfs
```

### Install requirements
```bash
pip install -r requirements.txt
```
Notably, this command was tested inside a conda environment with Python 3.7.

### Run the prediction pipeline:
For your screening, load the model, i.e. the **Activity Predictor**, into your Python file or notebook and simply run it:
```python
from src.prediction_pipeline import ActivityPredictor

# Create the predictor (assumes the constructor needs no arguments)
predictor = ActivityPredictor()

# Define inputs
query_smiles = ["C1CCCCC1", "C1CCCCC1", "C1CCCCC1", "C1CCCCC1"]  # Replace with your data
support_actives_smiles = ["C1CCCCC1", "C1CCCCC1"]  # Replace with your data
support_inactives_smiles = ["C1CCCCC1", "C1CCCCC1"]  # Replace with your data

# Make predictions
predictions = predictor.predict(query_smiles, support_actives_smiles, support_inactives_smiles)
```
* Provide molecules in SMILES notation.
* Make sure that the inputs to the Activity Predictor are either comma-separated lists, flattened numpy arrays, or pandas DataFrames. In the latter case, there should be a "smiles" column (both the upper-case and lower-case spellings, "SMILES" and "smiles", are accepted). All other columns are ignored.
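
The accepted input forms listed above could be prepared as sketched below. The SMILES strings and the `compound_id` column are made-up illustration data, and the final call is left as a comment because it requires the cloned repository.

```python
import numpy as np
import pandas as pd

# Option 1: plain Python list of SMILES strings
query_list = ["CCO", "c1ccccc1", "CC(=O)O"]

# Option 2: numpy array, flattened to one dimension
query_array = np.array([["CCO"], ["c1ccccc1"], ["CC(=O)O"]]).flatten()

# Option 3: pandas DataFrame with a "smiles" column
query_df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "compound_id": ["cpd-1", "cpd-2", "cpd-3"],  # hypothetical extra column, ignored
})

# Each input to predictor.predict is expected in one of these forms, e.g.:
# predictions = predictor.predict(query_list, support_actives_smiles, support_inactives_smiles)
```
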
### Run the app locally with streamlit:
```bash
# Navigate into root directory of this project
cd .../whatever_your_dir_name_is/ # Replace with your path

# Run streamlit app
python -m streamlit run app.py
```

## 📚 Cite us
```
@inproceedings{
    schimunek2023contextenriched,
    title={Context-enriched molecule representations improve few-shot drug discovery},
    author={Johannes Schimunek and Philipp Seidl and Lukas Friedrich and Daniel Kuhn and Friedrich Rippmann and Sepp Hochreiter and Günter Klambauer},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=XrMWUuEevr}
}
```