---
title: MHNfs
emoji: 🔬
short_description: Activity prediction for low-data scenarios
colorFrom: gray
colorTo: gray
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: true
---

# Activity Predictions with MHNfs for low-data scenarios

## ⚙️ Under the hood
<div style="text-align: justify">
    The predictive model used in this application, MHNfs, was specifically designed and
    trained for low-data scenarios. It predicts whether a molecule is active or
    inactive. The predicted activity is a continuous value between 0 and 1 and can be
    read like a probability: the closer the value is to 1, the more confident the model
    is that the molecule is active; the closer it is to 0, the more confident it is that
    the molecule is inactive.<br>
    <br>
    The model was trained on the FS-Mol dataset, which
    includes 5120 tasks (roughly 5000 were used for training, the rest for evaluation).
    The training tasks are listed here:
    <a href="https://github.com/microsoft/FS-Mol/tree/main/datasets/targets" 
    target="_blank">https://github.com/microsoft/FS-Mol/tree/main/datasets/targets</a>.
</div>
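
Since the predicted activity is a value between 0 and 1, a common way to use it in a screening is to rank molecules by score. The following is a minimal sketch of that idea, not part of this repository: the helper name `rank_by_activity`, the example scores, and the 0.5 cut-off are all assumptions made for illustration.

```python
from typing import List, Tuple


def rank_by_activity(
    smiles: List[str], scores: List[float], threshold: float = 0.5
) -> List[Tuple[str, float, str]]:
    """Sort molecules by predicted activity, highest first, and attach a label."""
    if len(smiles) != len(scores):
        raise ValueError("smiles and scores must have the same length")
    ranked = sorted(zip(smiles, scores), key=lambda pair: pair[1], reverse=True)
    # The 0.5 cut-off is an assumption for illustration, not a recommendation.
    return [
        (smi, score, "active" if score >= threshold else "inactive")
        for smi, score in ranked
    ]


# Made-up scores, standing in for real model output
ranked = rank_by_activity(["CCO", "c1ccccc1", "CC(=O)O"], [0.12, 0.87, 0.55])
for smi, score, label in ranked:
    print(f"{smi}\t{score:.2f}\t{label}")
```

In practice, the ranking itself is often more useful than a hard cut-off, because a suitable threshold may differ from task to task.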

## 🎯 About few-shot learning and the model MHNfs
<div style="text-align: justify">
    <b>Few-shot learning</b> is a subfield of machine learning that aims to provide
    predictive models for scenarios in which only a small amount of data is available.<br>
    <br>
    <b>MHNfs</b> is a few-shot learning model designed specifically for drug
    discovery applications. It uses the input prompt so that
    the available knowledge, i.e. the known active and inactive molecules,
    serves as context for predicting the activity of the new query molecules.
    More precisely, the provided active and inactive molecules are associated with a
    large set of general molecules, called context molecules, to enrich the
    provided information and to remove spurious correlations arising from the
    decoration of molecules. This is analogous to a Large Language Model that
    uses not only the information in the current prompt as context but
    also has access to much more information, e.g., a prompting history.
    </div>
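
To make the general few-shot idea more concrete, here is a toy sketch in which a query is scored by a similarity-weighted vote over the known actives and inactives. This is explicitly not the MHNfs architecture: the binary feature vectors, the cosine similarity, the `temperature` parameter, and the function names are all assumptions chosen for illustration, and the context-molecule enrichment described above is left out entirely.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_few_shot_score(
    query: Sequence[float],
    actives: List[Sequence[float]],
    inactives: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Softmax-weighted average of support labels (1 = active, 0 = inactive)."""
    support = [(vec, 1.0) for vec in actives] + [(vec, 0.0) for vec in inactives]
    if not support:
        raise ValueError("at least one support molecule is required")
    logits = [cosine(query, vec) / temperature for vec, _ in support]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    weighted_labels = sum(w * label for w, (_, label) in zip(weights, support))
    return weighted_labels / sum(weights)


# Made-up fingerprint-like vectors, not real molecular descriptors
actives = [[1, 0, 1, 0], [1, 1, 1, 0]]
inactives = [[0, 1, 0, 1], [0, 0, 0, 1]]
score_like_active = toy_few_shot_score([1, 0, 1, 0], actives, inactives)
score_like_inactive = toy_few_shot_score([0, 1, 0, 1], actives, inactives)
```

The score always lies between 0 and 1 because it is a weighted average of 0/1 labels. A query that resembles the known actives ends up close to 1, and one that resembles the known inactives ends up close to 0.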

## 💻 Run the prediction pipeline locally for larger screening chunks

### Get started:
```bash
# Copied from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# Clone repo
git clone https://huggingface.co/spaces/tschouis/mhnfs

# Alternatively, if you want to clone without large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/spaces/tschouis/mhnfs
```

### Install requirements
```bash
pip install -r requirements.txt 
```
Note that this command was tested inside a conda environment with Python 3.7.

### Run the prediction pipeline:
For your screening, load the model, i.e. the **Activity Predictor**, into your Python file or notebook and run it:
```python
from src.prediction_pipeline import ActivityPredictor

# Create the predictor (assumes the constructor needs no arguments)
predictor = ActivityPredictor()

# Define inputs
query_smiles = ["C1CCCCC1", "C1CCCCC1", "C1CCCCC1", "C1CCCCC1"] # Replace with your data
support_actives_smiles = ["C1CCCCC1", "C1CCCCC1"] # Replace with your data
support_inactives_smiles = ["C1CCCCC1", "C1CCCCC1"] # Replace with your data

# Make predictions
predictions = predictor.predict(query_smiles, support_actives_smiles, support_inactives_smiles)
```

* Provide molecules in SMILES notation.
* Make sure that the inputs to the Activity Predictor are Python lists, flattened NumPy arrays, or pandas DataFrames. A DataFrame must contain a "smiles" column (both lower-case "smiles" and upper-case "SMILES" are accepted); all other columns are ignored.
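
For larger screening libraries, it may help to pass the query molecules to the predictor in chunks and reuse the same support set for each chunk. The sketch below shows one way to do that under stated assumptions: `predict_in_chunks` and `dummy_predict` are hypothetical helpers that are not part of this repository, and the prediction function is assumed to return one score per query molecule, in input order.

```python
from typing import Callable, List, Sequence


def predict_in_chunks(
    predict_fn: Callable[[Sequence[str], Sequence[str], Sequence[str]], Sequence[float]],
    query_smiles: Sequence[str],
    support_actives_smiles: Sequence[str],
    support_inactives_smiles: Sequence[str],
    chunk_size: int = 1000,
) -> List[float]:
    """Call predict_fn on consecutive slices of the queries and join the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    scores: List[float] = []
    for start in range(0, len(query_smiles), chunk_size):
        chunk = query_smiles[start:start + chunk_size]
        # The same support set is reused for every chunk.
        scores.extend(predict_fn(chunk, support_actives_smiles, support_inactives_smiles))
    return scores


def dummy_predict(queries, actives, inactives):
    """Stand-in for predictor.predict so the sketch runs without the model."""
    return [0.5 for _ in queries]


queries = ["C1CCCCC1"] * 2500  # Replace with your data
scores = predict_in_chunks(
    dummy_predict, queries, ["C1CCCCC1"], ["C1CCCCC1"], chunk_size=1000
)
```

To use this with the real model, `predictor.predict` would be passed in place of `dummy_predict`. The chunk size of 1000 is arbitrary and would need to be adjusted to the available memory.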



### Run the app locally with Streamlit:
```bash
# Navigate into root directory of this project
cd .../whatever_your_dir_name_is/ # Replace with your path

# Run streamlit app
python -m streamlit run app.py
```

## 📚 Cite us 

```
@inproceedings{
    schimunek2023contextenriched,
    title={Context-enriched molecule representations improve few-shot drug discovery},
    author={Johannes Schimunek and Philipp Seidl and Lukas Friedrich and Daniel Kuhn and Friedrich Rippmann and Sepp Hochreiter and Günter Klambauer},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=XrMWUuEevr}
}
```