---
license: cc-by-nc-4.0
language:
- en
tags:
- drug discovery
- chemistry
- biology
- omics
datasets:
- f-galkin/batman2
---

<p align="left">
  <sub>
  <b>What's new?</b>
  <br>
  🐍📓 Geroprotector identification in TCM <a href="https://huggingface.co/datasets/f-galkin/batman2/blob/main/demo/HF_TCM%2BP3GPT_sceening.ipynb">Jupyter Notebook</a> (13/8/2024)
  </sub>
  <br style="line-height: 5px" />
</p>
<h1 align="center"> Precious3GPT </h1>
<h3 align="center"> A multimodal multi-omics multi-species multi-tissue language model. </h3>
<p align="center">
  📃 <a href="https://doi.org/10.1101/2024.07.25.605062" target="_blank">Pre-print</a> • 👾 <a href="https://discord.gg/P4PWFNbYFg" target="_blank">Discord bot</a> • 🧬 <a href="https://insilico.com/repository/precious3gpt" target="_blank">Validation digest</a>
  <br>
  𝕏 <a href="https://x.com/precious_gpt" target="_blank">@precious_gpt</a>
</p>
<div align=center><img src="P3GPT_architecture.png" width="80%" height="80%" /></div>

- **Developer**: [Insilico Medicine](https://insilico.com/precious)
- **License**: cc-by-nc-4.0
- **Model size**: 175.1 billion parameters (Core Model 89.4 million, Text modality 175 billion, Knowledge Graph modality 8.2 million)
- **Domain**: Biomedical
- **Base architecture**: [MPT](https://huggingface.co/mosaicml/mpt-7b)


<h1 align="center"> Model summary </h1>

- Precious3GPT (P3GPT) is a unique language model that has been trained on 1.2MM omics data points, knowledge graphs, and biomedical texts (PubMed) to be used in drug discovery and aging research;

- P3GPT simulates biological processes on an omics level to return the transcriptomic, epigenetic, or proteomic signatures of a wide variety of perturbators;

- Various modes of execution allow users to replicate the workflows of chemical screenings, case-control observational studies, and other popular research settings;

- The context of P3GPT-simulated experiments can be defined with >60k biomedical entities, including 3 species, 569 tissues and cell lines, 635 health conditions, and 22k small molecules;

- You may work with P3GPT either by downloading the model weights for a local deployment or by interacting with the Discord bot on Insilico Medicine's official server.

<h1 align="center"> Model usage guide </h1>

### Run model with an endpoint
<details>
  <summary style="font-weight:600">Details</summary>

**Step 1 - connect to endpoint**
```python

import requests

API_URL = "https://cu2s6lgb4jew3tht.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept" : "application/json",
    "Authorization": "Bearer hf_XXXX",
    "Content-Type": "application/json" 
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

```

**Step 2 - create input for endpoint**

```python
import json
with open('./generation-configs/meta2diff.json', 'r') as f:
    config_data = json.load(f)

# prepare request configuration
request_config = {"inputs": config_data, "mode": "meta2diff", "parameters": {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137
}}

```

**Actual request processed by Precious3GPT**
```text
[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>
```

**Step 3 - send the request to endpoint**
```python
output = query(request_config)
```


**Endpoint output structure**
```json
{
    "output": {
        "up": List, 
        "down": List
    },
    "mode": String, // The generation mode that was selected
    "message": "Done!",  // or an error message
    "input": String // The input prompt that was passed to the model
}
```

Note: If the ```mode``` was supposed to generate compounds, the output would contain ```compounds: List```.
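
The response structure above can be unpacked with a small helper. This is a minimal sketch, not part of the P3GPT code base: the function name `extract_signature` is hypothetical, and it assumes that any `message` other than `"Done!"` signals a failure. It also tolerates the nested `[[...]]` gene lists shown in the examples further down.

```python
def extract_signature(response):
    """Return (up, down, compounds) from a P3GPT response dictionary.

    Hypothetical helper: it relies only on the fields documented above.
    """
    # Assumption: anything other than "Done!" is treated as an error.
    if response.get("message") != "Done!":
        raise RuntimeError(f"P3GPT request failed: {response.get('message')}")

    output = response.get("output", {})

    def first_list(key):
        value = output.get(key) or []
        # The examples in this README show gene lists nested as [[...]];
        # unwrap one level if that is the case.
        if value and isinstance(value[0], list):
            return value[0]
        return value

    # "compounds" is only present for compound-generating modes.
    return first_list("up"), first_list("down"), output.get("compounds", [])
```

With the `output` variable from Step 3, `up, down, compounds = extract_signature(output)` would give flat gene lists and an empty `compounds` list for `meta2diff`.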

</details>

---

### Run model locally
<details>
  <summary style="font-weight:600">Details</summary>
    
    
**Requirements:** ```torch==2.0.1 einops==0.7.0 huggingface-hub==0.20.1 transformers==4.35.0```

 1. Download the repository https://huggingface.co/insilicomedicine/precious3-gpt-multi-modal
    
 2. Inside the repository execute:
```python

# init handler
from handler import EndpointHandler
precious3gpt_handler = EndpointHandler(path='./')

import json
with open('./generation-configs/meta2diff.json', 'r') as f:
    config_data = json.load(f)

# prepare request configuration
request_config = {"inputs": config_data, 
                  "mode": "meta2diff", 
                  "parameters": {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137
}}

output = precious3gpt_handler(request_config)

```
</details>
    
---
## Precious3GPT request configuration


### Instruction (`inputs.instruction` in `config`)

Instructions define the experimental setting P3GPT will be simulating using the information provided in the prompt. 

1. `disease2diff2disease` - generate an omics signature characterizing a disease / determine the disease based on a given signature;
2. `compound2diff2compound` - generate an omics signature of a compound-induced perturbation / determine the compound given its omics signature;
3. `age_group2diff2age_group` - generate differential omics for age groups / determine the age groups from given differential gene lists.


### Generation Modes (`mode` in config)

Generation modes are not part of the prompt passed to P3GPT, but they may affect the way P3GPT's response is processed and presented:

1. `meta2diff`: The ```compound2diff2compound``` instruction can be executed either way. This mode tells P3GPT to return differentially expressed genes and not compounds;
2. `diff2compound`: The reverse of the ```meta2diff``` mode. Make sure to fill in 'up' and 'down' in the prompt first!
3. `meta2diff2compound`: Runs ```meta2diff``` first and applies ```diff2compound``` to its output with one call.

See ```Precious3GPT_example.ipynb``` tutorial notebook to learn more about building P3GPT requests.
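
The mode rules above can be checked before a request is sent. The following is a minimal sketch under stated assumptions: `build_request` is a hypothetical helper (not part of the repository), the default parameters are copied from the usage guide in this README, and "fill in 'up' and 'down'" is interpreted loosely as "at least one of the two lists is non-empty".

```python
import copy

# Modes listed in this README.
VALID_MODES = {"meta2diff", "diff2compound", "meta2diff2compound"}

# Default sampling parameters, copied from the usage guide above.
DEFAULT_PARAMETERS = {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137,
}


def build_request(inputs, mode, **overrides):
    """Assemble a request configuration and catch obvious mode mistakes."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")

    # Assumption: diff2compound needs a signature, i.e. at least one gene list.
    if mode == "diff2compound" and not (inputs.get("up") or inputs.get("down")):
        raise ValueError("diff2compound requires 'up' and/or 'down' genes in the inputs")

    parameters = dict(DEFAULT_PARAMETERS)
    parameters.update(overrides)
    # Deep-copy so later edits to `inputs` do not change the request.
    return {"inputs": copy.deepcopy(inputs), "mode": mode, "parameters": parameters}
```

For example, `build_request(config_data, "meta2diff", random_seed=42)` returns a dictionary with the same shape as `request_config` in the usage guide.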

---


### Other meta-data (`inputs.` in config)

P3GPT can only simulate the experiments featuring the biomedical entities and metadata values present in ```p3_entities_with_type.csv```

If you aim to study a tissue, a compound, or something else using P3GPT, make sure to check that the names of the entities you are using match those in this file. 
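
One way to run this check programmatically is sketched below. The column layout of ```p3_entities_with_type.csv``` is not described here, so the sketch makes no assumption about column names and simply collects every non-empty cell as a known name. Both function names and the default list of checked fields are hypothetical.

```python
import csv


def load_entity_names(csv_path):
    """Collect every non-empty cell of the CSV into a set of known names.

    Assumption: entity names appear verbatim as cell values in the file.
    """
    names = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            names.update(cell.strip() for cell in row if cell.strip())
    return names


def find_unknown_entities(inputs, known_names,
                          fields=("tissue", "cell", "efo", "drug", "species")):
    """Return {field: [values]} for input values missing from known_names."""
    unknown = {}
    for field in fields:
        value = inputs.get(field, "")
        # Fields may hold a single string or a list of strings.
        values = [value] if isinstance(value, str) else list(value)
        missing = [v for v in values if v and v not in known_names]
        if missing:
            unknown[field] = missing
    return unknown
```

An empty dictionary from `find_unknown_entities(config["inputs"], load_entity_names("p3_entities_with_type.csv"))` would indicate that every checked value has an exact match in the file.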



## Examples

In the following examples, all possible configuration fields are specified. Meta-data fields in the ```inputs``` section that you do not need can be left as an empty string (```""```) or an empty list (```[]```). 

_**Example 1: generate a disease signature**_
<details>
  <summary style="font-weight:600">Details</summary>

To generate a signature from specific metadata, use the following configuration. Note that the ```up``` and ```down``` fields are empty lists because they are what you want the model to generate. 
Here, we ask the model to generate a signature for a male human in the 70-90 years age group, in the "lung" tissue, with "EFO_0000768" (idiopathic pulmonary fibrosis).

```json
{
    "inputs": {
        "instruction": ["age_group2diff2age_group", "disease2diff2disease"], 
        "tissue": ["lung"],
        "age": "",
        "cell": "", 
        "efo": "EFO_0000768", 
        "datatype": "", "drug": "", "dose": "", "time": "", "case": ["70.0-80.0", "80.0-90.0"], "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [], "down": []
    }, 
    "mode": "meta2diff", 
    "parameters": {
        "temperature": 0.8, "top_p": 0.2, "top_k": 3550, "n_next_tokens": 50, "random_seed": 137
    }
}
```

See the corresponding P3GPT output:
```json
{
  "output": {
    "up": [["PTGDR2", "CABYR", "MGAM", "TMED9", "SHOX2", "MAT1A", "MUC5AC", "GASK1B", "CYP1A2", "RP11-266K4.9", ...]], // generated list of up-regulated genes
    "down": [["MB", "OR10V1", "OR51H1", "GOLGA6L10", "OR6M1", "CDX4", "OR4C45", "SPRR2A", "SPDYE9", "GBX2", "ATP4B", ...]] // generated list of down-regulated genes
  },
  "mode": "meta2diff", // generation mode we specified
  "message": "Done!",
  "input": "[BOS]<age_group2diff2age_group><disease2diff2disease><tissue>lung </tissue><cell></cell><efo>EFO_0000768 </efo><datatype></datatype><drug></drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>", // actual input prompt for the model
  "random_seed": 137
}
```
</details>
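
The `input` string returned in Example 1 suggests how a configuration is flattened into a prompt: instruction tags first, then one tag pair per metadata field, with each value followed by a space. The sketch below reproduces that string for illustration only. It is inferred from the example output, not taken from the handler code; `render_prompt` and `PROMPT_FIELDS` are hypothetical names, and the real handler may treat fields such as `age`, `up`, and `down` differently.

```python
# Field order as it appears in the Example 1 "input" string.
PROMPT_FIELDS = [
    "tissue", "cell", "efo", "datatype", "drug", "dose", "time",
    "case", "control", "dataset_type", "gender", "species",
]


def render_prompt(inputs):
    """Rebuild a prompt string in the layout shown in Example 1."""
    parts = ["[BOS]"]
    # Each instruction becomes a standalone tag.
    for instruction in inputs["instruction"]:
        parts.append(f"<{instruction}>")
    for field in PROMPT_FIELDS:
        value = inputs.get(field, "")
        # Accept both a single string and a list of strings.
        values = [value] if isinstance(value, str) else list(value)
        # Every non-empty value is followed by one space, as in the example.
        text = "".join(f"{v} " for v in values if v)
        parts.append(f"<{field}>{text}</{field}>")
    return "".join(parts)
```

Calling `render_prompt` on the `inputs` section of Example 1 yields the same string as the `input` field of its output, which makes it a convenient way to preview a prompt before sending a request.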

_**Example 2: generate an aging signature**_
<details>
  <summary style="font-weight:600">Details</summary>
  
Now, let's generate a signature for the whole blood of a healthy male human in the 40-50 years age group.
Note that we expect a signature for a healthy human here, which is why ```efo``` is set to an empty string (```""```).

```json
{
    "inputs": {
        "instruction": ["age_group2diff2age_group"],
        "tissue": ["whole blood"],
        "age": "",
        "cell": "",
        "efo": "",
        "datatype": "", "drug": "", "dose": "", "time": "", "case": "40.0-50.0", "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [],
        "down": []
    },
    "mode": "meta2diff",
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.2,
        "top_k": 3550,
        "n_next_tokens": 50,
        "random_seed": 137
    }
}

```

P3GPT's output:
```json
{
  "output": {
    "up": [["IER3", "APOC2", "EDNRB", "JAKMIP2", "BACE2", ... ]],
    "down": [["TBL1Y", "TDP1", "PLPP4", "CPEB1", "ITPR3", ... ]] 
  },
  "mode": "meta2diff",
  "message": "Done!",
  "input": "[BOS]<age_group2diff2age_group><tissue>whole blood </tissue><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case>40.0-50.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>",
  "random_seed": 137
}
```
</details>

---

## Multi-Modality
By default, all tasks with a signature in the input prompt are executed with multimodal features. For each gene in the up- and down-regulated lists, P3GPT pulls the embeddings from the Knowledge Graph and Text modality mappers. The embeddings are then averaged to obtain one embedding per modality and per gene list (4 averaged embeddings in total).
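
The averaging step can be illustrated with a short sketch. It is not the model's implementation: the lookup tables `kg_lookup` and `text_lookup` are hypothetical stand-ins for the Knowledge Graph and Text mappers, embeddings are plain Python lists, and genes missing from a lookup are skipped, which is an assumption.

```python
def mean_embedding(genes, lookup):
    """Average the embeddings of the genes found in `lookup`.

    Returns None when no gene has an embedding (assumed behavior).
    """
    vectors = [lookup[gene] for gene in genes if gene in lookup]
    if not vectors:
        return None
    dim = len(vectors[0])
    # Element-wise mean across all gene vectors.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def modality_embeddings(up, down, kg_lookup, text_lookup):
    """Return the 4 averaged embeddings: 2 modalities x 2 gene lists."""
    return {
        ("knowledge_graph", "up"): mean_embedding(up, kg_lookup),
        ("knowledge_graph", "down"): mean_embedding(down, kg_lookup),
        ("text", "up"): mean_embedding(up, text_lookup),
        ("text", "down"): mean_embedding(down, text_lookup),
    }
```

The result always has four entries, one per combination of modality and gene list, matching the count given above.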

## Cite this model
Please cite the following bioRxiv pre-print if you use P3GPT in your research papers or other published materials:

```
@article {Galkin2024.07.25.605062,
	author = {Galkin, Fedor and Naumov, Vladimir and Pushkov, Stefan and Sidorenko, Denis and Urban, Anatoly and Zagirova, Diana and Alawi, Khadija M and Aliper, Alex and Gumerov, Ruslan and Kalashnikov, Aleksand and Mukba, Sabina and Pogorelskaya, Aleksandra and Ren, Feng and Shneyderman, Anastasia and Tang, Qiuqiong and Xiao, Deyong and Tyshkovskiy, Alexander and Ying, Kejun and Gladyshev, Vadim N. and Zhavoronkov, Alex},
	title = {Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery},
	elocation-id = {2024.07.25.605062},
	year = {2024},
	doi = {10.1101/2024.07.25.605062},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {We present a multimodal multi-species multi-omics multi-tissue transformer for aging research and drug discovery capable of performing multiple tasks such as age prediction across species, target discovery, tissue, sex, and disease sample classification, drug sensitivity prediction, replication of omics response and prediction of biological and phenotypic response to compound treatment. This model combines textual, tabular, and knowledge graph-derived representations of biological experiments to provide insights into molecular-level biological processes. We demonstrate that P3GPT has developed an intuition for the interactions between compounds, pathologies, and gene regulation in the context of multiple species and tissues. In these areas, it outperforms existing LLMs and we highlight its utility in diverse case studies. P3GPT is a general model that may be used as a target identification tool, aging clock, digital laboratory, and scientific assistant. The model is intended as a community resource available open source as well as via a Discord server. Competing Interest Statement: The authors are affiliated with Insilico Medicine, a commercial company developing and using generative artificial intelligence and other next-generation AI technologies and robotics for drug discovery, drug development, and aging research. Utilizing its generative AI platform and a range of deep aging clocks, Insilico Medicine has developed a portfolio of multiple therapeutic programs targeting fibrotic diseases, cancer, immunological diseases, and a range of age-related diseases.},
	URL = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062},
	eprint = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062.full.pdf},
	journal = {bioRxiv}
}

```