
Precious3GPT-Multi-Modal

A multi-modal multi-omics multi-species language model.

  • Developer: Insilico Medicine
  • License: cc-by-nc-4.0
  • Model size: 89.4 million parameters
  • Domain: Biomedical
  • Base architecture: MPT

Run model using endpoint step by step

Step 1 - connect to endpoint


import requests

API_URL = "https://cu2s6lgb4jew3tht.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer hf_XXXX",  # replace hf_XXXX with your own access token
    "Content-Type": "application/json"
}

def query(payload):
    # POST the request configuration as JSON and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

Step 2 - create input for endpoint

import json
with open('./generation-configs/meta2diff.json', 'r') as f:
    config_data = json.load(f)

# prepare request configuration
request_config = {"inputs": config_data, "mode": "meta2diff", "parameters": {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137
}}

How Precious3GPT sees the given request

[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>

Step 3 - send request to endpoint

output = query(request_config)

Endpoint output structure

{
    "output": {
        "up": List,
        "down": List
    },
    "mode": String, // generation mode that was selected
    "message": "Done!", // or an error message
    "input": String // input prompt that was passed to the model
}

Note: For modes that generate compounds, the output contains compounds: List.
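A minimal sketch of unpacking such a response, assuming it has exactly the structure shown above. The helper name `extract_signature` and the sample response are hypothetical, and treating any message other than "Done!" as an error is an assumption rather than documented endpoint behavior.

```python
def extract_signature(output):
    """Return the (up, down) gene lists from an endpoint response dict."""
    # Assumption: any message other than "Done!" signals an error.
    if output.get("message") != "Done!":
        raise RuntimeError(f"Generation failed: {output.get('message')}")
    result = output["output"]
    return result.get("up", []), result.get("down", [])


# Hypothetical response, shaped like the structure above
sample = {
    "output": {"up": [["GENE_A", "GENE_B"]], "down": [["GENE_C"]]},
    "mode": "meta2diff",
    "message": "Done!",
    "input": "[BOS]...",
}
up, down = extract_signature(sample)
```

With a real response, pass the value returned by `query(request_config)` instead of `sample`.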


Run model locally

Requirements: torch==2.0.1 einops==0.7.0 huggingface-hub==0.20.1 transformers==4.35.0

  1. Download the repository https://huggingface.co/insilicomedicine/precious3-gpt-multi-modal

  2. Inside the repository execute:


# init handler
from handler import EndpointHandler
precious3gpt_handler = EndpointHandler(path='./')

import json
with open('./generation-configs/meta2diff.json', 'r') as f:
    config_data = json.load(f)

# prepare request configuration
request_config = {"inputs": config_data, 
                  "mode": "meta2diff", 
                  "parameters": {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137
}}

output = precious3gpt_handler(request_config)

Precious3GPT request configuration

Generation Modes (mode in config)

Choose the appropriate mode based on your requirements:

  1. meta2diff: Generate a signature (lists of up- and down-regulated genes) given meta-data such as tissue, compound, gender, etc.
  2. diff2compound: Predict compounds based on a signature.
  3. meta2diff2compound: Generate signatures given meta-data, then predict compounds based on the generated signatures.

Instruction (inputs.instruction in config)

  1. disease2diff2disease - generate signature for disease / predict disease based on given signature
  2. compound2diff2compound - generate signature for compound / predict compound based on given signature
  3. age_group2diff2age_group - generate signature for age group / predict age group based on signature
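The mode and instruction values listed above can be checked before a request is sent. The following is a sketch only: `check_request` is a hypothetical helper, and the document does not state whether the handler itself rejects other values.

```python
# Values copied from the lists above
MODES = {"meta2diff", "diff2compound", "meta2diff2compound"}
INSTRUCTIONS = {
    "disease2diff2disease",
    "compound2diff2compound",
    "age_group2diff2age_group",
}


def check_request(request_config):
    """Return a list of problems with mode / instruction (empty if none)."""
    problems = []
    mode = request_config.get("mode")
    if mode not in MODES:
        problems.append(f"unknown mode: {mode!r}")
    for name in request_config.get("inputs", {}).get("instruction", []):
        if name not in INSTRUCTIONS:
            problems.append(f"unknown instruction: {name!r}")
    return problems


ok = check_request({
    "mode": "meta2diff",
    "inputs": {"instruction": ["disease2diff2disease"]},
})
bad = check_request({
    "mode": "meta2dif",  # deliberate typo
    "inputs": {"instruction": ["age2diff"]},
})
```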

Other meta-data (inputs. in config)

The full list of available values for each meta-data item is in p3_entities_with_type.csv.

Examples

In the following examples all possible configuration fields are specified. You can leave any meta-data field in the inputs section as an empty string ("") or an empty list ([]).
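One way to avoid typing every empty field by hand is to start from a template and override only what you need. This is a sketch: `make_inputs` is a hypothetical helper, the field names are taken from the example configurations in this section, and the choice of empty list versus empty string for each field is an assumption based on those examples.

```python
import copy

# Field names as they appear in the example configurations
EMPTY_INPUTS = {
    "instruction": [], "tissue": [], "age": "", "cell": "", "efo": "",
    "datatype": "", "drug": "", "dose": "", "time": "", "case": "",
    "control": "", "dataset_type": "", "gender": "", "species": "",
    "up": [], "down": [],
}


def make_inputs(**fields):
    """Return a full inputs dict with every unspecified field left empty."""
    unknown = set(fields) - set(EMPTY_INPUTS)
    if unknown:
        raise KeyError(f"unknown input fields: {sorted(unknown)}")
    inputs = copy.deepcopy(EMPTY_INPUTS)
    inputs.update(fields)
    return inputs


inputs = make_inputs(
    instruction=["disease2diff2disease"],
    tissue=["lung"],
    species="human",
)
```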

Example 1

To generate a signature for specific meta-data, use the following configuration. Note that the up and down fields are empty lists because the model generates them. Here we ask the model to generate a signature for a male human in the 70-90 age group, in lung tissue, with disease EFO_0000768.

{
    "inputs": {
        "instruction": ["age_group2diff2age_group", "disease2diff2disease", "compound2diff2compound"], 
        "tissue": ["lung"],
        "age": "",
        "cell": "", 
        "efo": "EFO_0000768", 
        "datatype": "",
        "drug": "",
        "dose": "",
        "time": "",
        "case": ["70.0-80.0", "80.0-90.0"],
        "control": "",
        "dataset_type": "expression",
        "gender": "m",
        "species": "human",
        "up": [],
        "down": []
    }, 
    "mode": "meta2diff", 
    "parameters": {
        "temperature": 0.8, "top_p": 0.2, "top_k": 3550, "n_next_tokens": 50, "random_seed": 137
    }
}

Here is the output:

{
  "output": {
    "up": [["PTGDR2", "CABYR", "MGAM", "TMED9", "SHOX2", "MAT1A", "MUC5AC", "GASK1B", "CYP1A2", "RP11-266K4.9", ...]], // generated list of up-regulated genes
    "down": [["MB", "OR10V1", "OR51H1", "GOLGA6L10", "OR6M1", "CDX4", "OR4C45", "SPRR2A", "SPDYE9", "GBX2", "ATP4B", ...]] // generated list of down-regulated genes
  },
  "mode": "meta2diff", // generation mode we specified
  "message": "Done!",
  "input": "[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><cell></cell><efo>EFO_0000768 </efo><datatype></datatype><drug></drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>", // actual input prompt for the model
  "random_seed": 137
}
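The "input" string in this output can be reproduced from the inputs section of the configuration. The sketch below is inferred only from the example shown here and is not the handler's actual code: `build_prompt` and `FIELD_ORDER` are hypothetical names, the tag order is copied from the example prompt, and the age field is left out because that prompt shows no tag for it.

```python
# Tag order as it appears in the example "input" string (an assumption)
FIELD_ORDER = [
    "tissue", "cell", "efo", "datatype", "drug", "dose", "time",
    "case", "control", "dataset_type", "gender", "species",
]


def build_prompt(inputs):
    """Serialize an inputs dict into a prompt like the example above."""
    parts = ["[BOS]"]
    # Each instruction becomes its own tag right after [BOS]
    parts += [f"<{name}>" for name in inputs.get("instruction", [])]
    for field in FIELD_ORDER:
        value = inputs.get(field, "")
        # Accept either a single string or a list of strings
        values = value if isinstance(value, list) else ([value] if value else [])
        # Every value is followed by a single space, as in the example
        body = "".join(f"{v} " for v in values)
        parts.append(f"<{field}>{body}</{field}>")
    return "".join(parts)


example_inputs = {
    "instruction": ["age_group2diff2age_group", "disease2diff2disease",
                    "compound2diff2compound"],
    "tissue": ["lung"], "age": "", "cell": "", "efo": "EFO_0000768",
    "datatype": "", "drug": "", "dose": "", "time": "",
    "case": ["70.0-80.0", "80.0-90.0"], "control": "",
    "dataset_type": "expression", "gender": "m", "species": "human",
    "up": [], "down": [],
}
prompt = build_prompt(example_inputs)
```

Building the prompt locally like this can help when checking which meta-data the model will actually receive before spending an endpoint call.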

Example 2

Now let's generate a signature for a healthy male human in the 40-50 age group, in whole blood tissue. Note that we use the disease2diff2disease instruction even though we expect a signature for a healthy human, which is why efo is set to an empty string (""). The configuration below also adds a second instruction: "instruction": ["disease2diff2disease", "age_group2diff2age_group"].

{
    "inputs": {
        "instruction": ["disease2diff2disease", "age_group2diff2age_group"],
        "tissue": ["whole blood"],
        "age": "",
        "cell": "",
        "efo": "",
        "datatype": "",
        "drug": "",
        "dose": "",
        "time": "",
        "case": "40.0-50.0",
        "control": "",
        "dataset_type": "expression",
        "gender": "m",
        "species": "human",
        "up": [],
        "down": []
    },
    "mode": "meta2diff",
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.2,
        "top_k": 3550,
        "n_next_tokens": 50,
        "random_seed": 137
    }
}

Here is the output:

{
  "output": {
    "up": [["IER3", "APOC2", "EDNRB", "JAKMIP2", "BACE2", ... ]],
    "down": [["TBL1Y", "TDP1", "PLPP4", "CPEB1", "ITPR3", ... ]] 
  },
  "mode": "meta2diff",
  "message": "Done!",
  "input": "[BOS]<disease2diff2disease><age_group2diff2age_group><tissue>whole blood </tissue><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case>40.0-50.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>",
  "random_seed": 137
}

Multi-Modality

Multi-modality applies by default in tasks where you pass a signature. For each gene in the up and down lists, the model gets embeddings from the Knowledge Graph and Text NNs. The embeddings are then averaged to obtain one embedding per modality for each gene list (2 modalities x 2 gene lists = 4 averaged embeddings in total).
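The averaging step can be illustrated with a small sketch. Everything here is hypothetical: the function names, the toy two-dimensional embeddings, and the lookup tables standing in for the Knowledge Graph and Text NNs. How the model handles genes without an embedding, and how it consumes the averaged vectors, is not described in this document.

```python
def mean_embedding(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pooled_embeddings(up, down, kg_emb, text_emb):
    """Average per-gene embeddings for each (gene list, modality) pair."""
    return {
        (list_name, modality): mean_embedding([table[g] for g in genes])
        for list_name, genes in (("up", up), ("down", down))
        for modality, table in (("kg", kg_emb), ("text", text_emb))
    }


# Toy lookup tables (hypothetical values)
kg_emb = {"A": [1.0, 3.0], "B": [3.0, 5.0], "C": [0.0, 2.0]}
text_emb = {"A": [0.0, 0.0], "B": [2.0, 4.0], "C": [1.0, 1.0]}

pooled = pooled_embeddings(["A", "B"], ["C"], kg_emb, text_emb)
```

The result has four entries, one per combination of gene list (up, down) and modality (kg, text).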