from openai import OpenAI
import gradio as gr
import os

api_key = os.getenv('API_KEY')
base_url = os.getenv("BASE_URL")

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)


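def extract_pdf_pypdf(pdf_dir):
    """Read the uploaded file as UTF-8 text.

    Despite the name, no PDF parsing happens here: the app expects a Markdown file.
    """
    if not pdf_dir:
        return None  # no file uploaded yet
    try:
        with open(pdf_dir, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        print(f"Error opening file: {e}")
        return None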


def openai_api(messages):
    """Send the chat messages to the model and return the full streamed reply, or None on error."""
    try:
        completion = client.chat.completions.create(
            model="claude-3-5-sonnet-20240620",
            messages=messages,
            temperature=0.1,
            max_tokens=8192,
            stream=True
        )
        # Concatenate streamed deltas; skip chunks with no choices or empty content.
        response = ''.join(
            chunk.choices[0].delta.content or ""
            for chunk in completion
            if chunk.choices
        )
        return response
    except Exception as ex:
        print("API error:", ex)
        return None


def predict(input_text, pdf_file):
    # pdf_file holds the text from the "Extracted Text" box, not a file object.
    if not pdf_file:
        return "Please upload a Markdown file and click \"View Text\" to proceed."

    file_content = pdf_file
    messages = [
        {
            "role": "system",
            "content": "You are an expert in information extraction from scientific literature.",
        },
        {"role": "user", "content": """Provided Text:
    '''
    {{""" + file_content + """}}
    '''
                                        """ + input_text}
    ]
    extract_result = openai_api(messages)

    return extract_result or "Too many users. Please wait a moment!"
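

# A minimal sketch, not part of the original app: the prompt below asks the model
# for a pipe-delimited table, so a caller may want the rows as Python dicts.
# `parse_pipe_table` is a hypothetical helper name; it assumes the first pipe row
# is the header and that a row made only of dashes/colons is the separator.
def parse_pipe_table(markdown_table):
    """Parse a pipe-delimited Markdown table into a list of {header: cell} dicts."""
    rows = []
    header = None
    for line in markdown_table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any text outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, e.g. |-----|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows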


en_1 = """Please read the scientific article provided and extract detailed information about enzymes from a specific organism, focusing on variants or mutants. Your focus should be on data related to the enzyme's activity on substrates at specific concentrations, under certain pH levels and temperatures, and in the presence of different cofactors or cosubstrates at various concentrations. It is essential to identify and record the enzymatic kinetics parameters: Km, Kcat, and Kcat/Km values under these conditions.

Organize all this information into a table with 13 columns titled: Enzyme, Organism, Substrate, Km, Unit_Km, Kcat, Unit_Kcat, Kcat/Km, Unit_Kcat/Km, Commentary[Temp], Commentary[pH], Commentary[Mutant], and Commentary[Cosubstrate].

While performing the tasks, please pay special attention to the following points:
1. Unit retention: Unit_Km, Unit_Kcat, Unit_Kcat/Km should be recorded and output exactly as they appeared in the tables from the Scientific Article Fraction.
2. Scientific Notation: For values in the table that are derived from the article’s headers containing scientific notations, ensure that the actual values entered into the table reflect these notations accordingly. For instance, if an original table specifies 'Kcat/Km × 10^4 (M^-1s^-1)' in table header, then the value entered under 'Kcat/Km' of your table should be '1.4 × 10^4' without any unit if 1.4 was the original figure. Importantly, enter its respective unit 'M^-1s^-1' under 'Unit_Kcat/Km' in your table. Apply this method for each relevant entry, preserving the scientific notation detail as provided in the article. Conversely, for headers not involving scientific notations, simply transcribe values and units as they are, without adding or altering the notation form.
3. Pure Numbers and Units: Please ensure that all numerical values in the columns of 'Km', 'Kcat', and 'Kcat/Km' are entered as pure numbers without any accompanying units. The corresponding units must be placed in their respective 'Unit' columns only, such as 'Unit_Km', 'Unit_Kcat', and 'Unit_Kcat/Km'. This separation of values and units is critical to maintain clarity and consistency in the data representation.
4. Mean Values Only: I need you to include only the mean values, excluding standard deviations or errors, while standard deviations or errors might be indicated after '±' or be wrapped in '()'.
5. Full Forms: In the case that abbreviated or shortened forms are used in the entries of certain tables or other informative text, endeavor to trace back to the full forms of these abbreviations in the Scientific Article Fraction and reflect them in the tables you are organizing.
6. Data Derivation: All data must be derived solely from the unit conversion of the Scientific Article Fraction provided, not from any calculations. For example, do not calculate the Kcat/Km ratio by dividing perceived Kcat data by Km data; only use pre-existing Kcat/Km values from the Scientific Article Fraction.
7. Ensure that each row of the table corresponds to a unique set of conditions and their respective kinetic parameters for the enzyme being measured.


Output the table using the pipe symbol (|) as the delimiter, ensuring each entry is separated by a pipe symbol and properly aligned to maintain the structure of the table. I need you to include only the mean values, excluding standard deviations or errors, while standard deviations or errors might be indicated after '±' or be wrapped in '()'. Include all details and rows in the output, providing a comprehensive extraction of every data point without omissions. Format the complete table data clearly, ensuring that every piece of information is included and no data points are left out. Do not use ellipses or any other form of indication suggesting information is continued elsewhere. The full dataset must be provided as per the structure above, ensuring the integrity and usability of the data for subsequent analyses or applications. Present the complete table data in a clear and organized format in your response, without the need for further confirmation or prompts.

Please pay attention to the pipe format as shown in the example below. This format is for reference only regarding the structure; the content within is not the focus of this instruction.

| Enzyme     | Organism          | Substrate   | Km  | Unit_Km | Kcat | Unit_Kcat | Kcat/Km | Unit_Kcat/Km | Commentary[Temp] | Commentary[pH] | Commentary[Mutant] | Commentary[Cosubstrate] |
|------------|-------------------|-------------|-----|---------|------|-----------|---------|--------------|------------------|----------------|--------------------|-------------------------|
| Enzyme1    | Bacillus subtilis | Substrate_A | 7.3 | mM      | 6.4  | s^-1      | 1.4 × 10^4   | M^-1s^-1     | 37°C             | 5.0            | WT                 | NADP^+                  |
| Enzyme2    | Escherichia coli  | Substrate_B | 5.9 | mM      | 9.8  | s^-1      | 29000   | mM^-1min^-1  | 60°C             | 10.0           | Q176E             | NADPH                   |
| Enzyme3    | Homo sapiens      | Substrate_C | 6.9 | mM      | 15.6 | s^-1      | 43000   | µM^-1s^-1    | 65°C             | 8.0            | T253S             | NAD^+                   |

Structure your responses to allow for seamless concatenation, presenting all tabular data from a scientific article as a single table, even if the original content had multiple tables. Use the full response capacity to maximize data presentation, avoiding summarizations, commentaries, or introductions at the end of each response. The subsequent response should pick up precisely where the preceding one concluded, commencing from the following character, without the necessity to reiterate the table header or the fragmented words. This method ensures the table is presented completely and seamlessly, despite character limit constraints. Please start by outputting the first segment of the table according to these guidelines.
"""


def update_input():
    return en_1


with gr.Blocks(title="Automated Enzyme Kinetics Extractor for Markdown") as demo:
    gr.Markdown(
        '''<h1 align="center"> Automated Enzyme Kinetics Extractor for Markdown</h1>
        <p>How to use:
        <br><strong>1</strong>: Upload your markdown.
        <br><strong>2</strong>: Click "View Text" to preview it.
        <br><strong>3</strong>: Enter your extraction prompt in the input box.
        <br><strong>4</strong>: Click "Generate" to extract, and the extracted information will display below.
        </p>'''
    )
    file_input = gr.File(label="Upload your MD", type="filepath")
    example = gr.Examples(examples=[["./sample.md"]], inputs=file_input)
    with gr.Row():
        extract_button = gr.Button("View Text", variant="primary")

    with gr.Row():
        with gr.Column(scale=1):
            text_output = gr.Textbox(
                label="Extracted Text",
                interactive=True,
                placeholder="Extracted text will appear here...",
                lines=39,
                max_lines=39,  # Maximum number of lines; a scrollbar appears beyond this
                autoscroll=False,  # Do not auto-scroll to the bottom
                show_copy_button=True,
                elem_id="text-output"
            )

    with gr.Column():
        model_input = gr.Textbox(lines=7, value=en_1, placeholder='Enter your extraction prompt here', label='Input Prompt')
        exp = gr.Button("Example Prompt")
        with gr.Row():
            gen = gr.Button("Generate", variant="primary")
            clr = gr.Button("Clear")
        outputs = gr.Markdown(label='Output', value="""| Enzyme     | Organism          | Substrate   | Km  | Unit_Km | Kcat | Unit_Kcat | Kcat/Km | Unit_Kcat/Km | Commentary[Temp] | Commentary[pH] | Commentary[Mutant] | Commentary[Cosubstrate] |
|------------|-------------------|-------------|-----|---------|------|-----------|---------|--------------|------------------|----------------|--------------------|-------------------------|
| Enzyme1    | Bacillus subtilis | Substrate_A | 7.3 | mM      | 6.4  | s^-1      | 1.4 × 10^4   | M^-1s^-1     | 37°C             | 5.0            | WT                 | NADP^+                  |
| Enzyme2    | Escherichia coli  | Substrate_B | 5.9 | mM      | 9.8  | s^-1      | 29000   | mM^-1min^-1  | 60°C             | 10.0           | Q176E             | NADPH                   |
| Enzyme3    | Homo sapiens      | Substrate_C | 6.9 | mM      | 15.6 | s^-1      | 43000   | µM^-1s^-1    | 65°C             | 8.0            | T253S             | NAD^+                   |

""")
    extract_button.click(extract_pdf_pypdf, inputs=file_input, outputs=text_output)
    exp.click(update_input, outputs=model_input)
    gen.click(fn=predict, inputs=[model_input, text_output], outputs=outputs)
    clr.click(fn=lambda: [gr.update(value=""), gr.update(value="")], inputs=None, outputs=[model_input, outputs])


demo.launch()