---
library_name: transformers
language:
  - en
base_model: microsoft/phi-2
pipeline_tag: text-generation
tags: 
- medical
- pubmed
- clinical trials
- scientific literature
widget:
 - text: "'###Unstruct:\nKawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries. Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases. However, the role of IL-41 in KD is unclear. The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease.\nA total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. Plasma concentrations of IL-41 were determined by ELISA. Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis. Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD.\nOur results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC. Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity.\nOur study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.\n###Struct:\n"

---
![](ft_sections.png)

A small language model designed for scientific research applications. Phi-2 was fine-tuned to analyze randomized clinical trial abstracts and classify each sentence into one of four key sections: Background, Methods, Results, and Conclusion.
This model helps researchers understand and organize key information from clinical studies.

## Model Details


The publication rate of Randomized Controlled Trials (RCTs) is consistently increasing,
with more than 1 million RCTs already published. 
Approximately half of these publications are listed in PubMed,
posing a significant data-volume challenge for medical researchers seeking specific information.

When searching for prior studies, such as for writing systematic reviews, 
researchers often skim through abstracts to quickly determine if the papers meet their criteria of interest. 
This task is facilitated when abstracts are structured, meaning the text within an abstract is organized under semantic headings 
like objective, method, result, and conclusion.
However, more than half of the RCT abstracts published are unstructured, complicating the rapid identification of relevant information.

This model classifies each sentence of an abstract into a corresponding 'canonical' section, greatly accelerating the process of locating the desired information.
This classification not only aids researchers but may also benefit other downstream applications, including automatic text summarization, information extraction, and information retrieval.


- **Developed by:** Salvatore Saporito
- **Language(s) (NLP):** English
- **Finetuned from model:** https://huggingface.co/microsoft/phi-2

### Model Sources

- **Repository:** Coming soon

## Uses

Automatic identification of sections in (randomized clinical trial) abstracts.

## How to Get Started with the Model

Prompt Format:

    '''
    ###Unstruct:
    {abstract}
    ###Struct:
    '''
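
The template can be filled in with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, not part of the released code, and the trailing newline after `###Struct:` is an assumption taken from the widget example in this card's metadata.

```python
def build_prompt(abstract: str) -> str:
    """Wrap an unstructured abstract in the prompt template."""
    # Strip leading/trailing whitespace so the markers sit on their own lines.
    return f"###Unstruct:\n{abstract.strip()}\n###Struct:\n"


# Made-up one-sentence abstract, for illustration only.
prompt = build_prompt("Aspirin was compared with placebo in 100 adults.")
print(prompt)
```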


Usage:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers import BitsAndBytesConfig
    from peft import PeftModel

    # Model identifiers
    tokenizer_name = "microsoft/phi-2"
    basemodel_name = "microsoft/phi-2"
    model_id = "SaborDay/Phi2_RCT1M-ft-heading"

    # Load tokenizer and base model weights
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

    model = AutoModelForCausalLM.from_pretrained(basemodel_name, device_map='auto', trust_remote_code=True)

    # Load adapter
    fine_tuned_model = PeftModel.from_pretrained(model, model_id)

    # Build the prompt (replace the placeholder with an unstructured abstract)
    abstract = "..."
    prompt = f"###Unstruct:\n{abstract}\n###Struct:\n"

    # Tokenize and move the tensors to the device that holds the model
    inputs = tokenizer(prompt,
                       return_tensors="pt",
                       return_attention_mask=True,
                       padding=False,
                       truncation=True).to(model.device)

    # Run inference
    outputs = fine_tuned_model.generate(**inputs, max_length=1000)

    # Decode output
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(text)
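
The decoded text contains the prompt followed by the structured answer. A possible way to split that answer into sections is sketched below. `split_sections` and `HEADINGS` are hypothetical names, and the heading list is an assumption based only on the example output in this card; the model may emit other headings.

```python
import re

# Headings that appear in this card's example output (assumption: extend as needed).
HEADINGS = ("BACKGROUND", "METHODS", "RESULTS", "CONCLUSIONS")


def split_sections(generated: str) -> dict[str, str]:
    """Map each heading found in the generated text to the text that follows it."""
    # Keep only the answer, i.e. everything after the last '###Struct:' marker.
    answer = generated.rsplit("###Struct:", 1)[-1]
    pattern = re.compile(r"\b(" + "|".join(HEADINGS) + r"):\s*")
    parts = pattern.split(answer)
    # parts looks like [preamble, heading1, text1, heading2, text2, ...]
    return {h: t.strip() for h, t in zip(parts[1::2], parts[2::2])}


# Made-up generation, for illustration only.
sample = ("###Unstruct:\nWe tested drug X. Pain fell.\n###Struct:\n"
          "BACKGROUND: We tested drug X. RESULTS: Pain fell.")
for heading, body in split_sections(sample).items():
    print(f"{heading}: {body}")
```

If a heading occurs more than once, this sketch keeps only the last occurrence.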
    
Usage (with quantization):

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_quant_type='nf4',
                                    bnb_4bit_compute_dtype=torch.bfloat16,
                                    bnb_4bit_use_double_quant=True)
    [...]

    model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)

    [...]
    
    fine_tuned_model = PeftModel.from_pretrained(... , quantization_config=bnb_config)


Example:
Application on unseen data 

        PROMPT: '###Unstruct:\nKawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries.
        Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases. 
        However, the role of IL-41 in KD is unclear.
        The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease.
        A total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. Plasma concentrations of IL-41 were determined by ELISA. 
        Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis.
        Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD.
        Our results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC. 
        Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity.
        Our study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.
        ###Struct:'

        Answer Phi2_RCT1M-ft-heading:

        BACKGROUND: Kawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries. 
        Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases. 
        However, the role of IL-41 in KD is unclear.
        The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease. 
        METHODS: A total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. 
        Plasma concentrations of IL-41 were determined by ELISA.
        Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis.
        Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD. 
        RESULTS: Our results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC.
        Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity. 
        CONCLUSIONS: Our study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.
    
## Training Details

### Training Data

50k randomly sampled randomized clinical trial abstracts with date of publication within [1970-2023].
Abstracts were retrieved from MEDLINE using Biopython.
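
A retrieval step of this kind could look roughly like the sketch below. It is not the author's script: the query string, the function names, and the `retmax` default are assumptions, and the random sampling step is not shown. It relies on Biopython's `Entrez` and `Medline` modules, which need `pip install biopython` and network access.

```python
def build_query(start_year: int = 1970, end_year: int = 2023) -> str:
    """PubMed query for RCT records within a publication-year range (assumed form)."""
    return f"randomized controlled trial[pt] AND {start_year}:{end_year}[dp]"


def fetch_abstracts(query: str, email: str, retmax: int = 100) -> list[dict]:
    """Return MEDLINE records (dicts with keys such as 'PMID' and 'AB')."""
    from Bio import Entrez, Medline  # imported lazily; requires biopython

    Entrez.email = email  # NCBI asks callers to identify themselves

    # Search for matching PMIDs.
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    if not pmids:
        return []

    # Fetch the records in MEDLINE text format and keep those with an abstract.
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    records = [rec for rec in Medline.parse(handle) if "AB" in rec]
    handle.close()
    return records


print(build_query())
```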

### Training Procedure 

- Generation of an (unstructured, structured) pair from each structured abstract.
- Generation of a dedicated prompt from each pair for causal language modelling (`CAUSAL_LM`).
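
The two steps can be sketched as follows, assuming a structured abstract is available as a list of (heading, text) tuples. The function names, the newline separators, and the uppercase heading format are assumptions modelled on the examples in this card, not the actual training code.

```python
Section = tuple[str, str]  # (heading, text)


def make_pair(sections: list[Section]) -> tuple[str, str]:
    """Build the (unstructured, structured) versions of one abstract."""
    # Unstructured: the section texts with their headings removed.
    unstructured = "\n".join(text for _, text in sections)
    # Structured: each text prefixed by its heading in upper case.
    structured = "\n".join(f"{heading.upper()}: {text}" for heading, text in sections)
    return unstructured, structured


def make_training_text(sections: list[Section]) -> str:
    """Concatenate prompt and target into a single causal-LM training string."""
    unstructured, structured = make_pair(sections)
    return f"###Unstruct:\n{unstructured}\n###Struct:\n{structured}"


# Made-up abstract, for illustration only.
example = [("Background", "Drug X may reduce pain."),
           ("Methods", "We randomized 100 adults.")]
print(make_training_text(example))
```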


#### Training Hyperparameters

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_quant_type='nf4',
                                    bnb_4bit_compute_dtype=torch.bfloat16,
                                    bnb_4bit_use_double_quant=True)

#### Training Run metrics

    
[Run details on W&B](https://wandb.ai/salvatore-saporito-phd/huggingface/runs/5fcnxthk?nw=nwusersalvatoresaporitophd)

## Evaluation

The model was evaluated over a subset of previously considered abstracts [20k RCT](https://github.com/Franck-Dernoncourt/pubmed-rct/tree/master/PubMed_20k_RCT).

Each abstract in the evaluation sample was verified, using its PMID, not to be present in the training set.
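
A PMID-based overlap check of this kind can be sketched as below. The record layout (dicts with a `"pmid"` key) and the function name are assumptions for illustration only.

```python
def drop_training_overlap(eval_records: list[dict], train_pmids: list[str]) -> list[dict]:
    """Keep only evaluation records whose PMID is absent from the training set."""
    seen = set(train_pmids)  # set lookup keeps the filter linear in the sample size
    return [rec for rec in eval_records if rec["pmid"] not in seen]


# Made-up PMIDs, for illustration only.
train = ["111", "222"]
candidates = [{"pmid": "222", "abstract": "..."},
              {"pmid": "333", "abstract": "..."}]
clean = drop_training_overlap(candidates, train)
print([rec["pmid"] for rec in clean])
```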


### Testing Data, Factors & Metrics

#### Testing Data

10k randomly sampled RCT abstracts with date of publication within [1970-2023].

#### Metrics

[WIP]

## Technical Specifications

### Model Architecture and Objective

    LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=['q_proj','k_proj','v_proj','dense','fc1','fc2'], 
        bias="none",
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
        )
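
To give a feel for the size of this adapter, the sketch below counts the LoRA weights implied by the configuration. It assumes the published Phi-2 dimensions (hidden size 2560, MLP size 10240, 32 decoder layers) and that the six target names match exactly one linear layer each per decoder block; both are assumptions, not figures reported by the training run.

```python
R, LORA_ALPHA = 16, 32
HIDDEN, MLP, LAYERS = 2560, 10240, 32  # assumed Phi-2 dimensions

# (in_features, out_features) of each targeted linear layer in one decoder block.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "dense": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, MLP),
    "fc2": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r), so r * (in + out) weights."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"per layer: {per_layer:,}  total: {total:,}  scaling: {scaling}")
```

Under these assumptions the adapter holds roughly 23.6 million trainable weights.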

### Compute Infrastructure

#### Hardware

1 x RTX4090 - 24 GB

#### Software

    pip install torch einops transformers bitsandbytes accelerate peft 

## Model Card Contact

Salvatore Saporito - salvatore.saporito.phd@gmail.com

## References

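- PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts: https://arxiv.org/abs/1710.06071
- LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685
- Textbooks Are All You Need II: phi-1.5 technical report: https://arxiv.org/pdf/2309.05463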