Create README.md
README.md
# Model Details

#### Model Name: NumericBERT

#### Model Type: Transformer

#### Architecture: BERT

#### Training Method: Masked Language Modeling (MLM)

#### Training Data: MIMIC-IV lab values data

#### Training Hyperparameters:

- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Masking Rate:** 20%
- **Tokenization:** Custom numeric-to-text mapping using the `TextEncoder` class

### Text Encoding Process

**Overview:** Non-negative integers are mapped to uppercase letters, so each numerical value is expressed as a sequence of letters.

**Normalization and Binning:**
- **Method:** Values are log-normalized and then split into 10 bins.
- **Representation:** Each bin is represented by a letter (A-J).
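
The binning step can be sketched as follows. This is a minimal illustration, not the actual `TextEncoder` implementation: the function name `encode_value`, the use of `log1p`, the equal-width bins, and the per-column `min_value`/`max_value` arguments are all assumptions, since the card does not specify them.

```python
import math

BIN_LETTERS = "ABCDEFGHIJ"  # one letter per bin (A-J), as described above


def encode_value(value, min_value, max_value):
    """Map a non-negative lab value to a bin letter.

    Assumption: log1p scaling followed by equal-width bins between the
    log-scaled minimum and maximum. The real TextEncoder may differ.
    """
    if value < 0:
        raise ValueError("lab values are expected to be non-negative")
    log_min, log_max = math.log1p(min_value), math.log1p(max_value)
    if log_max == log_min:
        return BIN_LETTERS[0]
    # Scale to [0, 1], clamping values outside the given range.
    scaled = (math.log1p(value) - log_min) / (log_max - log_min)
    scaled = min(max(scaled, 0.0), 1.0)
    # The top edge (scaled == 1.0) is assigned to the last bin.
    index = min(int(scaled * len(BIN_LETTERS)), len(BIN_LETTERS) - 1)
    return BIN_LETTERS[index]
```

Under these assumptions, the smallest value in a column maps to `A` and the largest to `J`; values outside the given range are clamped to the nearest bin.
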

### Token Construction

- **Format:** `<<lab_value_bin>>`
- **Example:** For a lab value whose normalized value falls in bin 'C', the token might be `C`.
- **Columns Used:** 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
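
One way to turn a row of binned lab values into model input is sketched below. The helper `row_to_text`, the fixed column order, and the space-separated output are assumptions made for illustration; the card does not say how tokens from different columns are combined.

```python
# Column names as listed in the model card.
LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]


def row_to_text(binned_row):
    """Join the bin letters of one row into a space-separated string.

    Assumption: every column is present and already encoded as a
    single letter A-J; the column order follows LAB_COLUMNS.
    """
    missing = [col for col in LAB_COLUMNS if col not in binned_row]
    if missing:
        raise KeyError(f"missing lab columns: {missing}")
    return " ".join(binned_row[col] for col in LAB_COLUMNS)


example_row = {
    "Bic": "C", "Crt": "A", "Pot": "F", "Sod": "E",
    "Ure": "B", "Hgb": "D", "Plt": "J", "Wbc": "H",
}
example_text = row_to_text(example_row)
```

A fixed column order means each token position always refers to the same lab test, which is one simple way to let the model tell the columns apart.
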

### Training Data Preprocessing

- **Column Selection:** Numerical values are taken from the selected lab value columns.
- **Text Encoding:** Numeric values are encoded into text using the process described above.
- **Masking:** 20% of the data is randomly masked during training.
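
The masking step might look roughly like the sketch below. It is a simplified illustration: the function `mask_tokens` is hypothetical, and it always substitutes `[MASK]`, whereas standard BERT-style MLM pipelines often replace a share of the selected tokens with random tokens or leave them unchanged. The card does not say which variant was used.

```python
import random

MASK_TOKEN = "[MASK]"  # BERT's conventional mask token


def mask_tokens(tokens, mask_rate=0.2, seed=None):
    """Randomly mask tokens and return (masked_tokens, labels).

    Each token is masked independently with probability mask_rate.
    labels holds the original token at masked positions and None
    elsewhere, so a loss would only be computed on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

Because each token is masked independently, a short sequence will not contain exactly 20% masked positions; the rate only holds on average over the dataset.
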

### Model Output

- **Description:** Outputs predictions for masked values during training.
- **Format:** Contains the encoded text representing the predicted lab values.
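
Since the output is a bin letter rather than a number, a predicted letter can only be mapped back to a range of values. The sketch below shows one way to do this, assuming `log1p` scaling with 10 equal-width bins between a column's minimum and maximum; the function `decode_bin` and its parameters are hypothetical and not part of the published model.

```python
import math

BIN_LETTERS = "ABCDEFGHIJ"


def decode_bin(letter, min_value, max_value):
    """Return the (low, high) value range covered by a bin letter.

    Assumption: bins are equal-width in log1p space between
    min_value and max_value, so the inverse uses expm1.
    """
    index = BIN_LETTERS.index(letter)  # raises ValueError for unknown letters
    log_min, log_max = math.log1p(min_value), math.log1p(max_value)
    width = (log_max - log_min) / len(BIN_LETTERS)
    low = math.expm1(log_min + index * width)
    high = math.expm1(log_min + (index + 1) * width)
    return low, high
```

Under these assumptions the bins are narrow for small values and wide for large ones, so the precision of a decoded prediction depends on which bin it falls in.
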

### Limitations and Considerations

- **Numeric Data Representation:** Mapping each value to one of 10 letter bins discards precision, so the custom text representation may not capture fine-grained differences in the original numeric data.
- **Training Data Source:** Performance may be influenced by the characteristics and biases inherent in the MIMIC-IV dataset.
- **Generalizability:** The model's effectiveness outside the context of the training dataset is not guaranteed.

### Contact Information

- **Email:** davidres@mit.edu
- **Name:** David Restrepo
- **Affiliation:** MIT Critical Data - MIT