Model Details
Model Name: NumericBERT
Model Type: Transformer
Architecture: BERT
Training Method: Masked Language Modeling (MLM)
Training Data: Laboratory value data from the MIMIC-IV dataset
Training Hyperparameters:
- Optimizer: AdamW
- Learning Rate: 5e-5
- Masking Rate: 20%
- Tokenization: Custom numeric-to-text mapping using the TextEncoder class
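A minimal sketch of how these hyperparameters could be wired together with the Hugging Face Transformers Trainer (whose default optimizer is AdamW at a 5e-5 learning rate). The checkpoint and tokenizer names, toy corpus, batch size, and epoch count are placeholders not specified in this card; the released model uses the custom TextEncoder vocabulary instead.

```python
from datasets import Dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder tokenizer/weights; the released NumericBERT vocabulary
# (built from the TextEncoder output) may differ.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Toy corpus of already-encoded lab-value sequences (see the encoding
# section below); real training uses the preprocessed MIMIC-IV data.
corpus = Dataset.from_dict({"text": ["E C D F C G E F", "B H D E C F G A"]})
corpus = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
    remove_columns=["text"],
)

# 20% of tokens are randomly masked, matching the masking rate above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.20
)

# AdamW with a 5e-5 learning rate matches the hyperparameters above;
# batch size and epoch count are assumptions (not stated in this card).
args = TrainingArguments(
    output_dir="numericbert-mlm",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=corpus).train()
```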
Text Encoding Process
Overview: Numeric lab values are mapped to non-negative integer bin indices, which are then converted into uppercase letter-based representations, allowing numerical values to be expressed as sequences of letters.
Normalization and Binning:
- Method: Log normalization and splitting into 10 bins.
- Representation: Each bin is represented by a letter (A-J).
Token Construction:
- Format: <<lab_value_bin>>
- Example: For a lab value whose normalized value falls in bin 'C', the token might be 'C'.
- Columns Used: 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
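The binning step can be illustrated with a short sketch. This is a hypothetical re-implementation, not the released TextEncoder class: the function name, the log1p transform, and the use of per-column min/max scaling are assumptions; only the "log normalization into 10 bins mapped to A-J" behavior comes from the description above.

```python
import numpy as np

def encode_value(value, vmin, vmax, n_bins=10):
    """Log-normalize a lab value and map it to one of n_bins letter bins (A-J)."""
    # Log-normalize into [0, 1] using the column's observed value range.
    norm = (np.log1p(value) - np.log1p(vmin)) / (np.log1p(vmax) - np.log1p(vmin))
    # Clip to the valid range and pick a bin index between 0 and n_bins - 1.
    bin_idx = int(np.clip(norm * n_bins, 0, n_bins - 1))
    # Bin 0 -> 'A', bin 1 -> 'B', ..., bin 9 -> 'J'.
    return chr(ord("A") + bin_idx)

# Example: a creatinine ('Crt') value of 1.2 with an observed range of
# 0.2-15.0 lands in bin index 2, i.e. the token 'C' (illustrative numbers).
print(encode_value(1.2, vmin=0.2, vmax=15.0))  # -> 'C'
```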
Training Data Preprocessing
- Column Selection: Numerical values are taken from the selected lab-value columns listed above.
- Text Encoding: Numeric values are encoded into text using the process described above (see the sketch after this list).
- Masking: 20% of the encoded tokens are randomly masked during training.
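A hypothetical end-to-end preprocessing sketch for turning one row of the selected lab columns into an encoded text sequence. The per-column (min, max) ranges, the log1p transform, and the use of one positional letter token per column are assumptions; the card does not pin these details down.

```python
import numpy as np
import pandas as pd

LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]

def to_letter(value, vmin, vmax, n_bins=10):
    # Same log-normalize-and-bin step as in the encoding sketch above.
    norm = (np.log1p(value) - np.log1p(vmin)) / (np.log1p(vmax) - np.log1p(vmin))
    return chr(ord("A") + int(np.clip(norm * n_bins, 0, n_bins - 1)))

def encode_row(row, ranges):
    # One letter token per lab column, in a fixed column order; the order is
    # what identifies each lab value in this sketch (an assumption).
    return " ".join(to_letter(row[c], *ranges[c]) for c in LAB_COLUMNS)

# Toy values and per-column (min, max) ranges; real values come from MIMIC-IV.
labs = pd.DataFrame([{"Bic": 24, "Crt": 1.2, "Pot": 4.1, "Sod": 140,
                      "Ure": 18, "Hgb": 13.5, "Plt": 250, "Wbc": 7.2}])
ranges = {"Bic": (5, 50), "Crt": (0.2, 15), "Pot": (2, 9), "Sod": (110, 170),
          "Ure": (2, 200), "Hgb": (3, 20), "Plt": (5, 1000), "Wbc": (0.1, 100)}

texts = labs[LAB_COLUMNS].apply(encode_row, axis=1, ranges=ranges)
print(texts.iloc[0])  # -> "G C E F E H H E" for the toy row above
```

The resulting text sequences are what the masking step operates on during MLM training.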
Model Output
- Description: Outputs predictions for masked values during training.
- Format: Contains the encoded text representing the predicted lab values.
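At inference time the same masked-prediction mechanism can be queried directly. A hedged sketch using the Hugging Face fill-mask pipeline follows; the checkpoint name is a placeholder and the token layout matches the assumptions of the preprocessing sketch above, so the predictions shown are purely illustrative.

```python
from transformers import pipeline

# Placeholder checkpoint; the released NumericBERT weights and vocabulary
# may differ, so the outputs here are illustrative only.
fill = pipeline("fill-mask", model="bert-base-uncased")

# Mask the last lab value in an encoded sequence and read the top guesses.
sequence = f"G C E F E H H {fill.tokenizer.mask_token}"
for pred in fill(sequence, top_k=3):
    # Each prediction carries the decoded token (ideally a bin letter A-J)
    # and a confidence score.
    print(pred["token_str"], round(pred["score"], 3))
```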
Limitations and Considerations
- Numeric Data Representation: Discretizing continuous lab values into 10 letter bins loses precision, so the custom text representation may not capture the intricacies of the original numeric data.
- Training Data Source: Performance may be influenced by the characteristics and biases inherent in the MIMIC IV dataset.
- Generalizability: The model's effectiveness outside the context of the training dataset is not guaranteed.
Contact Information
- Email: davidres@mit.edu
- Name: David Restrepo
- Affiliation: MIT Critical Data, Massachusetts Institute of Technology