# Model Details

#### Model Name: NumericBERT

#### Model Type: Transformer

#### Architecture: BERT

#### Training Method: Masked Language Modeling (MLM)

#### Training Data: MIMIC IV Lab values data

#### Training Hyperparameters:

- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Masking Rate:** 20%
- **Tokenization:** Custom numeric-to-text mapping using the TextEncoder class

### Text Encoding Process

**Overview:** Non-negative integers are converted into uppercase letter-based representations, allowing numerical values to be expressed as sequences of letters.

**Normalization and Binning:**

- **Method:** Log normalization and splitting into 10 bins.
- **Representation:** Each bin is represented by a letter (A-J).

### Token Construction:

- **Format:** `<>`
- **Example:** For a lab value with a normalized value in bin 'C', the token might be `C`.
- **Columns Used:** 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.

### Training Data Preprocessing

- **Column Selection:** Numerical values from selected lab values.
- **Text Encoding:** Numeric values are encoded into text using the process described above.
- **Masking:** 20% of the data is randomly masked during training.

### Model Output

- **Description:** Outputs predictions for masked values during training.
- **Format:** Contains the encoded text representing the predicted lab values.

### Limitations and Considerations

- **Numeric Data Representation:** The custom text representation may have limitations in capturing the intricacies of the original numeric data.
- **Training Data Source:** Performance may be influenced by the characteristics and biases inherent in the MIMIC IV dataset.
- **Generalizability:** The model's effectiveness outside the context of the training dataset is not guaranteed.

### Contact Information

- **Email:** davidres@mit.edu
- **Name:** David Restrepo
- **Affiliation:** MIT Critical Data - MIT
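
### Illustrative Encoding Sketch

The log normalization and 10-bin lettering described under Text Encoding Process can be sketched roughly as follows. This is not the `TextEncoder` class itself, whose implementation is not shown in this card. The helper name `encode_value`, the use of `log1p`, the min-max scaling against a per-column range, and the equal-width bins are all assumptions made for illustration.

```python
import math

BIN_LETTERS = "ABCDEFGHIJ"  # one letter per bin (A-J), as stated in the card


def encode_value(value, col_min, col_max, n_bins=10):
    """Map a non-negative lab value to a bin letter.

    Hypothetical helper: log1p scaling and equal-width bins over the
    column's observed range are assumptions, not the card's exact method.
    """
    if value < 0:
        raise ValueError("expected a non-negative value")

    # Log-normalize the value and the column range (log1p keeps 0 valid).
    lo, hi = math.log1p(col_min), math.log1p(col_max)
    if hi == lo:
        # Degenerate range: every value falls into the first bin.
        return BIN_LETTERS[0]

    # Scale to [0, 1], clamping values outside the observed range.
    scaled = (math.log1p(value) - lo) / (hi - lo)
    scaled = min(max(scaled, 0.0), 1.0)

    # Equal-width bins; the maximum value is folded into the last bin.
    index = min(int(scaled * n_bins), n_bins - 1)
    return BIN_LETTERS[index]


if __name__ == "__main__":
    # Illustrative range only; real ranges would come from the training data.
    for v in (0, 3, 100):
        print(v, encode_value(v, col_min=0, col_max=100))
```

Under these assumptions, the smallest value in a column maps to `A` and the largest to `J`. The log step compresses the long right tail that is common in lab measurements before the bins are assigned.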