Clinical Bias NER Model

This is a Named Entity Recognition (NER) model trained on clinical text data to detect biased language. The model identifies named entities in text, specifically mentions of patient groups and conditions, and marks them as potentially biased.

Examples

Below are some examples of text that you can use to test the model (a quick way to run them is sketched right after the listing):

[
    {
        "input_text": "The patient is a 50-year-old man, poor looks and malnourished.",
        "expected_output": [
            {
                "word": "poor",
                "label": "BIAS"
            },
            {
                "word": "malnourished",
                "label": "BIAS"
            }
        ]
    },
    {
        "input_text": "The patient is a 50-year poor, take drugs and has aggressive behavior.",
        "expected_output": [
            {
                "word": "poor",
                "label": "BIAS"
            },
            {
                "word": "aggressive",
                "label": "BIAS"
            }
        ]
    }
]
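
One quick way to run these examples is the transformers token-classification pipeline. The sketch below is a convenience illustration, not the card's reference code (a step-by-step usage example appears later): it simply prints the words the model flags as BIAS so they can be compared against the expected output above.

from transformers import pipeline

# Load the model through the high-level pipeline API.
# aggregation_strategy="simple" merges sub-word pieces back into whole words.
ner = pipeline("token-classification",
               model="shainaraza/clinical-bias-ner",
               aggregation_strategy="simple")

examples = [
    "The patient is a 50-year-old man, poor looks and malnourished.",
    "The patient is a 50-year poor, take drugs and has aggressive behavior.",
]

for text in examples:
    # Keep only the spans the model labels as BIAS
    flagged = [ent["word"] for ent in ner(text) if ent["entity_group"] == "BIAS"]
    print(text)
    print("Flagged as BIAS:", flagged)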

Model Details

The model was trained on a clinical notes dataset by fine-tuning the distilbert-base-uncased transformer. Fine-tuning ran for 3 epochs with a batch size of 8 on Google Colab.
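
The training script is not published with this card; as a rough sketch of the reported setup (distilbert-base-uncased, 3 epochs, batch size 8), fine-tuning could look like the following. The toy_records list is a stand-in that only illustrates the word/label annotation format, since the annotated clinical notes are not distributed here.

import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Label scheme described in this card: O = not biased, BIAS = potentially biased
label2id = {"O": 0, "BIAS": 1}
id2label = {v: k for k, v in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

# Toy records standing in for the annotated clinical notes (not distributed here):
# each record is a list of words plus one label per word.
toy_records = [
    (["The", "patient", "is", "poor", "and", "malnourished", "."],
     ["O", "O", "O", "BIAS", "O", "BIAS", "O"]),
    (["The", "patient", "has", "aggressive", "behavior", "."],
     ["O", "O", "O", "BIAS", "O", "O"]),
]

class BiasTokenDataset(torch.utils.data.Dataset):
    """Tokenizes word-level records and aligns the labels to sub-word tokens."""
    def __init__(self, records):
        self.examples = []
        for words, labels in records:
            enc = tokenizer(words, is_split_into_words=True, truncation=True)
            # Special tokens get label -100 so the loss function ignores them
            enc["labels"] = [-100 if wid is None else label2id[labels[wid]]
                             for wid in enc.word_ids()]
            self.examples.append(dict(enc))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

# Hyperparameters reported in this card: 3 epochs, batch size 8
training_args = TrainingArguments(
    output_dir="clinical-bias-ner-demo",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=BiasTokenDataset(toy_records),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()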

The model tags tokens with one of two labels: O (non-biased) and BIAS (potentially biased). The BIAS labels were annotated manually by reviewing each record and identifying the sentences that contain biased language.
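
If you want to confirm the label set carried by the checkpoint before running inference, the configuration can be inspected directly; this assumes the id2label mapping was saved with the model, as is standard for transformers token-classification checkpoints.

from transformers import AutoConfig

# Inspect the label mapping stored in the checkpoint's configuration
config = AutoConfig.from_pretrained("shainaraza/clinical-bias-ner")
print(config.num_labels)   # expected: 2
print(config.id2label)     # expected to map label ids to O and BIAS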

Performance

The model achieved an F1-score of 0.93 on the validation set of the dataset.
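
The exact evaluation script is not published with this card, so the number above may have been computed somewhat differently; for reference, a minimal token-level F1 computation (treating BIAS as the positive class and ignoring the -100 positions used for special tokens and padding) could look like this.

from sklearn.metrics import f1_score

def token_f1(true_label_ids, pred_label_ids, ignore_index=-100):
    """Token-level F1 with BIAS (label id 1) as the positive class."""
    y_true, y_pred = [], []
    for true_seq, pred_seq in zip(true_label_ids, pred_label_ids):
        for t, p in zip(true_seq, pred_seq):
            if t == ignore_index:      # skip special tokens / padding
                continue
            y_true.append(t)
            y_pred.append(p)
    return f1_score(y_true, y_pred, pos_label=1)

# Toy check with two sentences: label id 1 = BIAS, 0 = O
true = [[0, 0, 1, 0, 1, -100], [0, 1, 0, -100]]
pred = [[0, 0, 1, 0, 0, -100], [0, 1, 0, -100]]
print(token_f1(true, pred))   # 0.8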

Usage

The model can be used to identify potentially biased language in clinical text data. It can be integrated into a larger NLP pipeline or used as a standalone tool.

To use the model, import the AutoModelForTokenClassification and AutoTokenizer classes from the transformers library and load the model and tokenizer with the from_pretrained() method:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from prettytable import PrettyTable

# Load the model and tokenizer from the Hugging Face model hub
model_name = "shainaraza/clinical-bias-ner"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the text to classify
text = "The patient is a 50-year poor, take drugs and has aggressive behavior."

# Tokenize the text (the encode/decode round trip adds the [CLS] and [SEP] special tokens)
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(text)))

# Convert the tokens to input IDs and build a matching attention mask
input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_mask = [1] * len(input_ids)

# Prepare the input tensors (add a batch dimension)
input_ids = torch.tensor(input_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

# Run the model and take the highest-scoring label for each token
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    predicted_label_ids = torch.argmax(outputs.logits, dim=2).squeeze().tolist()

# Map the predicted label IDs back to label names (O or BIAS)
predicted_labels = [model.config.id2label[label_id] for label_id in predicted_label_ids]

# Display the tokens and their predicted labels in a table
table = PrettyTable(['Token', 'Label'])
for token, label in zip(tokens, predicted_labels):
    table.add_row([token, label])
print(table)

This will output:

+------------+-------+
|   Token    | Label |
+------------+-------+
|   [CLS]    |   O   |
|   [UNK]    |   O   |
|  patient   |   O   |
|     is     |   O   |
|     a      |   O   |
|     50     |   O   |
|     -      |   O   |
|    year    |   O   |
|    poor    |  BIAS |
|     ,      |   O   |
|    take    |   O   |
|   drugs    |   O   |
|    and     |   O   |
|    has     |   O   |
| aggressive |  BIAS |
|  behavior  |   O   |
|     .      |   O   |
|   [SEP]    |   O   |
+------------+-------+
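
Because the pipeline API also returns character offsets, a small wrapper can mark the flagged spans directly in the original note, which is convenient when the model is run as a standalone screening tool over a batch of notes. The highlight_bias helper below is a hypothetical convenience function, not part of the model.

from transformers import pipeline

ner = pipeline("token-classification",
               model="shainaraza/clinical-bias-ner",
               aggregation_strategy="simple")

def highlight_bias(text):
    """Wrap every span the model flags as BIAS in **...** markers."""
    spans = [(e["start"], e["end"]) for e in ner(text) if e["entity_group"] == "BIAS"]
    # Insert markers from the end of the string so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "**" + text[start:end] + "**" + text[end:]
    return text

notes = [
    "The patient is a 50-year-old man, poor looks and malnourished.",
    "The patient is a 50-year poor, take drugs and has aggressive behavior.",
]
for note in notes:
    print(highlight_bias(note))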

Limitations and Future Work

The model is not perfect and may not capture every instance of biased language. It is also important to note that it only flags potentially biased language; it makes no judgment about intent or impact.

In future work, the model could be fine-tuned on a larger and more diverse dataset to improve its performance. Additionally, the model could be extended to identify other types of biased language, such as ageism, racism, or sexism.

Acknowledgments

This model was developed by Shaina Raza as part of her project.

Contact

For any questions or comments, please contact shaina.raza@torontomu.ca
