
medBERT-base

This repository contains a BERT-based model, medBERT-base, fine-tuned on the gayanin/pubmed-gastro-maskfilling dataset for Masked Language Modeling (MLM). The model is trained to predict masked tokens in medical and gastroenterological texts. The goal of this project is to improve the model's understanding of medical language and its ability to fill in missing terms in natural-language contexts.

Model Architecture

  • Base Model: bert-base-uncased
  • Task: Masked Language Modeling (MLM) for medical texts
  • Tokenizer: BERT's WordPiece tokenizer (see the tokenization example below)
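
As a quick illustration (a minimal sketch, not part of the original card), you can inspect how the WordPiece tokenizer splits a medical sentence into subword pieces:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')

# Long medical terms are broken into WordPiece subwords; continuation pieces are prefixed with '##'
print(tokenizer.tokenize("Endoscopic evaluation of gastroesophageal reflux disease"))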

Usage

Loading the Pre-trained Model

You can load the pre-trained medBERT-base model using the Hugging Face transformers library:

from transformers import BertTokenizer, BertForMaskedLM
import torch

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to(device)
model.eval()

input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the top-k predictions for it
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")

Top 5 predictions:
1: from
2: of
3: after
4: by
5: through
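
Alternatively, the same check can be done with the transformers fill-mask pipeline, which handles tokenization, masking, and decoding in one call (a minimal sketch, not part of the original card):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="suayptalha/medBERT-base")

# Each prediction is a dict containing the candidate token and its score
for pred in fill_mask("Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."):
    print(pred["token_str"], round(pred["score"], 3))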

Fine-tuning the Model

To fine-tune the medBERT-base model on your own medical dataset, follow these steps:

  1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
  2. Tokenize the dataset and apply masking.
  3. Train the model using the provided training loop.

The full training code is available as a notebook:

https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb
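
For reference, here is a minimal, self-contained sketch of the same fine-tuning setup using the Hugging Face Trainer API and the hyperparameters listed under Training Details below. The "train" split and "text" column names are assumptions; check the dataset card for the actual field names.

from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumes a "train" split; adjust if the dataset uses different split names
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling")

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # "text" is an assumed column name; adjust to the dataset's actual schema
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

# Randomly masks 15% of input tokens and builds the MLM labels on the fly
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="medBERT-base",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()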

Training Details

Hyperparameters

  • Batch Size: 16
  • Learning Rate: 5e-5
  • Number of Epochs: 1
  • Max Sequence Length: 512 tokens

Dataset

  • Dataset Name: gayanin/pubmed-gastro-maskfilling
  • Task: Masked Language Modeling (MLM) on medical texts

Acknowledgements

  • The gayanin/pubmed-gastro-maskfilling dataset is available on the Hugging Face dataset hub and provides a rich collection of medical and gastroenterology-related texts for training.
  • This model was built with the Hugging Face transformers library, a state-of-the-art library for NLP models.

Support:

suayptalha


