---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- math
---
![medBERT-logo](medBERT.png)
# **medBERT-base**
This repository contains **medBERT-base**, a BERT-based model fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts. The goal of the project is to improve how well the model represents and predicts medical language in natural-language contexts.
## **Model Architecture**
- **Base Model**: `google-bert/bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer
## **Usage**
### **Loading the Pre-trained Model**
You can load **medBERT-base** with the Hugging Face `transformers` library and use it to predict a masked token:
```py
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to(device)

input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to(device)
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)

# Position of the [MASK] token in the input sequence.
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# Keep the five highest-scoring vocabulary tokens at that position.
top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")
```
_Top 5 predictions:_
_1: from_
_2: of_
_3: after_
_4: by_
_5: through_
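The ranking above is based on raw logits, which are unnormalized scores. If you want probabilities instead, you can apply a softmax over the vocabulary dimension. The sketch below illustrates the idea in plain Python on a made-up list of logits. The helper names `softmax` and `top_k` are illustrative only and are not part of this repository. On a real tensor, `torch.softmax(logits, dim=-1)` performs the same normalization.

```py
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k):
    """Return the k highest-probability (index, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Hypothetical logits for a four-token vocabulary (not real model output).
toy_logits = [0.1, 2.0, -1.0, 1.5]
for index, prob in top_k(toy_logits, k=2):
    print(f"token id {index}: p={prob:.3f}")
```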
### **Fine-tuning the Model**
To fine-tune **medBERT-base** on your own medical dataset, follow these steps:
1. Prepare your dataset (e.g., medical or gastroenterology-related texts) as plain text.
2. Tokenize the dataset and apply masking.
3. Train the model using the training loop from the notebook linked below.
The training code is available here:
https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb
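To make step 2 more concrete, here is a minimal sketch of the standard BERT masking rule in plain Python: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. This is an illustration of the general technique, not the code from the notebook above. The function name `mask_tokens` and the sample token IDs are hypothetical. The special-token IDs (101, 102, 103) and the vocabulary size (30522) are those of `bert-base-uncased`. In practice, `DataCollatorForLanguageModeling` from `transformers` applies this kind of masking for you.

```py
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) using BERT's 80/10/10 masking rule."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 is the default index ignored by PyTorch's cross-entropy loss.
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return inputs, labels

# Hypothetical token IDs wrapped in [CLS] (101) and [SEP] (102).
ids = [101, 2023, 2003, 1037, 7099, 6251, 102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
print(masked)
print(labels)
```

Positions whose label is `-100` do not contribute to the loss, so the model is only trained on the tokens that were selected for masking.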
## **Training Details**
### **Hyperparameters**
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens
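As a rough guide to what these settings imply, the sketch below computes the number of optimizer steps for a given dataset size. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The function name and the example dataset size of 10,000 are hypothetical and do not describe the actual training run.

```py
import math

# Hyperparameters listed above.
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
NUM_EPOCHS = 1
MAX_SEQ_LENGTH = 512

def total_optimizer_steps(num_examples, batch_size=BATCH_SIZE, num_epochs=NUM_EPOCHS):
    """Number of optimizer updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size, for illustration only.
print(total_optimizer_steps(10_000))  # 10,000 / 16 = 625 steps
```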
### **Dataset**
- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts
## **Acknowledgements**
- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face dataset hub and provides medical and gastroenterology-related text for training.
- This model is built with the Hugging Face `transformers` library.
<h3 align="left">Support:</h3>
<p><a href="https://www.buymeacoffee.com/suayptalha"> <img align="left" src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" height="50" width="210" alt="suayptalha" /></a></p><br><br>