Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16
Model Overview
This model is a fine-tuned and quantized version of the microsoft/biogpt model, specifically tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and subsequently quantized to FP16 (half-precision) to reduce memory usage and improve inference speed while largely preserving accuracy. The model is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.
Model Details
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half-precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
Intended Use
This model is intended for:
- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names.
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.
Out of Scope
- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.
Training Details
Dataset
The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two attributes:
- text: The content of the excerpt.
- book: The name of the book (e.g., "Gray's Anatomy").
Dataset Splits:
- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows).
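The split can be reproduced with a few lines of the datasets library. This is a minimal sketch; the seed value is illustrative, not necessarily the one used in the original run.
from datasets import load_dataset

# Load the original single-split dataset from the Hugging Face Hub
raw_datasets = load_dataset("dmedhi/medical-textbooks")

# Carve a 20% validation set out of the original train split (seed is illustrative)
splits = raw_datasets["train"].train_test_split(test_size=0.2, seed=42)
train_ds = splits["train"]   # ~12,776 rows
val_ds = splits["test"]      # ~3,194 rows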
Training Procedure
Preprocessing:
- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling.
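A minimal sketch of this preprocessing step, assuming the train_ds/val_ds splits from the snippet above; the function name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def preprocess(batch):
    # Tokenize excerpts to a fixed length of 512 tokens with truncation and padding
    tokens = tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")
    # For causal language modeling, the labels are simply the input IDs
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)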
Fine-Tuning:
- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API.
Training arguments:
- Epochs: 1
- Batch size: 4 per device
- Learning rate: 2e-5
- Mixed precision: FP16 (fp16=True)
- Evaluation strategy: Steps (every 1000 steps)
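A minimal sketch of the corresponding Trainer setup, assuming the tokenized splits from the preprocessing snippet; the output directory is illustrative.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

training_args = TrainingArguments(
    output_dir="./biogpt_finetuned",   # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,                         # mixed-precision training
    evaluation_strategy="steps",
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)
trainer.train()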
Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
Validation loss decreased from 2.7317 to 2.6512.
Quantization:
- Converted the fine-tuned model to FP16 using PyTorch's .half() method.
- Saved as ./biogpt_finetuned/final_model_fp16.
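A minimal sketch of the conversion and save step, assuming the fine-tuned model and its tokenizer are still in memory:
# Cast all floating-point weights to half precision
model = model.half()

# Save the FP16 model together with its tokenizer for later reloading
save_path = "./biogpt_finetuned/final_model_fp16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)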
Compute Infrastructure
- Hardware: 12 GB GPU (NVIDIA)
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: Approximately 27 minutes for 1 epoch
Evaluation
Metrics
Training Loss: Decreased from 2.8409 to 2.7006.
Validation Loss: Decreased from 2.7317 to 2.6512.
Memory Usage: Post-quantization memory usage reported as ~661 MB (FP16), though actual savings may vary due to buffers and non-weight tensors.
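The parameter footprint can be checked directly in PyTorch. This sketch assumes the FP16 model has been loaded as model (see Usage below) and counts parameter tensors only, which is why the reported figure excludes buffers and other non-weight tensors.
# Sum the bytes held by parameter tensors (FP16 = 2 bytes per element)
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Parameter memory: {param_bytes / 1024**2:.0f} MB")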
Qualitative Testing
Generative Task: Generated a response to "What is anatomy?" with reasonable output: "What is anatomy? Anatomy is the basis of medicine..."
Keyword Extraction: Successfully extracted up to 10 lines containing keywords (e.g., "anatomy") with corresponding book names (e.g., "Gray's Anatomy").
Usage
Installation
- Ensure you have the required libraries installed:
pip install transformers torch datasets sacremoses
Loading the Model
- Load the quantized FP16 model and tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_path = "path/to/biogpt_finetuned/final_model_fp16" # Update with your HF repo path
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)  # load the saved FP16 weights without upcasting to FP32
tokenizer = AutoTokenizer.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
Example 1: Generative Inference
Generate text with the quantized model:
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
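Note that generate defaults to greedy decoding; passing sampling arguments such as do_sample=True, top_p, or temperature produces more varied (but less deterministic) output.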
Example 2: Keyword Extraction
from datasets import load_from_disk
original_datasets = load_from_disk('path/to/original_medical_textbooks')  # Update with your local copy of the dataset
def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        # Scan each line of the excerpt for a case-insensitive keyword match
        lines = text.split('\n')
        for line in lines:
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                # Stop early once the requested number of matches is collected
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines
keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f" Book: {match['book']}\n")
Limitations
- Quantization Trade-offs: FP16 quantization may lead to minor accuracy degradation, though not extensively evaluated.
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- Generative Quality: Generative outputs may require human oversight for correctness.
- Scalability: Keyword extraction relies on string matching, not semantic understanding, limiting its ability to capture nuanced relationships.