MedEmbed: Specialized Embedding Model for Medical and Clinical Information Retrieval
Model Description
MedEmbed is a family of embedding models fine-tuned specifically for medical and clinical data, designed to enhance performance in healthcare-related natural language processing (NLP) tasks, particularly information retrieval.
GitHub Repo: https://github.com/abhinand5/MedEmbed
Technical Blog Post: https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir
Intended Use
This model is intended for use in medical and clinical contexts to improve information retrieval, question answering, and semantic search tasks. It can be integrated into healthcare systems, research tools, and medical literature databases to enhance search capabilities and information access.
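As a minimal sketch of the semantic-search use case, the snippet below ranks candidate passages against a query by cosine similarity. It assumes the `sentence-transformers` library is installed and that the `abhinand/MedEmbed-small-v0.1` checkpoint named on this card loads through it. The helper names (`cosine`, `rank_passages`, `embed`, `demo`) and the sample texts are illustrative only and are not part of any MedEmbed API.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(texts: List[str], model_name: str = "abhinand/MedEmbed-small-v0.1"):
    """Encode texts into vectors. Assumes sentence-transformers is installed;
    the first call downloads the model weights."""
    from sentence_transformers import SentenceTransformer  # optional dependency

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()


def demo() -> None:
    """Illustrative end-to-end run; the query and passages are made up."""
    query = "What are common side effects of metformin?"
    passages = [
        "Metformin frequently causes gastrointestinal upset such as nausea.",
        "The femur is the longest bone in the human body.",
    ]
    vectors = embed([query] + passages)
    for idx, score in rank_passages(vectors[0], vectors[1:]):
        print(f"{score:.3f}  {passages[idx]}")
```

Calling `demo()` requires network access to fetch the model. `rank_passages` works on any precomputed vectors, so in a real system the passage embeddings would typically be computed once and stored in an index.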
Training Data
The model was trained on data produced by a simple synthetic data generation pipeline:
- Source: Clinical notes from PubMed Central (PMC)
- Processing: LLaMA 3.1 70B model used to generate query-response pairs
- Augmentation: Negative sampling for challenging examples
- Format: Triplets (query, positive response, negative response) for contrastive learning
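The triplet format above can be sketched as follows. This card does not state which loss was used, so the InfoNCE-style objective here is an assumption chosen to show how a (query, positive, negative) triplet drives contrastive learning. The `Triplet` class, the `triplet_contrastive_loss` function, the temperature value, and the sample texts are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Triplet:
    """One training example: a query, a relevant passage, an irrelevant passage."""
    query: str
    positive: str
    negative: str


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when inputs are unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def triplet_contrastive_loss(
    q: Sequence[float],
    pos: Sequence[float],
    neg: Sequence[float],
    temperature: float = 0.05,  # assumed value, not taken from the card
) -> float:
    """InfoNCE-style loss with one positive and one negative:
    -log( exp(s_pos / t) / (exp(s_pos / t) + exp(s_neg / t)) )
    """
    s_pos = dot(q, pos) / temperature
    s_neg = dot(q, neg) / temperature
    # Subtract the max before exponentiating for numerical stability.
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


# Made-up example of what a single triplet might look like.
example = Triplet(
    query="first-line treatment for type 2 diabetes",
    positive="Metformin is generally recommended as initial pharmacotherapy.",
    negative="Type 1 diabetes results from autoimmune destruction of beta cells.",
)
```

The loss is small when the query embedding is closer to the positive than to the negative, and it grows when the order is reversed. A negative that is topically close to the query, as in the example, gives a stronger training signal than a random passage.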
Performance
MedEmbed is reported to outperform general-purpose embedding models on the following retrieval benchmarks:
- ArguAna
- MedicalQARetrieval
- NFCorpus
- PublicHealthQA
- TRECCOVID
Specific performance metrics (nDCG, MAP, Recall, Precision, MRR) are available in the full documentation.
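For readers unfamiliar with the metrics named above, here is a rough sketch of how three of them are computed for a single query. It is not the evaluation code behind the reported numbers. The function names are illustrative, and the nDCG variant uses linear gain, which is an assumption about the convention.

```python
import math
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)) over ranks i = 1..k,
    normalized by the DCG of the ideal ordering."""
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function scores one query. A benchmark score is the average over all queries in the test set.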
Limitations
While highly effective for medical and clinical data, this model may not generalize well to non-medical domains. It should be used with caution in general-purpose NLP tasks.
Ethical Considerations
Users should be aware of potential biases in medical data and the ethical implications of AI in healthcare. This model should be used as a tool to assist, not replace, human expertise in medical decision-making.
Citation
If you use this model in your research, please cite:
@software{balachandran2024medembed,
author = {Balachandran, Abhinand},
title = {MedEmbed: Medical-Focused Embedding Models},
year = {2024},
url = {https://github.com/abhinand5/MedEmbed}
}
For more detailed information, visit our GitHub repository.
Model tree for abhinand/MedEmbed-small-v0.1
Base model: BAAI/bge-small-en-v1.5
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 72.174
- ap on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 21.758
- ap_weighted on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 21.758
- f1 on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 59.803
- f1_weighted on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 77.376
- main_score on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 72.174
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 71.284
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 33.514
- ap_weighted on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 33.514
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 65.078