Biomedical NLP papers - a FremyCompany Collection

Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data

Paper • 2412.00230 • Published about 1 month ago • 1

Note In this study, we investigate the various types of domain proxies used as substitutes for authentic clinical documents, including machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, as well as other types of common proxies such as journal publications.

AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

Paper • 2411.15640 • Published Nov 23 • 4

Note In this work, we introduce AfriMed-QA, the first large scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias.

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Paper • 2411.06469 • Published Nov 10 • 17

Note In this paper, we introduce a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models.

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

Paper • 2411.04118 • Published Nov 6 • 1

Note In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks.

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Paper • 2411.03590 • Published Nov 6 • 9

Note In this work, we evaluate the o1-preview model across various medical benchmarks. Without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. However, we found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning models.

MedINST: Meta Dataset of Biomedical Instructions

Paper • 2410.13458 • Published Oct 17 • 6

Note In this paper, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability.

Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

Paper • 2410.10626 • Published Oct 14 • 37

Note In this work, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a Spread Out in the End information flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence.

CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures

Paper • 2410.05235 • Published Oct 7 • 2

Note In this paper, we present a multilingual dataset for Medical QA where correct and incorrect diagnoses for a clinical case are enriched with a textual explanation written by doctors and manually annotated, resulting in a dataset of 558 clinical cases in four languages with 5k claims, 2k premises, 2k support relations, and 1k attack relations.

Named Clinical Entity Recognition Benchmark

Paper • 2410.05046 • Published Oct 7 • 17

Note This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare. The leaderboard provides a standardized platform for assessing diverse language models on their ability to identify and classify clinical entities across multiple medical domains. These entities are standardized according to the OMOP data model.

Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis

Paper • 2410.03908 • Published Oct 4

Note In this study, we introduce ANGST, a benchmark for depression-anxiety comorbidity classification from social media posts, comprising 2876 posts meticulously annotated by expert psychologists and 7667 silver-labeled posts. ANGST uses multi-label classification, allowing each post to be simultaneously identified as indicating depression and/or anxiety.

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Paper • 2410.02458 • Published Oct 3 • 9

Note This study explores enhancing Vision Transformers for medical image segmentation by integrating pre-trained LLM transformer blocks. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales.

Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation

Paper • 2410.00163 • Published Sep 30

Note This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic-100k-en and MedQuAD datasets, translated from English using GPT-3.5, were used to fine-tune the ChatBode-7B model using the PEFT-QLoRA method.

MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

Paper • 2409.19492 • Published Sep 29

Note In this work, we propose a carefully crafted first-of-its-kind medical hallucination dataset with a diverse range of health-related topics and the corresponding hallucinated responses from LLMs with labeled hallucination types and hallucinated text spans. We also introduce MedHaluDetect framework to evaluate capabilities of various LLMs in detecting hallucinations.

INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large Language Models and Ensemble Learning

Paper • 2409.19467 • Published Sep 28

Note In this work, we investigate state-of-the-art LLMs in text mining tasks on medications and their related attributes such as dosage, route, strength, and adverse effects. In addition, we explore different ensemble learning methods (Stack-Ensemble and Voting-Ensemble) to augment the model performances from individual LLMs.
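
As a rough illustration of the voting idea mentioned in this note, the sketch below combines token-level labels from several models by keeping the most frequent label per token. The function name `majority_vote`, the tie-breaking rule, and the example tags are assumptions made for illustration; they are not details taken from the paper.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-token labels from several models by majority vote.

    predictions[m][i] is the label that model m assigns to token i.
    Ties go to the earliest model in the list (an assumption of this sketch).
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for i in range(n_tokens):
        labels = [p[i] for p in predictions]
        counts = Counter(labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


# Hypothetical tags from three models for the same three tokens.
model_outputs = [
    ["B-Drug", "O", "B-Dosage"],
    ["B-Drug", "B-Dosage", "B-Dosage"],
    ["O", "O", "B-Dosage"],
]
combined = majority_vote(model_outputs)
```

A stacked ensemble would instead train a second model on these per-model outputs; that variant is not sketched here.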

Efficient and Personalized Mobile Health Event Prediction via Small Language Models

Paper • 2409.18987 • Published Sep 17

Note This paper examines the capability of SLMs to accurately analyze health data, such as steps, calories, sleep minutes, and other vital statistics, to assess an individual's health status. Our results indicate that SLMs could potentially be deployed on wearable or mobile devices for real-time health monitoring.

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Paper • 2409.15277 • Published Sep 23 • 34

Note This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes.

Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

Paper • 2409.14988 • Published Sep 23 • 21

Note In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. Our evaluation across various clinical tasks reveals the impact of each technique.

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

Paper • 2409.14074 • Published Sep 21

Note In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. We also establish empirical baselines.

JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

Paper • 2409.13317 • Published Sep 20

Note In this work, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available.

DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction

Paper • 2409.10504 • Published Sep 16 • 1

Note In this work, we propose a mechanistic interpretability module that disentangles dense embeddings into a sparse embedding space, where nonzero elements represent globally learned medical concepts. Our LLM-based feature identification pipeline uncovered these concepts by summarizing the highest activating tokens by feature.

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Paper • 2409.07314 • Published Sep 11 • 50

Note In this work, we introduce MEDIC, a framework assessing LLMs across 5 competences: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs.

Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

Paper • 2408.13833 • Published Aug 25

Note In this study, we evaluated the performance of biomedical LLMs on clinical case challenges from biomedical journals and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). We found that biomedical LLMs mostly perform worse than their general-purpose counterparts, especially on tasks not focused on knowledge.

MultiMed: Massively Multimodal and Multitask Medical Understanding

Paper • 2408.12682 • Published Aug 22

Note In this work, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks.

Towards Evaluating and Building Versatile Large Language Models for Medicine

Paper • 2408.12547 • Published Aug 22

Note In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of LLMs in clinical contexts, spanning 11 clinical tasks. We also developed MedS-Ins, a large-scale instruction tuning dataset for medicine which comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.

RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions

Paper • 2408.08624 • Published Aug 16

Note In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions.

Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering

Paper • 2408.07888 • Published Aug 15 • 11

Note In this study, we extend previous research by evaluating both curriculum-based and non-curriculum-based learning strategies across multiple LLMs, using human-defined and automated data labels for medical question answering. Our results indicate a moderate impact of using human-inspired learning strategies for fine-tuning LLMs.

Med42-v2: A Suite of Clinical LLMs

Paper • 2408.06142 • Published Aug 12 • 50

Note Med42-v2 introduces a suite of clinical LLMs designed to address the limitations of generic models in healthcare settings. These models are built on Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to respond well to natural prompts, understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments.

Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation

Paper • 2408.04187 • Published Aug 8 • 3

Note In this paper, we introduce a novel graph-based RAG framework for the medical domain. Entities extracted from carefully-chunked publications are used to create a 3-tier hierarchical graph structure, linking entities to foundational medical knowledge. They are then interconnected by similarity to form meta-graphs, used with the U-retrieve approach.

BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

Paper • 2408.02600 • Published Aug 5 • 8

Note In this paper, we present BioMamba, a pre-trained model specifically designed for biomedical text mining. BioMamba builds upon the Mamba architecture and is pre-trained on an extensive corpus of biomedical literature. Our empirical studies demonstrate that BioMamba significantly outperforms existing models like BioBERT across various biomedical tasks.

MedSyn: LLM-based Synthetic Medical Text Generation Framework

Paper • 2408.02056 • Published Aug 4 • 1

Note In this study, we introduce MedSyn, a framework that combines large language models with a Medical Knowledge Graph (MKG) to generate synthetic medical text. We use MKG to sample prior medical information for prompts and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the effectiveness of synthetic data by applying it to the ICD code prediction task.

BioRAG: A RAG-LLM Framework for Biological Question Reasoning

Paper • 2408.01107 • Published Aug 2

Note In this paper, we introduce BioRAG, a novel RAG framework for biological question reasoning. We process 22 million scientific papers to build a comprehensive knowledge base and train a specialized embedding model. Additionally, we enhance vector retrieval with a domain-specific knowledge hierarchy and iterative retrieval for up-to-date information.

Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions

Paper • 2408.00727 • Published Aug 1 • 1

Note In this article, we propose i-MedRAG, an iterative RAG system for medical question-answering. i-MedRAG enhances traditional RAG by allowing large language models to iteratively ask follow-up queries based on previous attempts. Our experiments demonstrate that i-MedRAG significantly improves performance on complex medical questions compared to vanilla RAG.

CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis

Paper • 2407.13301 • Published Jul 18 • 55

Note This study introduces Chain-of-Diagnosis (CoD) to enhance the interpretability of LLM-based medical diagnostics. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician's thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making.

LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation

Paper • 2407.12126 • Published Jul 16 • 52

Note This study introduces a novel "LLMs-in-the-loop" approach to develop supervised neural machine translation models optimized specifically for medical texts. While LLMs have demonstrated powerful capabilities, this research shows that small, specialized models trained on high-quality in-domain (mostly synthetic) data can outperform even vastly larger LLMs.

Panacea: A foundation model for clinical trial search, summarization, design, and recruitment

Paper • 2407.11007 • Published Jun 25

Note In this work, we propose a clinical trial foundation model named Panacea, designed to handle multiple tasks, including trial search, trial summarization, trial design, and patient-trial matching. We also assemble a large-scale dataset, named TrialAlign, to infuse clinical knowledge during pre-training, and TrialInstruct, 200k instructions for fine-tuning.

Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation

Paper • 2407.08940 • Published Jul 12

Note In this study, we construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into a training set, a seen test set, and an unseen test set (based on publication date). Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.

CLIMB: A Benchmark of Clinical Bias in Large Language Models

Paper • 2407.05250 • Published Jul 7 • 2

Note This paper introduces CLIMB, a pioneering comprehensive benchmark to evaluate both intrinsic and extrinsic bias in LLMs for clinical decision tasks. Notably, for intrinsic bias, we introduce a novel metric, AssocMAD, to assess the disparities of LLMs across multiple demographic groups. We leverage counterfactual intervention to evaluate extrinsic bias.

How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions

Paper • 2407.05015 • Published Jul 6 • 4

Note This paper introduces a biomedical RAG system designed to enhance the reliability of generated responses. The system is based on an LLM fine-tuned to provide references for each output statement, allowing users to verify the answer. Relevant abstracts retrieved from PubMed are passed into the LLM's context.

MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

Paper • 2407.04106 • Published Jul 4

Note This article introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. The model supports numerous modalities (CT scans, MRIs, ...) and tasks (medical report generation, visual QA, and disease identification).

BioMNER: A Dataset for Biomedical Method Entity Recognition

Paper • 2406.20038 • Published Jun 28

Note In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge LLMs customised to our dataset.

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Paper • 2406.19280 • Published Jun 27 • 61

Note In this work, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision.

MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation

Paper • 2406.17484 • Published Jun 25

Note This paper introduces MedCare, an LLM designed for medical tasks which employs a two-stage fine-tuning pipeline to separate clinical alignment from knowledge aggregation. The first stage uses a Knowledge Aggregator and a Noise Aggregator to encode and filter information, while the second leverages an alignment module to prevent knowledge forgetting.

RaTEScore: A Metric for Radiology Report Generation

Paper • 2406.16845 • Published Jun 24 • 4

Note This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports generated by AI models. RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.

Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

Paper • 2406.16611 • Published Jun 24

Note In this survey, we evaluate 53 language models for the medical domain, focusing on classification and text generation tasks. The considered models are very diverse, ranging from 110M to 13B parameters, spanning the three families of Transformer-based models and from diverse knowledge domains.

EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records

Paper • 2406.16341 • Published Jun 24 • 12

Note This paper presents a new dataset and task specifically designed to ensure data consistency between structured tables and unstructured notes in EHRs. EHRCon was crafted in collaboration with healthcare professionals using the MIMIC-III EHR dataset, and includes manual annotations of 3,943 entities across 105 clinical notes.

Real-time Speech Summarization for Medical Conversations

Paper • 2406.15888 • Published Jun 22 • 1

Note In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. We also present VietMed-Sum, a speech summarization dataset for medical conversations, and a baseline model trained on it.
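
The local/global schedule described in this note can be sketched as a small control loop. This is only an illustration under stated assumptions: `run_session` is a hypothetical name, `summarize` is a placeholder for any summarization model, and computing the global summary from the full list of utterances (rather than from the local summaries) is a choice made for this sketch, not something the note specifies.

```python
from typing import Callable, List, Tuple


def run_session(
    utterances: List[str],
    n: int,
    summarize: Callable[[List[str]], str],
) -> Tuple[List[str], str]:
    """Return (local summaries, global summary) for one conversation.

    A local summary is produced after every n utterances; the global
    summary is produced once, after the conversation has ended.
    """
    if n <= 0:
        raise ValueError("n must be positive")

    local_summaries: List[str] = []
    window: List[str] = []
    for utterance in utterances:
        window.append(utterance)
        if len(window) == n:
            local_summaries.append(summarize(window))
            window = []

    # Fewer than n leftover utterances get no local summary in this sketch;
    # they are still covered by the global summary.
    global_summary = summarize(utterances)
    return local_summaries, global_summary


def join_summarizer(chunk: List[str]) -> str:
    """Toy stand-in for a model: joins the utterances it receives."""
    return " | ".join(chunk)


local, overall = run_session(["a", "b", "c", "d", "e"], 2, join_summarizer)
```

In a streaming deployment the loop body would run as each utterance arrives, with `summarize` called on the current window.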

Infusing clinical knowledge into tokenisers for language models

Paper • 2406.14312 • Published Jun 20

Note This study introduces K-Tokeniser, a technique that initializes the global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like UMLS or the training data of a task related corpus. At training or inference stage, the context is used to pick the best token representation.

Medical Spoken Named Entity Recognition

Paper • 2406.13337 • Published Jun 19

Note In this work, we present VietMed-NER, the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. We also present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence.

Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models

Paper • 2406.12182 • Published Jun 18

Note In this paper, we propose Aquila-Med, a bilingual medical LLM based on Aquila, trained through continued pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). We construct a large-scale Chinese and English medical dataset for continued pre-training and a high-quality SFT dataset, covering extensive medical specialties.

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Paper • 2406.12066 • Published Jun 17 • 8

Note In this study, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks (MedQA and MedMCQA) before and after swapping brand and generic drug names, using physician expert annotations.
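
To make the name-swapping step concrete, here is a minimal sketch that rewrites a question by replacing drug names according to a lookup table. The function `swap_drug_names` and the two-entry mapping are illustrative assumptions; the dataset itself relies on physician expert annotations, which this sketch does not reproduce.

```python
import re
from typing import Dict


def swap_drug_names(text: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word drug names in text using mapping.

    Matching is case-insensitive; the replacement is inserted exactly
    as written in mapping.
    """
    if not mapping:
        return text
    lookup = {name.lower(): target for name, target in mapping.items()}
    # Try longer names first so multi-word names win over their prefixes.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Tiny hand-written brand -> generic table, for illustration only.
brand_to_generic = {"Tylenol": "acetaminophen", "Advil": "ibuprofen"}
swapped = swap_drug_names("Patient took Tylenol and advil.", brand_to_generic)
```

Scoring a model on the original and the swapped version of each question then gives the before/after comparison the note describes.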

Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

Paper • 2406.11328 • Published Jun 17

Note In this paper, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.

CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions

Paper • 2406.09923 • Published Jun 14 • 1

Note In this work, we introduce a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases but also incorporates tasks of clinical significance.

Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing

Paper • 2406.06723 • Published Jun 10

Note In this article, we propose a novel approach for weak supervision in clinical natural language processing by leveraging fine-tuned LLMs to generate weakly-labeled data. We utilize this data to train a downstream BERT model, which is then further fine-tuned on a small set of gold standard data.

MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Paper • 2406.06573 • Published Jun 3 • 9

Note In this study, we introduce an adversarial method designed to test the robustness of LLMs in medical QA by modifying benchmark questions to confound the LLM. We target the MedQA benchmark's strong assumptions about patient characteristics and demonstrate successful "attacks" that trick the LLM into incorrect answers.

Towards a Personal Health Large Language Model

Paper • 2406.06474 • Published Jun 10 • 18

Note In this paper, we introduce the Personal Health Large Language Model (PH-LLM), which is fine-tuned from Gemini to interpret and reason over numerical time-series data related to personal health. We developed three datasets to evaluate PH-LLM's capabilities in generating personalized insights from sleep patterns and physical activity.

MedExQA: Medical Question Answering Benchmark with Multiple Explanations

Paper • 2406.06331 • Published Jun 10

Note This paper introduces MedExQA, a novel benchmark in medical QA to evaluate LLMs' understanding of medical knowledge through explanations. By constructing datasets across five distinct currently-underrepresented medical specialties and by further incorporating multiple explanations for each question-answer pair, we address a major gap in current medical QA benchmarks.

MAIRA-2: Grounded Radiology Report Generation

Paper • 2406.04449 • Published Jun 6

Note In this work, we introduce a large multimodal model that combines a radiology-specific image encoder with an LLM for the task of grounded report generation on chest X-rays. The model utilizes comprehensive inputs including current and prior images and reports, as well as sections of the current report, to improve report quality and reduce hallucinations.

UltraMedical: Building Specialized Generalists in Biomedicine

Paper • 2406.03949 • Published Jun 6

Note In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks.

Enhancing Adverse Drug Event Detection with Multimodal Dataset: Corpus Creation and Model Development

Paper • 2405.15766 • Published May 24

Note In this work, we present a MultiModal Adverse Drug Event (MMADE) detection dataset, merging ADE-related textual information with visual aids. Additionally, we introduce a framework that leverages the capabilities of LLMs and VLMs for ADE detection by generating detailed descriptions of medical images depicting ADEs.

Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation

Paper • 2405.14905 • Published May 23

Note In this paper, we introduce a novel method, Structural Entities extraction and patient indications Incorporation (SEI) for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences.

OLAPH: Improving Factuality in Biomedical Long-form Question Answering

Paper • 2405.12701 • Published May 21 • 1

Note In this article, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We also propose OLAPH, a framework that enables the improvement of factuality through automatic evaluations by iteratively training LLMs to mitigate hallucinations, using sampling predictions and preference optimization.

COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain

Paper • 2405.10893 • Published May 17

Note In this technical paper, we outline the Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring framework with increased difficulty to assess the ability of LLMs in interpreting medical text, accompanied by MCQAs.

Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser

Paper • 2405.09153 • Published May 15

Note This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale.

MedConceptsQA -- Open Source Medical Concepts QA Benchmark

Paper • 2405.07348 • Published May 12

Note We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various LLMs.

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Paper • 2405.05506 • Published May 9 • 1

Note In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora influence the outputs of LLMs.

Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

Paper • 2405.02957 • Published May 5 • 1

Note In this paper, we introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. We show this knowledge is applicable to real-world benchmarks.

Aloe: A Family of Fine-tuned Open Healthcare LLMs

Paper • 2405.01886 • Published May 3 • 3

Note In this work, we introduce the Aloe family, a set of open medical LLMs highly competitive within its scale range. Aloe models are trained on Mistral and LLaMA 3, using a new custom dataset which combines public data sources improved with synthetic CoT, with instruct tuning, model merging, alignment, red teaming and advanced inference schemes as improvement strategies.

Capabilities of Gemini Models in Medicine

Paper • 2404.18416 • Published Apr 29 • 23

Note In this report, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art performance on 10 of them.

Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare

Paper • 2404.16621 • Published Apr 25

Note We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. It offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs.

Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches

Paper • 2404.14779 • Published Apr 23 • 1

Note This study presents a comprehensive analysis and comparison of full-parameter vs parameter-efficient tuning, within the context of medical LLMs. We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities.

emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Paper • 2404.12050 • Published Apr 18

Note In this work, we introduce emrQA-msquad, a medical dataset structured with the SQuAD V2.0 framework and enriched with emrQA medical information. It comprises 160k questions and 4k manually obtained answers, aimed at enhancing the accuracy of Medical QA systems. We also finetuned BERT-type models on the dataset.

MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models

Paper • 2404.10237 • Published Apr 16

Note In this work, we developed MoE-TinyMed, a model tailored for medical applications that significantly lowers parameter demands. In evaluations on the VQA-RAD, SLAKE, and Path-VQA datasets, MoE-TinyMed outperformed LLaVA-Med in all Med-VQA closed settings with just 3.6B parameters.

Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

Paper • 2404.07613 • Published Apr 11

Note In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain.

MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Paper • 2404.05590 • Published Apr 8

Note In this paper, we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by doctors, which can be leveraged to establish various gold-based upper bounds for comparison with LLM performance.

Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

Paper • 2404.00376 • Published Mar 30 • 3

Note We introduce Meerkat-7B, a novel medical AI system with 7 billion parameters. Meerkat-7B was trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets.

Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data

Paper • 2403.19031 • Published Mar 27

Note In this paper, we benchmarked various machine learning models, including classic SVMs, pretrained language models like RoBERTa, BERTweet, and SocBERT, and LLMs such as GPT-3.5 and GPT-4, across six text classification tasks using public social media data. We use LLMs either zero-shot, as annotator, or for data augmentation.

BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

Paper • 2403.18421 • Published Mar 27 • 22

Note In this article, we release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical QA results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam.

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages

Paper • 2403.18336 • Published Mar 27

Note This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types.

Large Language Models in Biomedical and Health Informatics: A Bibliometric Review

Paper • 2403.16303 • Published Mar 24 • 1

Note In this review, we conducted a bibliometric analysis of research articles and collaboration networks from 2022 to 2023 to understand the application of LLMs in Biomedical and Health Informatics. We mapped out key trends and major developments, highlighting how LLMs enhance NLP applications in medical diagnosis, patient engagement, and personalized medicine.

Large Language Model for Mental Health: A Systematic Review

Paper • 2403.15401 • Published Feb 19

Note In this review, we discuss the research methodology used in the paper. The methodology chapter explains the data collection and analysis methods, including the type of research conducted, data collection techniques, and any tools or materials used. It also justifies the methodological choices made, allowing readers to evaluate the reliability and validity of the research.

Polaris: A Safety-focused LLM Constellation Architecture for Healthcare

Paper • 2403.13313 • Published Mar 20 • 2

Note We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM works in healthcare, our work specifically focuses on long multi-turn voice conversations. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents.

Electrocardiogram Instruction Tuning for Report Generation

Paper • 2403.04945 • Published Mar 7 • 1

Note We propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report.

Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People

Paper • 2403.03640 • Published Mar 6 • 2

Note In this article, we describe both the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, and the training of our Apollo models, state-of-the-art LLMs of various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B) which are capable of answering queries in the six most widely spoken languages world-wide.

To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Paper • 2403.01924 • Published Mar 4 • 4

Note This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine, which entails constructing artificial contexts through prompting instead of retrieving the context from PubMed. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

Paper • 2403.01528 • Published Mar 3 • 1

Note In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. The study begins with an overview of biomolecular representations and delves into the integration of linguistic and molecular data, assessing its practical applications and resources.

KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

Paper • 2403.01469 • Published Mar 3

Note We introduce KorMedMCQA, the first Korean multiple-choice QA benchmark derived from Korean healthcare professional licensing examinations, covering from the year 2012 to year 2023. This dataset consists of a selection of questions from the license examinations for doctors, nurses, and pharmacists, featuring a diverse array of subjects.

MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

Paper • 2403.00952 • Published Mar 1

Note In this work, we introduce MediSwift, a suite of efficient sparse pre-trained biomedical language models obtained by inducing up to 75% weight sparsity during pre-training on biomedical text data. The models are further refined through dense fine-tuning and strategic soft prompting, achieving SOTA results on several biomedical tasks.

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Paper • 2402.18060 • Published Feb 28 • 1

Note In this study, we construct two new datasets: JAMA Clinical Challenge and Medbullets. The first consists of questions based on challenging clinical cases, while the second comprises USMLE Step 2&3 style clinical questions. Both datasets are structured as multiple-choice QA tasks, where each question is accompanied by an expert-written explanation.

Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study

Paper • 2402.16689 • Published Feb 26 • 1

Note In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform alternatives.

Towards Building Multilingual Language Model for Medicine

Paper • 2402.13963 • Published Feb 21 • 4

Note In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience from different regions. We construct MMedC, a new multilingual medical corpus (25.5B tokens across 6 languages), and a new MCQA benchmark with rationale. We then finetuned several LLMs and evaluated them on the benchmark.

Benchmarking Retrieval-Augmented Generation for Medicine

Paper • 2402.13178 • Published Feb 20 • 5

Note This work proposes the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets, and discovers a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG.

Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks

Paper • 2402.10597 • Published Feb 16 • 2

Note In this study, we compare different Parameter Efficient Fine-tuning (PEFT) methods for clinical natural language processing tasks, using various sizes of language models. We evaluate the performance of these methods on three clinical tasks: de-identification, assertion detection, and mortality prediction.

BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Paper • 2402.10373 • Published Feb 15 • 9

Note In this paper, we introduce BioMistral, an open-source LLM for the biomedical domain, utilizing Mistral as its foundation and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on 10 medical QA datasets in English. We also explore lightweight models obtained through quantization and model merging approaches.

Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations

Paper • 2402.07023 • Published Feb 10 • 4

Note In this work, we evaluate both open-source and Google’s new multimodal LLM called Gemini across medical reasoning, hallucination detection, and medical visual question answering tasks. We also perform a detailed analysis by medical subject and test type. We release a Python module for medical LLM evaluation.

RareBench: Can LLMs Serve as Rare Diseases Specialists?

Paper • 2402.06341 • Published Feb 9

Note In this work, we introduce RareBench, a novel benchmark for assessing the performance of large language models (LLMs) on rare disease diagnosis and analysis. We also provide a rich dataset of rare disease cases, and a novel method to generate dynamic prompts using a rare disease knowledge graph. Our results show that our method improves LLMs’ diagnostic accuracy and interpretability.

Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset

Paper • 2402.05547 • Published Feb 8 • 1

Note We introduce ChatCoach, a system that helps medical students improve their communication skills with patients. It uses two AI agents: one that acts as a patient and one that acts as a coach. The student can talk to the patient agent and get feedback from the coach agent in real time. We compare the performance of ChatGPT and Llama2 for this task.

SA-MDKIF: A Scalable and Adaptable Medical Domain Knowledge Injection Framework for Large Language Models

Paper • 2402.00474 • Published Feb 1

Note In this study, we present SA-MDKIF, a framework that aims to inject medical knowledge into LLMs through instruction tuning, thereby enabling adaptability for various downstream tasks. SA-MDKIF consists of two stages: skill training and skill adaptation. We train a skill router to integrate the acquired skills with LLMs during inference.

Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks using Multiple Embedding Model for EHR (MEME)

Paper • 2402.00160 • Published Jan 31

Note In this work, we introduce MEME, an approach that views EHR as multimodal data. This approach incorporates "pseudo-notes", textual representations of tabular EHR concepts such as diagnoses and medications, and allows us to effectively employ LLMs for EHR representation.

Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

Paper • 2401.15713 • Published Jan 28 • 2

Note In this paper, we target this issue by assembling niche datasets using co-citations as a similarity metric, focusing on biomedical domains. We employ two key strategies: 1. Domain-specific Fine-Tuning, and 2. Universal Applicability with Mixture of Experts (MoE), adapting pretrained models with enforced routing for multiple domains simultaneously.

K-QA: A Real-World Medical Q&A Benchmark

Paper • 2401.14493 • Published Jan 25

Note We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health. We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models.

LongHealth: A Question Answering Benchmark with Long Clinical Documents

Paper • 2401.14490 • Published Jan 25 • 3

Note We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in three categories (information extraction, negation, and sorting), challenging LLMs to extract and interpret information from large clinical documents.

PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge

Paper • 2401.11048 • Published Jan 19 • 2

Note PubTator 3.0 is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles.

Towards Conversational Diagnostic AI

Paper • 2401.05654 • Published Jan 11 • 16

Note In this work, we introduce AMIE (Articulate Medical Intelligence Explorer), an LLM-based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts.

PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Paper • 2401.02797 • Published Jan 5

Note In this paper, we propose a parameter efficient framework for fine-tuning MLLM specifically tailored to Med-VQA applications, and empirically validate it on a public benchmark dataset. We outperform the GPT-4v model by a significant margin of 26% absolute accuracy on closed-ended questions, based on a human evaluation.

Generalist embedding models are better at short-context clinical semantic search than specialized embedding models

Paper • 2401.01943 • Published Jan 3 • 6

Note This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain.

MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries

Paper • 2401.01596 • Published Jan 3

Note This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization (MMCQS) dataset, which combines Hindi-English codemixed medical queries with visual aids.

Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing

Paper • 2401.00579 • Published Dec 31, 2023 • 2

Note Our study investigates the potential of instruction tuning for biomedical language processing, applying this technique to two general LLMs of substantial scale. We present a comprehensive, instruction-based model trained on a dataset that consists of approximately 200,000 instruction-focused samples.

Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

Paper • 2312.00567 • Published Dec 1, 2023

Note We present a new dataset which (i) includes explanatory arguments for both correct and incorrect answers and (ii) is written by medical doctors to answer questions from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to set up a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors.

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Paper • 2311.16079 • Published Nov 27, 2023 • 20

Note In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2, and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines.

BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights

Paper • 2311.16075 • Published Nov 27, 2023 • 6

Note In this paper, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models and introduce BioLORD-2023, a state-of-the-art model for semantic textual similarity and biomedical concept representation designed for the clinical domain.

Overview of Current Applications of Large Language Models in Various Medical Specialities

Paper • 2311.12882 • Published Oct 28, 2023 • 1

Note This paper gives an overview of the latest applications of Large Language Models (LLMs) in the healthcare sector, highlighting their transformative role in enhancing medical care quality. We explore their utilization in various medical specialties, such as cancer diagnostics, dentistry, nephrology, dermatology, etc.

KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model

Paper • 2311.11564 • Published Nov 20, 2023 • 1

Note We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach. We achieve a biomedical multilingual corpus by incorporating three granularity knowledge alignments (entity, fact, and passage levels) into monolingual corpora.

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

Paper • 2311.10537 • Published Nov 16, 2023 • 3

Note We propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages LLM-based agents playing different roles and participating in a cooperative dialogue, which enhances their LLM competencies and reasoning skills. This framework is training-free and intuitive.

HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs

Paper • 2311.09774 • Published Nov 16, 2023 • 1

Note We propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine.

Autoregressive Language Models For Estimating the Entropy of Epic EHR Audit Logs

Paper • 2311.06401 • Published Nov 10, 2023 • 1

Note Existing techniques to measure the complexity of workflow through EHR audit logs involve time- or frequency-based cross-sectional aggregations that are unable to capture the full complexity of an EHR session. We evaluate the usage of transformer-based tabular LMs in measuring the entropy of action sequences within workflow and release the evaluated models publicly.

Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach

Paper • 2311.06364 • Published Nov 10, 2023 • 1

Note We address the challenge of developing Relation Extraction models in biomedical areas, focusing on the sparsity of labeled data, particularly in the natural-products literature. We introduce a novel Greedy Maximum Entropy sampler to create a curated evaluation dataset and training sets using the LOTUS database.

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Paper • 2311.06025 • Published Nov 10, 2023 • 1

Note We propose ChiMed-GPT, a new benchmark LLM designed explicitly for the Chinese medical domain, with context length enlarged to 4,096 tokens. It undergoes a comprehensive training regime with pre-training, SFT, and RLHF, and is evaluated on real-world tasks including information extraction, question answering, and dialogue generation.

BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Paper • 2310.19975 • Published Oct 30, 2023 • 1

Note We created BioInstruct, comprising 25,005 instructions to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three-seed samples randomly drawn from 80 human curated instructions. We then evaluated instruction-tuned LLMs on several BioNLP tasks.

MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

Paper • 2310.14088 • Published Oct 21, 2023 • 1

Note This study assesses the ability of state-of-the-art large language models (LLMs) including GPT-3.5, GPT-4, Falcon, and LLaMA 2 to identify patients with mild cognitive impairment (MCI) from discharge summaries and examines instances where the models' responses were misaligned with their reasoning.

Rather a Nurse than a Physician -- Contrastive Explanations under Investigation

Paper • 2310.11906 • Published Oct 18, 2023 • 1

Note Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain decisions. We fine-tune and extract explanations from 3 chat models. A comparison between human and model rationales, both in contrastive and non-contrastive settings, shows that humans do not necessarily explain in a contrastive manner.

xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization

Paper • 2310.11275 • Published Oct 17, 2023 • 1

Note We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model.

Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models

Paper • 2310.11266 • Published Oct 17, 2023 • 1

Note We introduce BooksMed, a novel framework based on a Large Language Model (LLM) which uniquely emulates human cognitive processes to deliver evidence-based and reliable responses, utilizing the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength.

JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning

Paper • 2310.10083 • Published Oct 16, 2023 • 2

Note We show the contribution of LoRA-based instruction-tuning to performance in Japanese medical question-answering tasks. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models demonstrating more pronounced effects.

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Paper • 2310.07276 • Published Oct 11, 2023 • 5

Note We propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature.

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Paper • 2310.05694 • Published Oct 9, 2023 • 3

Note This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs.

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

Paper • 2310.00274 • Published Sep 30, 2023 • 3

Note We release AfriSpeech, 200hrs of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.

MedEdit: Model Editing for Medical Question Answering with External Knowledge Bases

Paper • 2309.16035 • Published Sep 27, 2023 • 1

Note Our study delves into model editing utilizing in-context learning, aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then we incorporate them into the query prompt for the LLM.

Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts

Paper • 2309.13202 • Published Sep 22, 2023 • 1

Note In this work, we investigate the ability of state-of-the-art large language models (LLMs) on the task of biomedical abstract simplification, using the publicly available dataset for plain language adaptation of biomedical abstracts (PLABA).

HealthFC: A Dataset of Health Claims for Evidence-Based Medical Fact-Checking

Paper • 2309.08503 • Published Sep 15, 2023 • 1

Note We introduce a dataset of 750 health-related claims, labeled for veracity by medical experts and backed with evidence from appropriate clinical studies. The dataset can be used for tasks related to automated fact-checking such as evidence retrieval, veracity prediction, and explanation generation.

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

Paper • 2309.07430 • Published Sep 14, 2023 • 27

Note In this work, we employ domain adaptation methods on eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods.

Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

Paper • 2309.00237 • Published Sep 1, 2023 • 3

Note In this article, we create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones.

BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

Paper • 2308.16458 • Published Aug 31, 2023 • 10

Note We present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates functions and methods in Python and Java from GitHub and the Rosalind Project.

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Paper • 2308.14089 • Published Aug 27, 2023 • 29

Note We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs.

CMB: A Comprehensive Medical Benchmark in Chinese

Paper • 2308.08833 • Published Aug 17, 2023 • 1

Note We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety.

BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition

Paper • 2308.08625 • Published Aug 16, 2023 • 2

Note This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We also propose and evaluate initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found.

Large Language Models to Identify Social Determinants of Health in Electronic Health Records

Paper • 2308.06354 • Published Aug 11, 2023 • 3

Note This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data.

Med-HALT: Medical Domain Hallucination Test for Large Language Models

Paper • 2307.15343 • Published Jul 28, 2023 • 2

Note This research paper focuses on the challenges posed by hallucinations in LLMs, particularly in the context of the medical domain. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT includes two categories of tests: reasoning and memory-based hallucination tests.

Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Paper • 2307.14385 • Published Jul 26, 2023 • 2

Note In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data.

Towards Generalist Biomedical AI

Paper • 2307.14334 • Published Jul 26, 2023 • 11

Note Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin.

Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section

Paper • 2307.07051 • Published Jul 13, 2023 • 1

Note We propose a framework to analyze the sections with high predictive power. Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large.

Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

Paper • 2307.06439 • Published Jul 12, 2023 • 9

Note In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs.

EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models

Paper • 2307.02028 • Published Jul 5, 2023 • 3

Note First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Third, we define 15 few-shot clinical prediction tasks.

BioCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Paper • 2307.00589 • Published Jul 2, 2023 • 1

Note We introduce BioCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot biomedical IR. To train BioCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker.

How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain

Paper • 2307.00186 • Published Jul 1, 2023 • 1

Note This paper aims to provide a thorough investigation comparing the performance of LMs in medical few-shot NER, to answer how far LMs are from 100% few-shot NER in the medical domain, and moreover to explore an effective entity recognizer to help improve the NER performance.

Biomedical Language Models are Robust to Sub-optimal Tokenization

Paper • 2306.17649 • Published Jun 30, 2023 • 1

Note In this work, we first find that standard open-domain and biomedical tokenizers are largely unable to segment biomedical terms into meaningful components. But surprisingly, we find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.

CamemBERT-bio: a Tasty French Language Model Better for your Health

Paper • 2306.15550 • Published Jun 27, 2023 • 3

Note We propose a new French public biomedical dataset on which we have continued the pre-training of CamemBERT. Thus, we introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical domain that shows 2.54 points of F1 score improvement on average on different biomedical named entity recognition tasks.

Radiology-GPT: A Large Language Model for Radiology

Paper • 2306.08666 • Published Jun 14, 2023 • 1

Note We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA.

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

Paper • 2306.08018 • Published Jun 13, 2023 • 4

Note We introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset expressly designed for the biomolecular realm. Mol-Instructions is composed of three pivotal components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions.

Multilingual Clinical NER: Translation or Cross-lingual Transfer?

Paper • 2306.04384 • Published Jun 7, 2023 • 1

Note This paper compares cross-lingual transfer with two translation-based alternatives for performing clinical NER in French and in German without any training data in those languages. To this end, we release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset.

ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

Paper • 2306.02022 • Published Jun 3, 2023 • 1

Note In this paper, we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Paper • 2306.00890 • Published Jun 1, 2023 • 10

Note We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, and to use GPT-4 to self-instruct instruction-following data from the captions.

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks

Paper • 2305.17100 • Published May 26, 2023 • 2

Note In this paper, we introduce BiomedGPT, a unified and generalist Biomedical Generative Pre-trained Transformer for vision, language, and multimodal tasks.

Towards Expert-Level Medical Question Answering with Large Language Models

Paper • 2305.09617 • Published May 16, 2023 • 5

Note We present Med-PaLM 2, which leverages a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach.

Dr. LLaMA: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation

Paper • 2305.07804 • Published May 12, 2023 • 2

Note In this paper, we introduce Dr. LLaMA, a method for improving SLMs through generative data augmentation using LLMs, focusing on medical question-answering tasks and the PubMedQA dataset. Our findings indicate that LLMs effectively refine and diversify existing question-answer pairs.

RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models

Paper • 2305.01146 • Published May 2, 2023 • 1

Note We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS). Our results on the MIMIC-III dataset consistently demonstrate best performance by maximally adapting to the task via pretraining on clinical text and parameter-efficient fine-tuning on RRS examples.

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese

Paper • 2304.08999 • Published Apr 18, 2023 • 2

Note In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. Since there was no annotated corpus for biomedical entity extraction in Portuguese prior to this work, we also present the strategy we followed in annotating the corpus for the development of the models.

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

Paper • 2304.00958 • Published Apr 3, 2023 • 1

Note In this paper, we propose an original study of PLMs in the medical domain for the French language. We also release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.

ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge

Paper • 2303.14070 • Published Mar 24, 2023 • 11

Note We collected more than 700 diseases and their corresponding symptoms, recommended medications, and required medical tests, and then generated 5K doctor-patient conversations. Models finetuned on these emerge with great potential to understand patients' needs, provide informed advice, and offer valuable assistance in a variety of medical-related fields.

Capabilities of GPT-4 on Medical Challenge Problems

Paper • 2303.13375 • Published Mar 20, 2023 • 1

Note We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a tuned version of Flan-PaLM 540B).

MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain

Paper • 2303.08179 • Published Mar 14, 2023 • 2

Note The model has been trained on a large corpus of 4.7 million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the model, this paper also conducts an in-depth analysis of its capabilities.

Almanac: Retrieval-Augmented Language Models for Clinical Medicine

Paper • 2303.01229 • Published Mar 1, 2023 • 1

Note Large language models have a tendency to generate factually incorrect and sometimes even toxic statements. By enabling these models to access external point-of-care tools in response to physician queries, we demonstrate significantly improved factual grounding, helpfulness, and safety in a variety of clinical scenarios.

Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing

Paper • 2303.00915 • Published Mar 2, 2023 • 6

Note In this paper, we conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. BiomedCLIP established a new state of the art on a wide range of standard datasets and substantially outperformed prior VLP approaches.

Do We Still Need Clinical Language Models?

Paper • 2302.08091 • Published Feb 16, 2023 • 3

Note We show that relatively small specialized clinical models substantially outperform all in-context learning approaches, even when finetuned on limited annotated data. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text.

EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records

Paper • 2301.07695 • Published Jan 16, 2023 • 1

Note We present a new text-to-SQL dataset for electronic health records (EHRs). The utterances were collected from 222 hospital staff, including physicians, nurses, insurance review and health records teams, and more. Our dataset poses unique challenges: 1) generate SQL queries, 2) understand various time expressions, and 3) distinguish whether a given question is answerable.

Large Language Models Encode Clinical Knowledge

Paper • 2212.13138 • Published Dec 26, 2022 • 3

Note We present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA.

Scientific and Creative Analogies in Pretrained Language Models

Paper • 2211.15268 • Published Nov 28, 2022 • 1

Note This paper examines the encoding of analogy in large-scale pretrained language models. Existing analogy datasets typically focus on a limited set of analogical relations, with a high similarity of the two domains between which the analogy holds. On the other hand, SCAN contains systematic mappings of multiple attributes and relational structures across dissimilar domains.

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Paper • 2211.12737 • Published Nov 23, 2022 • 2

Note We fine-tuned a diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We present evidence that the resulting model is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled by using free-form text prompts including radiology-specific language.

A Large-Scale Dataset for Biomedical Keyphrase Generation

Paper • 2211.12124 • Published Nov 22, 2022 • 1

Note We introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for present and absent keyphrase generation.

AF Adapter: Continual Pretraining for Building Chinese Biomedical Language Model

Paper • 2211.11363 • Published Nov 21, 2022 • 1

Note Sequential task training may cause catastrophic forgetting, so we propose a continual pretraining method for the BERT-based model. Despite training only 3% of model parameters, our method could achieve better-than-SOTA performance (on Chinese biomedical tasks).

Galactica: A Large Language Model for Science

Paper • 2211.09085 • Published Nov 16, 2022 • 4

Note In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. It sets a new state of the art on downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%).

BioLORD: Learning Ontological Representations from Definitions (for Biomedical Concepts and their Textual Descriptions)

Paper • 2210.11892 • Published Oct 21, 2022 • 2

Note In this work, we propose a new method for learning vector representations of biomedical terms that are based on definitions and descriptions from a knowledge graph. Thanks to this grounding, our model produces concept representations that are more semantic than those of SapBERT and that match the hierarchical structure of ontologies more closely. The model also generalizes to clinical sentence similarity (STS).