pedromoreira22 committed on
Commit 727d887 • 1 Parent(s): f950074

first draft blog

Files changed (1)
  1. app.py +133 -2
app.py CHANGED
@@ -62,6 +62,135 @@ df['Average Accuracy (Original and G2B)'] = (df['Average G2B Accuracy'] + df['Av
  #df['Adjusted Robustness Score'] = df['Adjusted Robustness Score'].round(2)

@@ -323,7 +452,9 @@ with gr.Blocks(css="custom.css") as demo:
          elem_classes="markdown-text"
      )

-

      with gr.Row():
          bar3 = gr.Plot(
@@ -342,4 +473,4 @@ with gr.Blocks(css="custom.css") as demo:


  if __name__ == "__main__":
-     demo.launch()

  #df['Adjusted Robustness Score'] = df['Adjusted Robustness Score'].round(2)


+ # Blog posts content with images
+ blog_posts = [
+     {"title": "TLDR",
+      "content": '''
+
+ **Why Are Language Models Surprisingly Fragile to Drug Names? 🩺💊**
+
+ Large language models (LLMs) like GPT are transforming medicine with their data-processing and decision-support capabilities. But there's a twist: they seem to be great at memorizing but pretty bad at connecting equivalent concepts, such as a drug's brand and generic names! Researchers behind the RABBITS dataset (Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions for Language Models) set out to evaluate how LLMs perform when brand names (like Advil) are swapped with their generic equivalents (like ibuprofen), and vice versa, in datasets like MedQA and MedMCQA.
+
+ The results? A surprising performance drop of up to 10%! 📉 On average, accuracy dipped by 4% when generic names were replaced with brand names (the generic-to-brand, or g2b, swap). This is concerning because a simple drug name swap could lead to serious medical misinformation and errors. 🚑
+
+ **Why does this happen?**
+
+ 1. **Data Contamination:** One major factor is that LLMs are often trained on datasets that include test data (over 90% of MedQA questions appeared to some extent in the Dolma dataset), leading to inflated performance metrics. When faced with new, unseen data, their performance drops significantly.
+
+ 2. **Memorization Over Understanding:** Larger models like Llama-3-70B show a greater drop in accuracy, suggesting they rely more on memorization than genuine comprehension. For instance, Llama-3-70B's accuracy fell from 76.6% to 69.7% with generic-to-brand swaps.
+
+ **The Study's Findings**
+
+ A key graph in the paper (Figure 2) shows the performance of different models on the original dataset versus the swapped (generic-to-brand) dataset. The dashed diagonal line represents the ideal scenario where synonym swaps don't affect performance. All open-source models with 7B parameters or more fell below this line, indicating decreased performance with drug name swaps.
+
+ ![Performance Graph](file/b4b_tight.png)
+
+
+ **Conclusion**
+
+ The RABBITS study highlights a critical area for improvement in medical AI. While LLMs have enormous potential, they need to be more robust and accurate to avoid the pitfalls of drug name variability. So, the next time you ask an AI about meds, remember: it's still learning the ropes with all those tricky drug names! 🧠💡
+
+ Check out the RABBITS leaderboard on Hugging Face to see how different models stack up!
+ '''},
+     {"title": "Motivation/Problem 🩺",
+      "content": '''
+
+ **Unveiling the Fragility of Language Models to Drug Names 🩺💊**
+
+ ### Motivation and Problem
+
+ Large language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where precision is paramount.
+
+ ### What We Did
+
+ To tackle this problem, we developed a specialized dataset called RABBITS (Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions for Language Models). Here's what we did:
+
+ 1. **Dataset Creation:** We used the RxNorm database to generate a comprehensive list of brand and generic drug pairs. This involved identifying 2,271 generic drugs and mapping them to 6,961 brands.
+
+ 2. **Data Transformation:** Using regular expressions, we created two versions of medical QA datasets (MedQA and MedMCQA): one with brand names swapped to generics and one with generics swapped to brand names.
+
+ 3. **Expert Review:** The transformed datasets were rigorously reviewed by physician experts to ensure accuracy and context consistency.
+
+ 4. **Evaluation:** We evaluated various open-source and API-based LLMs on these transformed datasets using the EleutherAI lm-evaluation-harness in a zero-shot setting. The goal was to measure performance differences when drug names were swapped.
+
+ ![Workflow](file/workflow.png)
+ ### Results
+
+ The performance of the models was summarized in a leaderboard, showcasing the impact of drug name variability on their accuracy. Here's how they ranked:
+
+ | Model               | Original Accuracy | Swapped Accuracy (g2b) | Difference |
+ |---------------------|-------------------|------------------------|------------|
+ | GPT-3.5-turbo-0125  | 97.29%            | 96.86%                 | -0.42%     |
+ | GPT-4o              | 90.36%            | 87.42%                 | -2.94%     |
+ | GPT-4-0613          | 92.00%            | 89.37%                 | -2.63%     |
+ | Gemini 1 Pro        | 69.36%            | 73.44%                 | +4.07%     |
+ | Gemini 1.5 Flash    | 97.25%            | 95.43%                 | -1.82%     |
+ | Llama-3-70B         | 76.64%            | 69.71%                 | -6.93%     |
+ | Mixtral-8x22B-v0.1  | 70.92%            | 64.62%                 | -6.29%     |
+
+ The leaderboard shows that swapping generic names for brand names lowers accuracy for most models, highlighting the need for more robust LLMs in the medical domain.
+
+ ### Conclusion
+
+ The RABBITS dataset sheds light on a critical weakness in current language models' handling of drug names. While LLMs hold great promise, their ability to accurately interpret and respond to drug-related queries still needs significant improvement. Our research underscores the importance of robustness in AI for healthcare to ensure safe and reliable patient support.
+
+ Check out the full RABBITS leaderboard on Hugging Face to see how different models compare!
+ '''},
+     {"title": "DrugMathQA task (b4bqa)",
+      "content": '''
+
+ **Exploring the DrugMathQA Task: Uncovering Hidden Challenges in Language Models 🩺📊**
+
+ ### What We Did
+
+ We introduced the DrugMathQA task (b4bqa) and leveraged the Dolma dataset for detailed analysis. Here's a breakdown of our approach:
+
+ 1. **Creating the DrugMathQA Task (b4bqa):**
+    - We developed a specialized benchmark by transforming existing medical QA datasets (MedQA and MedMCQA) to test LLMs' robustness in understanding drug name synonyms.
+    - Using regular expressions, we swapped brand names with their generic equivalents and vice versa, creating two new datasets: brand-to-generic (b2g) and generic-to-brand (g2b).
+
+ 2. **Dolma Dataset Counting:**
+    - We analyzed the Dolma dataset, a massive collection of 3.1 trillion tokens, to understand how frequently brand and generic drug names appear.
+    - We identified overlaps between drug names in Dolma and the test sets of MedQA and MedMCQA, revealing significant contamination. For instance, 99.21% of MedQA test data and 34.13% of MedMCQA test data overlapped with Dolma.
+
+ ### Results
+
+ Our evaluation focused on comparing LLM performance on the original datasets versus the transformed ones (g2b and b2g). Here's a summary of our findings:
+
+ #### Dataset Analysis
+
+ | Corpus       | Terms   | Average count | Median count | Std. dev. |
+ |--------------|---------|---------------|--------------|-----------|
+ | Dolma        | Generic | 564,151       | 136,682      | 2,399,928 |
+ | Dolma        | Brand   | 234,138       | 698          | 2,543,075 |
+ | Red Pajama   | Generic | 161,227       | 42,549       | 620,393   |
+ | Red Pajama   | Brand   | 29,561        | 84           | 232,661   |
+ | Pile train   | Generic | 96,309        | 28,074       | 325,307   |
+ | Pile train   | Brand   | 4,757         | 19           | 43,613    |
+ | C4 train     | Generic | 27,973        | 5,454        | 144,162   |
+ | C4 train     | Brand   | 9,941         | 26           | 96,504    |
+
+
+ #### Contamination
+
+ | Dataset          | Percentage |
+ |------------------|------------|
+ | MedQA Train      | 86.92%     |
+ | MedQA Val        | 98.10%     |
+ | MedQA Test       | 99.21%     |
+ | MedMCQA Train    | 22.41%     |
+ | MedMCQA Val/Test | 34.13%     |
+
+ **Key Takeaways:**
+
+ - **Performance Drop:** Most models experienced a notable drop in accuracy with the g2b swaps, suggesting that they rely on memorization rather than true comprehension.
+ - **Contamination Impact:** The high overlap of test data with the Dolma dataset likely inflated original performance metrics, masking the true challenges LLMs face in handling drug name variability.
+
+ ### Conclusion
+
+ The DrugMathQA task and our Dolma dataset analysis reveal critical weaknesses in how current language models handle drug names. While LLMs offer significant potential in healthcare, they need to be more robust and accurate to ensure reliable patient support. Our findings underscore the importance of addressing dataset contamination and enhancing LLM robustness to meet the stringent demands of medical applications.
+ '''}
+ ]
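
Note on the "Data Transformation" step described in the posts above: the text says the brand/generic swaps were done with regular expressions but does not show them. The following is a minimal sketch of that idea, not the authors' implementation. The two-entry mapping and the names `build_swapper`, `brand_to_generic`, and `generic_to_brand` are hypothetical and exist only for illustration; the posts state that the real pairs came from RxNorm.

```python
import re

# Hypothetical two-entry mapping for illustration only; the posts say the
# real brand/generic pairs were generated from RxNorm.
BRAND_TO_GENERIC = {"Advil": "ibuprofen", "Tylenol": "acetaminophen"}


def build_swapper(mapping):
    """Return a function that replaces each whole-word key of `mapping` with its value."""
    lookup = {src.lower(): dst for src, dst in mapping.items()}
    # Try longer names first so a multi-word name wins over a shorter prefix.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        flags=re.IGNORECASE,
    )

    def swap(text):
        # Look up the matched name case-insensitively and substitute it.
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

    return swap


brand_to_generic = build_swapper(BRAND_TO_GENERIC)
# A generic can have several brands, so a real g2b swap must choose one;
# this sketch simply inverts the toy mapping.
generic_to_brand = build_swapper({g: b for b, g in BRAND_TO_GENERIC.items()})

print(brand_to_generic("A patient took Tylenol and advil for pain."))
print(generic_to_brand("Give ibuprofen 400 mg"))
```

The word-boundary anchors keep the pattern from rewriting substrings of longer words. A purely mechanical swap can still change meaning in context, which is presumably why the posts mention a physician review of the transformed datasets.
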

          elem_classes="markdown-text"
      )

+     for post in blog_posts:
+         with gr.Accordion(post["title"]):
+             gr.Markdown(post["content"])

      with gr.Row():
          bar3 = gr.Plot(



  if __name__ == "__main__":
+     demo.launch(allowed_paths=["/"])
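
Note on `allowed_paths=["/"]`: this change appears to exist so that the `file/...` image links in the blog posts can be served. Gradio's `allowed_paths` accepts individual file paths as well as directories, so a narrower list may be enough. Passing the filesystem root lets the app serve any file the process can read. The sketch below builds such a list from the two image names referenced in the posts. The helper `image_allowed_paths` is hypothetical, and it assumes the images sit next to `app.py`, which the diff does not confirm.

```python
from pathlib import Path


def image_allowed_paths(app_dir, image_names):
    """Return absolute paths for only the images the blog posts embed."""
    base = Path(app_dir).resolve()
    return [str(base / name) for name in image_names]


# File names taken from the Markdown image links in blog_posts.
BLOG_IMAGES = ["b4b_tight.png", "workflow.png"]

# Assumption: the images are stored in the app's working directory.
paths = image_allowed_paths(".", BLOG_IMAGES)
print(paths)

# In app.py this list could be passed in place of ["/"]:
# demo.launch(allowed_paths=paths)
```

Whether the `file/` links resolve with this narrower list depends on the Gradio version in use, so it should be checked against the deployed Space before replacing the current call.
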