---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- biology
- protocol
- benchmark
- ai4science
---

<div align="center">

# BioProBench

[arXiv](https://arxiv.org/pdf/2505.07889)
[Hugging Face](https://huggingface.co/BioProBench)
[GitHub](https://github.com/YuyangSunshine/bioprotocolbench)
[Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

</div>

---

## 📢 Latest News

* ✨ **[2026-03-31]** **Data Split Update!** We have officially released the **Train/Test splits** for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
* 🔥 **[2026-03-18]** Our **BioProAgent** is now live on AI4S LAB! [Try it out and order wet-lab experiments here](https://yuyangsunshine.github.io/BioPro-Project/).
* 🎉 **[2026-03-03]** Our BioProAgent has been accepted by the **ICLR 2026 LLA Workshop**!
* 📄 **[2026-03-01]** The preprint of our **BioProAgent** paper is available on [arXiv](https://arxiv.org/pdf/2505.07889).
* 📝 **[2026-01-21]** The BioProBench paper has been updated with new experimental results.
* 🚀 **[2025-12-01]** Code and dataset (v1.0) are released on GitHub.

---

## 🌟 Introduction

**BioProBench** is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in the life sciences.

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/overview.png?raw=true" alt="BioProBench Overview" width="1000"/>
</div>

### Key Features

* 📊 **Large-scale Data:** Built upon **27K original biological protocols**, yielding nearly **556K high-quality structured instances**.
* 🎯 **Comprehensive Tasks:** 5 core tasks: **PQA** (Question Answering), **ORD** (Step Ordering), **ERR** (Error Correction), **GEN** (Generation), and **REA** (Reasoning).
* 🧬 **Broad Domain Coverage:** Covers **16 biological subdomains** from 6 major repositories.
* 🔬 **Standardized Evaluation:** A robust framework combining NLP metrics with novel domain-specific measures.

---

## 📊 Dataset Structure & Tasks

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/samples.jpg?raw=true" alt="BioProBench Samples" width="1000"/>
</div>

We provide standardized JSON files for each task, now including **Train** and **Test** splits:

| Task | Description | Files |
| :--- | :--- | :--- |
| **PQA** | Protocol Question Answering | `PQA_train.json`, `PQA_test.json` |
| **ORD** | Step Ordering | `ORD_train.json`, `ORD_test.json` |
| **ERR** | Error Correction | `ERR_train.json`, `ERR_test.json` |
| **GEN** | Protocol Generation | `GEN_train.json`, `GEN_test.json` |
| **Raw** | Full Protocol Corpus | `Bio-protocol.json`, `Protocol-io.json`, etc. |

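Since each split is a plain JSON file, loading one takes only a few lines of standard-library code. A minimal sketch — the record fields below are hypothetical placeholders, so check the actual files for the real schema:

```python
import json
from pathlib import Path

# Hypothetical record, only to make the sketch self-contained;
# the real PQA schema may use different field names.
mock_split = [
    {"question": "Which buffer is added after cell lysis?",
     "options": ["PBS", "Wash buffer", "Elution buffer", "TE"],
     "answer": "Wash buffer"},
]
Path("PQA_train.json").write_text(json.dumps(mock_split))

# Loading a task split is a plain JSON read:
records = json.loads(Path("PQA_train.json").read_text())
print(len(records))  # → 1 (number of instances in the split)
```

The same pattern applies to every `*_train.json` / `*_test.json` file in the table above.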
### 🔗 Useful Links

* **Official Website:** [BioPro-Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
* **GitHub Repository:** [bioprotocolbench](https://github.com/YuyangSunshine/bioprotocolbench) (code for evaluation & training)

---

## 🔬 Key Findings

We evaluated 12 mainstream LLMs. Our findings reveal:

* **Surface vs. Deep Understanding:** Models perform well on QA (~70% accuracy) but struggle with deep procedural logic.
* **Reasoning Bottleneck:** Performance drops significantly on **Step Ordering** and **Protocol Generation** (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
* **Bio-specific Models:** Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
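The step-ordering gap can be made concrete with a rank-correlation score. This is only an illustrative metric, not necessarily the one used in our evaluation framework: Kendall's tau over predicted versus gold step positions, with the step names below invented for the example.

```python
from itertools import combinations

def kendall_tau(pred, gold):
    """Rank correlation between a predicted and a gold step order.

    Both arguments are permutations of the same step labels
    (at least two steps). Returns 1.0 for a perfect ordering,
    -1.0 for a complete reversal.
    """
    pos = {step: i for i, step in enumerate(pred)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):  # gold lists steps in true order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

gold = ["lyse cells", "spin down", "collect supernatant", "elute"]
print(kendall_tau(gold, gold))        # → 1.0
print(kendall_tau(gold[::-1], gold))  # → -1.0
```

A model that gets most pairwise precedences right but misplaces one step scores between these extremes, which is exactly the kind of partial credit an ordering task needs.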
---

## 🤝 Contributing & Contact

We welcome contributions such as new protocol sources, additional domains, or novel tasks!

- **Email:** sunshineliuyuyang@gmail.com
- **Issues:** Feel free to open an issue on our [GitHub](https://github.com/YuyangSunshine/bioprotocolbench).

## 📜 Citation

```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```