jang1563 committed
Commit 73bdf88 · verified · 1 Parent(s): 3bb53a8

docs(readme): replace em dashes with cleaner punctuation

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -7,7 +7,7 @@
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
 
-**Biological Reinforcement Learning from Human Feedback** — A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
+**Biological Reinforcement Learning from Human Feedback**: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
 
 ## Highlights
 
@@ -16,7 +16,7 @@
 - **+19% reward improvement** over SFT baseline using GRPO (0.650 vs 0.547)
 - **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
 - **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
-- **Learns from 363 examples** — efficient domain adaptation from spaceflight transcriptomics data
+- **Learns from 363 examples**: efficient domain adaptation from spaceflight transcriptomics data
 
 ## Key Results
 
@@ -197,8 +197,8 @@ reward = composer.score(question, response, ground_truth)
 Training data is derived from a 2x2x2 factorial transcriptomic study:
 
 - **Drug**: Kaempferol (KMP) vs Control
-- **Stressor 1**: Hindlimb Unloading (HU) — simulates microgravity
-- **Stressor 2**: Ionizing Radiation (IR) — simulates space radiation
+- **Stressor 1**: Hindlimb Unloading (HU): simulates microgravity
+- **Stressor 2**: Ionizing Radiation (IR): simulates space radiation
 - **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)
 
 ### Training Example Types
@@ -289,15 +289,15 @@ BioRLHF/
 
 ## Key Learnings for AI Safety
 
-1. **Honesty is trainable** — Models can learn appropriate epistemic humility
-2. **Domain grounding matters** — Anchoring to experimental truth prevents hallucination
-3. **Multi-reward > single reward** — Decomposing correctness into verifiable dimensions improves learning signal
-4. **Preference learning is fragile** — DPO can catastrophically forget domain knowledge
-5. **Evaluation drives improvement** — Systematic testing reveals specific failure modes
+1. **Honesty is trainable**: Models can learn appropriate epistemic humility
+2. **Domain grounding matters**: Anchoring to experimental truth prevents hallucination
+3. **Multi-reward > single reward**: Decomposing correctness into verifiable dimensions improves learning signal
+4. **Preference learning is fragile**: DPO can catastrophically forget domain knowledge
+5. **Evaluation drives improvement**: Systematic testing reveals specific failure modes
 
 ## Related Projects
 
-- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)** — 115-question benchmark for LLMs on spaceflight biomedical data
+- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)**: 115-question benchmark for LLMs on spaceflight biomedical data
 
 ## Citation
 
@@ -319,7 +319,7 @@ Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for gui
 
 ## License
 
-This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.
 
+This project is licensed under the MIT License: see the [LICENSE](LICENSE) file for details.
 
 ---
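The hunk context above references a reward-composition call, `reward = composer.score(question, response, ground_truth)`, tied to the README's "Multi-reward > single reward" learning (decomposing correctness into verifiable dimensions). Below is a minimal, hypothetical sketch of what such a composer could look like; the class name `RewardComposer`, the verifier names, and the weights are illustrative assumptions, not BioRLHF's actual API.

```python
# Hypothetical sketch of verifier-based reward composition.
# Names and weights are assumptions for illustration, not the BioRLHF API.
from dataclasses import dataclass, field
from typing import Callable, Dict

# A verifier maps (question, response, ground_truth) to a score in [0, 1].
Verifier = Callable[[str, str, str], float]


@dataclass
class RewardComposer:
    """Combines several verifier dimensions into one scalar reward."""

    verifiers: Dict[str, Verifier] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def add(self, name: str, verifier: Verifier, weight: float = 1.0) -> None:
        self.verifiers[name] = verifier
        self.weights[name] = weight

    def score(self, question: str, response: str, ground_truth: str) -> float:
        # Weighted average over all registered verifier dimensions.
        total = sum(self.weights.values())
        return sum(
            self.weights[name] * verify(question, response, ground_truth)
            for name, verify in self.verifiers.items()
        ) / total


# Toy verifiers: substring-match factual accuracy and a crude hedging check.
composer = RewardComposer()
composer.add("accuracy", lambda q, r, gt: 1.0 if gt.lower() in r.lower() else 0.0, weight=2.0)
composer.add("hedging", lambda q, r, gt: 1.0 if "likely" in r.lower() else 0.5, weight=1.0)

reward = composer.score(
    "Which tissue showed the strongest KMP response?",
    "The liver likely showed the strongest response.",
    "liver",
)
print(round(reward, 3))  # (2*1.0 + 1*1.0) / 3 = 1.0
```

Keeping each dimension as a separate verifier is what makes the learning signal decomposable: a low composite reward can be traced back to the specific dimension (accuracy, calibration, reasoning) that failed.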