jang1563 committed
Commit 73bdf88 · verified · 1 Parent(s): 3bb53a8

docs(readme): replace em dashes with cleaner punctuation

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -7,7 +7,7 @@
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
 
-**Biological Reinforcement Learning from Human Feedback** — A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
+**Biological Reinforcement Learning from Human Feedback**: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
 
 ## Highlights
 
@@ -16,7 +16,7 @@
 - **+19% reward improvement** over SFT baseline using GRPO (0.650 vs 0.547)
 - **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
 - **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
-- **Learns from 363 examples** — efficient domain adaptation from spaceflight transcriptomics data
+- **Learns from 363 examples**: efficient domain adaptation from spaceflight transcriptomics data
 
 ## Key Results
 
@@ -197,8 +197,8 @@ reward = composer.score(question, response, ground_truth)
 Training data is derived from a 2x2x2 factorial transcriptomic study:
 
 - **Drug**: Kaempferol (KMP) vs Control
-- **Stressor 1**: Hindlimb Unloading (HU) — simulates microgravity
-- **Stressor 2**: Ionizing Radiation (IR) — simulates space radiation
+- **Stressor 1**: Hindlimb Unloading (HU): simulates microgravity
+- **Stressor 2**: Ionizing Radiation (IR): simulates space radiation
 - **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)
 
 ### Training Example Types
@@ -289,15 +289,15 @@ BioRLHF/
 
 ## Key Learnings for AI Safety
 
-1. **Honesty is trainable** — Models can learn appropriate epistemic humility
-2. **Domain grounding matters** — Anchoring to experimental truth prevents hallucination
-3. **Multi-reward > single reward** — Decomposing correctness into verifiable dimensions improves learning signal
-4. **Preference learning is fragile** — DPO can catastrophically forget domain knowledge
-5. **Evaluation drives improvement** — Systematic testing reveals specific failure modes
+1. **Honesty is trainable**: Models can learn appropriate epistemic humility
+2. **Domain grounding matters**: Anchoring to experimental truth prevents hallucination
+3. **Multi-reward > single reward**: Decomposing correctness into verifiable dimensions improves learning signal
+4. **Preference learning is fragile**: DPO can catastrophically forget domain knowledge
+5. **Evaluation drives improvement**: Systematic testing reveals specific failure modes
 
 ## Related Projects
 
-- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)** — 115-question benchmark for LLMs on spaceflight biomedical data
+- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)**: 115-question benchmark for LLMs on spaceflight biomedical data
 
 ## Citation
 
@@ -319,7 +319,7 @@ Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for gui
 
 ## License
 
-This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.
 
+This project is licensed under the MIT License: see the [LICENSE](LICENSE) file for details.
 
 ---
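The hunk context above references a reward-composition call, `reward = composer.score(question, response, ground_truth)`, tied to the README's "Multi-reward > single reward" learning (decomposing correctness into verifiable dimensions). Below is a minimal, hypothetical sketch of what such a composer could look like; the class name `RewardComposer`, the verifier names, and the weights are illustrative assumptions, not BioRLHF's actual API.

```python
# Hypothetical sketch of verifier-based reward composition.
# Names and weights are assumptions for illustration, not the BioRLHF API.
from dataclasses import dataclass, field
from typing import Callable, Dict

# A verifier maps (question, response, ground_truth) to a score in [0, 1].
Verifier = Callable[[str, str, str], float]


@dataclass
class RewardComposer:
    """Combines several verifier dimensions into one scalar reward."""

    verifiers: Dict[str, Verifier] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def add(self, name: str, verifier: Verifier, weight: float = 1.0) -> None:
        self.verifiers[name] = verifier
        self.weights[name] = weight

    def score(self, question: str, response: str, ground_truth: str) -> float:
        # Weighted average over all registered verifier dimensions.
        total = sum(self.weights.values())
        return sum(
            self.weights[name] * verify(question, response, ground_truth)
            for name, verify in self.verifiers.items()
        ) / total


# Toy verifiers: substring-match factual accuracy and a crude hedging check.
composer = RewardComposer()
composer.add("accuracy", lambda q, r, gt: 1.0 if gt.lower() in r.lower() else 0.0, weight=2.0)
composer.add("hedging", lambda q, r, gt: 1.0 if "likely" in r.lower() else 0.5, weight=1.0)

reward = composer.score(
    "Which tissue showed the strongest KMP response?",
    "The liver likely showed the strongest response.",
    "liver",
)
print(round(reward, 3))  # (2*1.0 + 1*1.0) / 3 = 1.0
```

Keeping each dimension as a separate verifier is what makes the learning signal decomposable: a low composite reward can be traced back to the specific dimension (accuracy, calibration, reasoning) that failed.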