BioReasoning Data Curation
Jupyter notebooks for processing genetic variant data and creating ML datasets for biological reasoning tasks.
Notebooks
Core Analysis
BioReasoning_DataCuration_KEGG.ipynb- KEGG pathway analysis with Claude APIClinvar_Coding.ipynb- ClinVar variant processing and gene mappingClinvar_SNV_Non_SNV.ipynb- SNV/structural variant datasets with VEP annotations
KEGG Pipeline
KEGG_Data_1.ipynb- KEGG network data processing and variant identificationKEGG_Data_2.ipynb- Variant parsing and sequence generationKEGG_Data_3.ipynb- Final ML dataset creation with Q&A pairs
Variant Prediction
VEP.ipynb- Variant effect prediction datasets (ClinVar, OMIM, eQTL)
Setup
brew install brewsci/bio/edirect # For ClinVar (macOS)
export ANTHROPIC_API_KEY="your-key" # For KEGG analysis
Usage
Each notebook has a configuration section - update paths/keys as needed, then run sequentially.
Key Outputs:
- KEGG biological reasoning datasets
- ClinVar variant-disease associations
- VEP prediction task datasets
- Genomic sequences with variant context