The full dataset curation pipeline is in curate_dataset.py in this repository.
The script combines the following sources:
- AlicanKiraz0/All-CVE-Records-Training-Dataset (10K samples)
- m-a-p/Code-Feedback (5K samples)
- nvidia/OpenCodeReasoning (5K samples)
- Synthetic cybersecurity examples (JSON output, AST (abstract syntax tree), GDB (GNU Debugger), ROP (return-oriented programming))
Run with: python curate_dataset.py
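
The mixing step described above might look roughly like the sketch below. It is not taken from curate_dataset.py: the function name `mix_sources`, the `QUOTAS` table, the added `source` field, and the seeded shuffle are all hypothetical, and records are assumed to be plain dicts already loaded into memory (for example via the Hugging Face `datasets` library). The quotas mirror the sample counts listed above; the synthetic examples have no stated count, so the sketch assumes that a source without a quota is kept whole.

```python
import random
from typing import Dict, Iterable, List

# Per-source sample quotas, mirroring the counts in the list above.
# Sources missing from this table (e.g. the synthetic examples) are
# kept in full; that behaviour is an assumption of this sketch.
QUOTAS: Dict[str, int] = {
    "AlicanKiraz0/All-CVE-Records-Training-Dataset": 10_000,
    "m-a-p/Code-Feedback": 5_000,
    "nvidia/OpenCodeReasoning": 5_000,
}


def mix_sources(
    sources: Dict[str, Iterable[dict]],
    quotas: Dict[str, int],
    seed: int = 0,
) -> List[dict]:
    """Subsample each source to its quota, tag records, and shuffle.

    Hypothetical helper: `sources` maps a source name to its records.
    """
    rng = random.Random(seed)  # local RNG so runs are reproducible
    mixed: List[dict] = []
    for name, records in sources.items():
        pool = list(records)
        # Never request more records than the source actually holds.
        k = min(quotas.get(name, len(pool)), len(pool))
        for record in rng.sample(pool, k):
            # Copy the record and remember which source it came from.
            mixed.append({**record, "source": name})
    rng.shuffle(mixed)  # interleave sources instead of concatenating
    return mixed
```

Tagging each record with its source makes it easy to check the final mix or to rebalance later, and a fixed seed keeps the curated set reproducible across runs.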