|
# Mass Evaluations |
|
|
|
Simple benchmark tool for running predefined prompts through all checkpoints of a model. |
|
|
|
## Usage |
|
|
|
```bash |
|
python benchmark.py [model_name] [options] |
|
``` |
|
|
|
## Examples |
|
|
|
```bash |
|
# Benchmark all checkpoints of a model |
|
python benchmark.py pico-decoder-tiny-dolma5M-v1 |
|
|
|
# Specify custom output directory |
|
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/ |
|
|
|
# Use custom prompts file |
|
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json |
|
``` |
|
|
|
## Managing Prompts |
|
|
|
Prompts are stored in `prompts.json` as a simple array of strings: |
|
|
|
```json |
|
[ |
|
"Hello, how are you?", |
|
"Complete this story: Once upon a time", |
|
"What is the capital of France?" |
|
] |
|
``` |
|
|
|
### Adding New Prompts |
|
|
|
Simply edit `prompts.json` and add new prompt strings to the array. Super simple! |
|
|
|
## Features |
|
|
|
- **Auto-discovery**: Finds all `step_*` checkpoints automatically |
|
- **JSON-based prompts**: Easily customizable prompts via JSON file |
|
- **Readable output**: Markdown reports with clear structure |
|
- **Error handling**: Continues on failures, logs errors |
|
- **Progress tracking**: Shows real-time progress |
|
- **Metadata logging**: Includes generation time and parameters |
|
|
|
## Output |
|
|
|
Results are saved as markdown files in `results/` directory: |
|
``` |
|
results/ |
|
βββ pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md |
|
βββ pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md |
|
βββ ... |
|
``` |
|
|
|
## Predefined Prompts |
|
|
|
1. "Hello, how are you?" (conversational) |
|
2. "Complete this story: Once upon a time" (creative) |
|
3. "Explain quantum physics in simple terms" (explanatory) |
|
4. "Write a haiku about coding" (creative + structured) |
|
5. "What is the capital of France?" (factual) |
|
6. "The meaning of life is" (philosophical) |
|
7. "In the year 2050," (futuristic) |
|
8. "Python programming is" (technical) |
|
|