# Mass Evaluations
A simple benchmark tool that runs a set of predefined prompts through every checkpoint of a model.
## Usage
```bash
python benchmark.py [model_name] [options]
```
## Examples
```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1
# Specify custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/
# Use custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```
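For orientation, here is a minimal sketch of how a command line like the one above could be parsed with `argparse`. It is not taken from `benchmark.py`: the function name `build_parser`, the help texts, and the default values are assumptions chosen to match the examples in this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in this README (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="benchmark.py",
        description="Run predefined prompts through all checkpoints of a model.",
    )
    # Positional model name, e.g. pico-decoder-tiny-dolma5M-v1
    parser.add_argument("model_name", help="name of the model to benchmark")
    # Assumed default: the results/ directory described in the Output section
    parser.add_argument("--output", default="results/", help="output directory")
    # Assumed default: the prompts.json file described under Managing Prompts
    parser.add_argument("--prompts", default="prompts.json", help="prompts JSON file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model_name, args.output, args.prompts)
```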
## Managing Prompts
Prompts are stored in `prompts.json` as a simple array of strings:
```json
[
"Hello, how are you?",
"Complete this story: Once upon a time",
"What is the capital of France?"
]
```
### Adding New Prompts
Edit `prompts.json` and add new prompt strings to the array. Separate entries with commas so the file stays valid JSON.
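If you want to check an edited prompts file before starting a long run, something like the following sketch may help. The helper name `load_prompts` is hypothetical and not part of `benchmark.py`; it only assumes the format described above, a flat JSON array of strings.

```python
import json
from pathlib import Path


def load_prompts(path):
    """Load a prompts file and verify it is a flat JSON array of strings."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(p, str) for p in data):
        raise ValueError(f"{path} must contain a JSON array of strings")
    return data


if __name__ == "__main__":
    # Prints the number of prompts, or raises if the file is malformed
    print(len(load_prompts("prompts.json")))
```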
## Features
- **Auto-discovery**: Finds all `step_*` checkpoints automatically
- **JSON-based prompts**: Easily customizable prompts via JSON file
- **Readable output**: Markdown reports with clear structure
- **Error handling**: Continues on failures, logs errors
- **Progress tracking**: Shows real-time progress
- **Metadata logging**: Includes generation time and parameters
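The auto-discovery and error-handling features above can be sketched roughly as follows. This is an illustration, not the code in `benchmark.py`: the helper names `find_checkpoints` and `run_all`, the assumption that checkpoints are directories named `step_<number>` inside one model directory, and the `generate` callback are all hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: step_<integer>, e.g. step_1000
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def find_checkpoints(model_dir):
    """Return step_* directories under model_dir, sorted by step number."""
    found = []
    for entry in Path(model_dir).iterdir():
        match = STEP_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            found.append((int(match.group(1)), entry))
    # Sort numerically so step_2 comes before step_10
    return [path for _, path in sorted(found, key=lambda item: item[0])]


def run_all(checkpoints, prompts, generate):
    """Run every prompt on every checkpoint, recording failures instead of stopping."""
    results, errors = [], []
    for ckpt in checkpoints:
        for prompt in prompts:
            try:
                results.append((ckpt, prompt, generate(ckpt, prompt)))
            except Exception as exc:
                # Log the error and continue with the next prompt or checkpoint
                errors.append((ckpt, prompt, str(exc)))
    return results, errors
```

Sorting by the numeric suffix rather than by name keeps the report in training order, since a plain string sort would place `step_10` before `step_2`.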
## Output
Results are saved as Markdown files in the `results/` directory (or the directory passed to `--output`):
```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```
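The file names in the listing above appear to follow the pattern `<model_name>_benchmark_<YYYYMMDD>_<HHMMSS>.md`. A minimal sketch of building such a path is shown below. The function name `report_path` and its parameters are hypothetical, and the pattern is inferred from the listing rather than taken from `benchmark.py`.

```python
from datetime import datetime
from pathlib import Path


def report_path(model_name, output_dir="results", now=None):
    """Build a timestamped report path like the ones shown in the listing."""
    # Use the supplied time if given (handy for testing), else the current time
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{model_name}_benchmark_{stamp}.md"


if __name__ == "__main__":
    print(report_path("pico-decoder-tiny-dolma5M-v1"))
```

Including the timestamp in the name means repeated runs on the same model do not overwrite earlier reports.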
## Predefined Prompts
1. "Hello, how are you?" (conversational)
2. "Complete this story: Once upon a time" (creative)
3. "Explain quantum physics in simple terms" (explanatory)
4. "Write a haiku about coding" (creative + structured)
5. "What is the capital of France?" (factual)
6. "The meaning of life is" (philosophical)
7. "In the year 2050," (futuristic)
8. "Python programming is" (technical)