# Mass Evaluations

A simple benchmark tool that runs predefined prompts through all checkpoints of a model.
## Usage

```bash
python benchmark.py [model_name] [options]
```
### Examples

```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1

# Specify a custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/

# Use a custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```
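For orientation, here is a minimal sketch of the command-line interface implied by the examples above; the defaults and help strings are assumptions, and the actual parser in `benchmark.py` may differ:

```python
import argparse

def parse_args():
    # Mirrors the CLI shown in the examples; defaults here are assumptions.
    parser = argparse.ArgumentParser(
        description="Run predefined prompts through all checkpoints of a model."
    )
    parser.add_argument("model_name", help="Model whose checkpoints will be benchmarked")
    parser.add_argument("--output", default="results/", help="Directory for Markdown reports")
    parser.add_argument("--prompts", default="prompts.json", help="JSON file with prompt strings")
    return parser.parse_args()
```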
## Managing Prompts

Prompts are stored in `prompts.json` as a simple array of strings:

```json
[
  "Hello, how are you?",
  "Complete this story: Once upon a time",
  "What is the capital of France?"
]
```
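Since the file is a flat JSON array, loading it takes only a few lines. A minimal sketch, assuming the loader validates that every entry is a string (the actual code in `benchmark.py` may differ):

```python
import json
from pathlib import Path

def load_prompts(path: str = "prompts.json") -> list[str]:
    """Load a flat JSON array of prompt strings."""
    prompts = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        raise ValueError(f"{path} must contain a JSON array of strings")
    return prompts
```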
### Adding New Prompts

Edit `prompts.json` and add new prompt strings to the array.
## Features

- Auto-discovery: Finds all `step_*` checkpoints automatically (see the sketch after this list)
- JSON-based prompts: Easily customizable prompts via a JSON file
- Readable output: Markdown reports with clear structure
- Error handling: Continues on failures, logs errors
- Progress tracking: Shows real-time progress
- Metadata logging: Includes generation time and parameters
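Together, these features suggest a main loop shaped roughly like the sketch below. This is illustrative only: the checkpoint directory layout, the `generate` callable, and the result fields are assumptions, not the actual implementation.

```python
import glob
import re
import time
from typing import Callable

def find_checkpoints(model_dir: str) -> list[str]:
    # Auto-discovery: collect step_* checkpoint directories, sorted
    # numerically so that step_2000 sorts before step_10000.
    paths = glob.glob(f"{model_dir}/step_*")
    return sorted(paths, key=lambda p: int(re.search(r"step_(\d+)", p).group(1)))

def run_benchmark(model_dir: str, prompts: list[str],
                  generate: Callable[[str, str], str]) -> list[dict]:
    # `generate(checkpoint, prompt)` is a stand-in for the real model call.
    results = []
    checkpoints = find_checkpoints(model_dir)
    for i, ckpt in enumerate(checkpoints, start=1):
        print(f"[{i}/{len(checkpoints)}] {ckpt}")  # progress tracking
        for prompt in prompts:
            start = time.perf_counter()
            try:
                completion, error = generate(ckpt, prompt), None
            except Exception as exc:
                # Error handling: continue on failure, record the error
                completion, error = None, str(exc)
            results.append({
                "checkpoint": ckpt,
                "prompt": prompt,
                "completion": completion,
                "error": error,
                "generation_time_s": time.perf_counter() - start,  # metadata
            })
    return results
```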
## Output

Results are saved as Markdown files in the `results/` directory:

```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```
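The timestamped filenames follow a `<model>_benchmark_<YYYYMMDD_HHMMSS>.md` pattern, which can be reproduced along these lines (creating the output directory on demand is an assumption):

```python
from datetime import datetime
from pathlib import Path

def report_path(model_name: str, output_dir: str = "results/") -> Path:
    # e.g. results/pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return out / f"{model_name}_benchmark_{stamp}.md"
```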
## Predefined Prompts
- "Hello, how are you?" (conversational)
- "Complete this story: Once upon a time" (creative)
- "Explain quantum physics in simple terms" (explanatory)
- "Write a haiku about coding" (creative + structured)
- "What is the capital of France?" (factual)
- "The meaning of life is" (philosophical)
- "In the year 2050," (futuristic)
- "Python programming is" (technical)