# Mass Evaluations

A simple benchmark tool that runs predefined prompts through all checkpoints of a model.
## Usage

```bash
python benchmark.py [model_name] [options]
```
### Examples

```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1

# Specify a custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/

# Use a custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```
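For orientation, here is a minimal sketch of the command-line interface implied by the examples above; the defaults and help strings are assumptions, and the actual parser in `benchmark.py` may differ:

```python
import argparse

def parse_args():
    # Mirrors the CLI shown in the examples; defaults here are assumptions.
    parser = argparse.ArgumentParser(
        description="Run predefined prompts through all checkpoints of a model."
    )
    parser.add_argument("model_name", help="Model whose checkpoints will be benchmarked")
    parser.add_argument("--output", default="results/", help="Directory for Markdown reports")
    parser.add_argument("--prompts", default="prompts.json", help="JSON file with prompt strings")
    return parser.parse_args()
```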
## Managing Prompts

Prompts are stored in `prompts.json` as a simple array of strings:

```json
[
  "Hello, how are you?",
  "Complete this story: Once upon a time",
  "What is the capital of France?"
]
```
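Since the file is a flat JSON array, loading it takes only a few lines. A minimal sketch, assuming the loader validates that every entry is a string (the actual code in `benchmark.py` may differ):

```python
import json
from pathlib import Path

def load_prompts(path: str = "prompts.json") -> list[str]:
    """Load a flat JSON array of prompt strings."""
    prompts = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        raise ValueError(f"{path} must contain a JSON array of strings")
    return prompts
```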
### Adding New Prompts

Edit `prompts.json` and add new prompt strings to the array.
## Features

- Auto-discovery: Finds all `step_*` checkpoints automatically (see the sketch after this list)
- JSON-based prompts: Easily customizable prompts via a JSON file
- Readable output: Markdown reports with clear structure
- Error handling: Continues on failures, logs errors
- Progress tracking: Shows real-time progress
- Metadata logging: Includes generation time and parameters
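Together, these features suggest a main loop shaped roughly like the sketch below. This is illustrative only: the checkpoint directory layout, the `generate` callable, and the result fields are assumptions, not the actual implementation.

```python
import glob
import re
import time
from typing import Callable

def find_checkpoints(model_dir: str) -> list[str]:
    # Auto-discovery: collect step_* checkpoint directories, sorted
    # numerically so that step_2000 sorts before step_10000.
    paths = glob.glob(f"{model_dir}/step_*")
    return sorted(paths, key=lambda p: int(re.search(r"step_(\d+)", p).group(1)))

def run_benchmark(model_dir: str, prompts: list[str],
                  generate: Callable[[str, str], str]) -> list[dict]:
    # `generate(checkpoint, prompt)` is a stand-in for the real model call.
    results = []
    checkpoints = find_checkpoints(model_dir)
    for i, ckpt in enumerate(checkpoints, start=1):
        print(f"[{i}/{len(checkpoints)}] {ckpt}")  # progress tracking
        for prompt in prompts:
            start = time.perf_counter()
            try:
                completion, error = generate(ckpt, prompt), None
            except Exception as exc:
                # Error handling: continue on failure, record the error
                completion, error = None, str(exc)
            results.append({
                "checkpoint": ckpt,
                "prompt": prompt,
                "completion": completion,
                "error": error,
                "generation_time_s": time.perf_counter() - start,  # metadata
            })
    return results
```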
## Output

Results are saved as Markdown files in the `results/` directory:

```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```
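The timestamped filenames follow a `<model>_benchmark_<YYYYMMDD_HHMMSS>.md` pattern, which can be reproduced along these lines (creating the output directory on demand is an assumption):

```python
from datetime import datetime
from pathlib import Path

def report_path(model_name: str, output_dir: str = "results/") -> Path:
    # e.g. results/pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return out / f"{model_name}_benchmark_{stamp}.md"
```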
## Predefined Prompts
- "Hello, how are you?" (conversational)
- "Complete this story: Once upon a time" (creative)
- "Explain quantum physics in simple terms" (explanatory)
- "Write a haiku about coding" (creative + structured)
- "What is the capital of France?" (factual)
- "The meaning of life is" (philosophical)
- "In the year 2050," (futuristic)
- "Python programming is" (technical)