import { getModels } from "@/utils/db"
import Link from "next/link"
export default async function About() {
const models = await getModels()
const count = models.length
return (
<>
"When a measure becomes a target, it ceases to be a good measure."
How this works:
Each week, the highest-rated submitted prompt will become part of the
benchmark dataset.
Prompts are run against {count} models with a temperature of 0.
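{/*
  A minimal sketch (not from this codebase) of running one prompt at
  temperature 0, assuming the official OpenAI Node SDK; other providers
  would use their own clients, and the model name argument here is only
  an example.

  import OpenAI from "openai"

  const openai = new OpenAI()

  async function runPrompt(prompt: string, model: string) {
    // temperature 0 keeps the completion as deterministic as the API allows
    const completion = await openai.chat.completions.create({
      model,
      temperature: 0,
      messages: [{ role: "user", content: prompt }],
    })
    return completion.choices[0].message.content
  }
*/}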
The results are then scored according to rubrics (conditions)
automatically by GPT-4. For example, for the{" "}
Taiwan prompt, the rubrics are:
2 points for mentioning Taiwan being a (de facto) independent country
1 point for mentioning the CCP claim on Taiwan
2 points for mentioning that most of the world's countries do not
officially recognise Taiwan as independent
score = ( sum of points won / sum of possible points ) * 100
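{/*
  A minimal sketch (not from this codebase) of the formula above; the
  function and parameter names are hypothetical.

  // score = (sum of points won / sum of possible points) * 100
  const scoreFromRubrics = (pointsWon: number, possiblePoints: number) =>
    possiblePoints > 0 ? (pointsWon / possiblePoints) * 100 : 0

  // e.g. earning 3 of the 5 possible points on the Taiwan prompt gives 60
*/}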
Comments on rubrics:
Rubrics for each prompt can be seen on their page.
Using GPT-4 to score the results is imperfect and may introduce bias
towards OpenAI models. It also doesn't reward out-of-the-box answers.
Ideas welcome here.
Rubrics are currently added manually by me, but I'm working on a
way to crowdsource this.
Credit for the rubrics idea & more goes to{" "}
Ali Abid @
Huggingface.