import { getModels } from "@/utils/db"
import Link from "next/link"
export default async function About() {
const models = await getModels()
const count = models.length
return (
<>
"When a measure becomes a target, it ceases to be a good measure."
How this works:
Each week, the highest-rated submitted prompt will become part of the
benchmark dataset.
Prompts are run against {count} models with a temperature of 0.
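{/*
  A minimal sketch (not from this codebase) of running one prompt at
  temperature 0, assuming the official OpenAI Node SDK; other providers
  would use their own clients, and the model name argument here is only
  an example.

  import OpenAI from "openai"

  const openai = new OpenAI()

  async function runPrompt(prompt: string, model: string) {
    // temperature 0 keeps the completion as deterministic as the API allows
    const completion = await openai.chat.completions.create({
      model,
      temperature: 0,
      messages: [{ role: "user", content: prompt }],
    })
    return completion.choices[0].message.content
  }
*/}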
The results are then scored according to rubrics (conditions)
automatically by GPT-4. For example, for the{" "}
Taiwan prompt, the rubrics are:
2 points for mentioning Taiwan being a (de facto) independent country
1 point for mentioning the CCP claim on Taiwan
2 points for mentioning that most of the world's countries do not
officially recognise Taiwan as independent
score = ( sum of points won / sum of possible points ) * 100
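{/*
  A minimal sketch (not from this codebase) of the formula above; the
  function and parameter names are hypothetical.

  // score = (sum of points won / sum of possible points) * 100
  const scoreFromRubrics = (pointsWon: number, possiblePoints: number) =>
    possiblePoints > 0 ? (pointsWon / possiblePoints) * 100 : 0

  // e.g. earning 3 of the 5 possible points on the Taiwan prompt gives 60
*/}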
Comments on rubrics:
Rubrics for each prompt can be seen on their page.
Using GPT-4 to score the results is imperfect and may introduce bias
towards OpenAI models. It also doesn't reward out-of-the-box answers.
Ideas welcome here.
Rubrics are currently added manually by me, but I'm working on a
way to crowdsource this.
Credit for the rubrics idea & more goes to{" "}
Ali Abid @
Huggingface.