A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

Published January 12, 2024
Update on GitHub

Hugging Face’s Open LLM Leaderboard (originally created by Ed Beeching and Lewis Tunstall, and maintained by Nathan Habib and Clémentine Fourrier) is well known for tracking the performance of open-source LLMs, comparing them on a variety of tasks such as TruthfulQA or HellaSwag.

This has been of tremendous value to the open-source community, as it provides a way for practitioners to keep track of the best open-source models.

In late 2023, at Vectara we introduced the Hughes Hallucination Evaluation Model (HHEM), an open-source model for measuring the extent to which an LLM hallucinates (generates text that is nonsensical or unfaithful to the provided source content). Covering open-source models like Llama 2 and Mistral 7B as well as commercial models like OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini, this model highlighted the stark differences that currently exist between models in their likelihood to hallucinate.

As we continued to add new models to HHEM, we were looking for an open-source solution to manage and update the HHEM leaderboard.

Quite recently, the Hugging Face leaderboard team released leaderboard templates (here and here). These are lightweight versions of the Open LLM Leaderboard itself, which are both open-source and simpler to use than the original code.

Today we’re happy to announce the release of the new HHEM leaderboard, powered by the HF leaderboard template.

Vectara’s Hughes Hallucination Evaluation Model (HHEM)

The Hughes Hallucination Evaluation Model (HHEM) Leaderboard is dedicated to assessing the frequency of hallucinations in document summaries generated by Large Language Models (LLMs) such as GPT-4, Google Gemini, or Meta’s Llama 2. To use it, you can follow the instructions here.

By doing an open-source release of this model, we at Vectara aim to democratize the evaluation of LLM hallucinations and raise awareness of the differences that exist between LLMs in their propensity to hallucinate.

Our initial release of HHEM was a Hugging Face model alongside a GitHub repository, but we quickly realized that we needed a mechanism to allow new types of models to be evaluated. Using the HF leaderboard code template, we were able to quickly put together a new leaderboard that allows for dynamic updates, and we encourage the LLM community to submit new relevant models for HHEM evaluation.


On a meaningful side note to us here at Vectara, HHEM was named after our peer Simon Hughes, who passed away unexpectedly of natural causes in November 2023; we decided to name the model in his honor due to his lasting legacy in this space.

Setting up HHEM with the LLM leaderboard template

To set up the Vectara HHEM leaderboard, we had to follow a few steps, adjusting the HF leaderboard template code to our needs:

  1. After cloning the space repository to our own organization, we created two associated datasets: “requests” and “results”. These datasets hold, respectively, the requests submitted by users for new LLMs to evaluate, and the results of those evaluations (a minimal sketch of creating these datasets follows this list).
  2. We populated the results dataset with existing results from the initial launch, and updated the “About” and “Citations” sections.
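
For reference, both companion datasets can be created programmatically with the `huggingface_hub` client. The snippet below is a minimal sketch with placeholder repository names, not the exact commands we ran:

```python
# Minimal sketch: create the "requests" and "results" datasets that back the
# leaderboard. The repository names are placeholders for your own organization.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are authenticated (huggingface-cli login or HF_TOKEN)

for repo_id in ["your-org/leaderboard-requests", "your-org/leaderboard-results"]:
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
```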

For a simple leaderboard, where evaluation results are pushed by your backend to the results dataset, that’s all you need!
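
As an illustration, a backend can push a single result as a JSON file to the results dataset. The field names and file layout below are hypothetical, so check the schema your leaderboard front-end expects before adapting it:

```python
# Minimal sketch: push one evaluation result to the results dataset so that it
# shows up on the leaderboard. Field names and paths here are illustrative only.
import json
from huggingface_hub import HfApi

result = {
    "model": "some-org/some-model",  # model that was evaluated (placeholder)
    "hallucination_rate": 0.05,      # example value, not a real measurement
    "answer_rate": 0.99,             # example value, not a real measurement
}

local_path = "some-org__some-model.json"
with open(local_path, "w") as f:
    json.dump(result, f, indent=2)

HfApi().upload_file(
    path_or_fileobj=local_path,
    path_in_repo=f"results/{local_path}",
    repo_id="your-org/leaderboard-results",  # the results dataset created above
    repo_type="dataset",
)
```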

As our evaluation is more complex, we then customized the source code to fit the needs of the HHEM leaderboard - here are the details:

  1. leaderboard/src/backend/model_operations.py: This file contains two primary classes, SummaryGenerator and EvaluationModel.
     a. The SummaryGenerator generates summaries based on the HHEM private evaluation dataset and calculates metrics like Answer Rate and Average Summary Length.
     b. The EvaluationModel loads our proprietary Hughes Hallucination Evaluation Model (HHEM) to assess these summaries, yielding metrics such as Factual Consistency Rate and Hallucination Rate.
  2. leaderboard/src/backend/evaluate_model.py: defines the Evaluator class which utilizes both SummaryGenerator and EvaluationModel to compute and return results in JSON format.
  3. leaderboard/src/backend/run_eval_suite.py: contains a function run_evaluation that leverages Evaluator to obtain and upload evaluation results to the results dataset mentioned above, causing them to appear in the leaderboard.
  4. leaderboard/main_backend.py: Manages pending evaluation requests and executes auto evaluations using the aforementioned classes and functions. It also includes an option for users to replicate our evaluation results. (A simplified sketch of how these pieces fit together follows this list.)
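
To make the division of responsibilities concrete, here is a heavily simplified, hypothetical sketch of how these pieces fit together. The class and method bodies are placeholders, not the actual HHEM implementation, and the real code in the repository differs in its details:

```python
# Hypothetical sketch of the backend flow; placeholder bodies only.
class SummaryGenerator:
    """Generates summaries of the evaluation documents with the model under test."""
    def __init__(self, model_id: str):
        self.model_id = model_id

    def generate(self, documents: list[str]) -> list[str]:
        # Placeholder: call the LLM being evaluated for each source document.
        return ["<summary>" for _ in documents]


class EvaluationModel:
    """Scores each (document, summary) pair for factual consistency."""
    def score(self, documents: list[str], summaries: list[str]) -> dict:
        # Placeholder: run the HHEM classifier and aggregate the scores.
        return {"factual_consistency_rate": 0.95, "hallucination_rate": 0.05}


class Evaluator:
    """Ties generation and evaluation together and returns JSON-ready results."""
    def __init__(self, model_id: str):
        self.generator = SummaryGenerator(model_id)
        self.judge = EvaluationModel()

    def evaluate(self, documents: list[str]) -> dict:
        summaries = self.generator.generate(documents)
        metrics = self.judge.score(documents, summaries)
        return {"model": self.generator.model_id, **metrics}


def run_evaluation(model_id: str, documents: list[str]) -> dict:
    # In run_eval_suite.py, the returned dict is then uploaded to the results
    # dataset (see the upload sketch earlier in this post), which is what makes
    # the new entry appear on the leaderboard.
    return Evaluator(model_id).evaluate(documents)
```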

The final source code is available in the Files tab of our HHEM leaderboard repository. With all these changes, we now have the evaluation pipeline ready to go, and easily deployable as a Hugging Face Space.

Summary

The HHEM is a novel classification model that can be used to evaluate the extent to which LLMs hallucinate. Using the Hugging Face leaderboard template gave us much-needed support for two requirements common to any leaderboard: managing the submission of new model evaluation requests, and updating the leaderboard as new results emerge.

Big kudos to the Hugging Face team for making this valuable framework open-source, and supporting the Vectara team in the implementation. We expect this code to be reused by other community members who aim to publish other types of LLM leaderboards.

If you want to contribute new models to HHEM, please submit them on the leaderboard - we very much appreciate any suggestions for new models to evaluate.

And if you have any questions about the Hugging Face LLM front-end or Vectara, please feel free to reach out in the Vectara or Hugging Face forums.