open-llm-leaderboard/open_llm_leaderboard · How to evaluate hellaswag with LLM?

Jun 13, 2023

Hi,
I'm trying to evaluate an LLM on HellaSwag.
But I have a question about how to use an autoregressive model with this dataset.
The task in the HellaSwag dataset is to find the proper ending for each ctx.

My idea was to feed ctx to the language model as input and generate the next sentence, then compare the generated sentence with each of the endings and pick the most similar ending.

But this approach didn't work well (very low accuracy).
Does anyone have an idea how to evaluate the HellaSwag dataset with a language model?

clefourrier

Open LLM Leaderboard org Jun 16, 2023

Hi!
In general, you can use the EleutherAI Harness (lm-evaluation-harness) almost plug-and-play to evaluate any HF model on the Hub, and I suggest you look at how they handle evaluation to better understand how this works 🤗
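
For a rough idea of what to look for in the harness: as far as I know, it does not generate text for HellaSwag. It treats each example as multiple choice, scores every ending by the log-likelihood the model assigns to it given ctx, and picks the highest-scoring ending (with a length-normalized variant as well). Below is a minimal sketch of that idea, not the harness's actual code. `pick_ending`, `accuracy`, and `toy_loglikelihood` are hypothetical names, the `label` field is assumed to hold the index of the correct ending, and the toy scorer is only a stand-in for a real model's log P(ending | ctx).

```python
from typing import Callable, Sequence

# Assumed signature: returns log P(continuation | context) under some language model.
LogLikelihoodFn = Callable[[str, str], float]


def pick_ending(ctx: str, endings: Sequence[str], loglikelihood: LogLikelihoodFn,
                length_normalize: bool = False) -> int:
    """Return the index of the ending with the highest (optionally normalized) score."""
    scores = []
    for ending in endings:
        # Score the ending as a continuation of the context; nothing is generated.
        score = loglikelihood(ctx, " " + ending)
        if length_normalize:
            # Divide by character length so longer endings are not penalized
            # just for having more tokens.
            score /= max(len(ending), 1)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[dict], loglikelihood: LogLikelihoodFn,
             length_normalize: bool = False) -> float:
    """Fraction of examples where the top-scoring ending matches the label."""
    correct = sum(
        pick_ending(ex["ctx"], ex["endings"], loglikelihood, length_normalize)
        == int(ex["label"])
        for ex in examples
    )
    return correct / len(examples)


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Stand-in scorer (NOT a real model): it ignores the context and gives
    a fixed log-prob per word. Replace it with the summed token log-probs
    of the continuation from your causal LM."""
    likely = {"she", "pours", "the", "batter", "into", "a", "pan"}
    return sum(-1.0 if w in likely else -5.0 for w in continuation.lower().split())


# Made-up example in the ctx / endings / label layout (not taken from the dataset).
example = {
    "ctx": "A woman mixes cake batter in a bowl.",
    "endings": [
        "she pours the batter into a pan",
        "she rides a horse across the field",
        "the bowl sings a song loudly",
        "she paints the fence blue",
    ],
    "label": "0",
}

print(pick_ending(example["ctx"], example["endings"], toy_loglikelihood))
print(accuracy([example], toy_loglikelihood, length_normalize=True))
```

With a real model, the scorer would sum the log-probabilities of the ending's tokens conditioned on ctx. This tends to work much better than comparing free-form generations with the endings, because the model only has to rank four given candidates rather than reproduce one of them.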

clefourrier changed discussion status to closed Jun 16, 2023