How to evaluate HellaSwag with an LLM?

#66
by ryusangwon - opened

Hi,
I'm trying to evaluate an LLM on HellaSwag.
But I have a question about how to use an autoregressive model with this dataset.
The HellaSwag task is to find the proper ending for each ctx.

My idea was to feed ctx to a language model as input and generate the next sentence, then compute the similarity between the generated sentence and each of the endings and pick the most similar ending.

But this approach didn't work well (very low accuracy).
Does anyone have an idea of how to evaluate the HellaSwag dataset with a language model?

Open LLM Leaderboard org

Hi!
In general, you can use the EleutherAI LM Evaluation Harness almost plug-and-play to evaluate any HF model on the Hub, and I suggest you look at how they handle evaluation to better understand how this works 🤗
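
To give a rough idea of the usual approach for multiple-choice tasks like this one: rather than generating text and comparing it to the endings, the model scores each candidate ending by its log-likelihood given ctx, and the highest-scoring ending is the prediction. The sketch below is only an illustration of that idea, not the harness's actual code. `pick_ending`, `toy_logprob`, and the example data are hypothetical names made up here, and `toy_logprob` is a stand-in for a real model call that would return the summed token log-probabilities of the ending conditioned on ctx. Dividing by the ending's character length is one common way to reduce the bias toward short endings.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: returns log P(continuation | context).
# With a real LM this would be the sum of token log-probs of the continuation.
LogProbFn = Callable[[str, str], float]


def pick_ending(ctx: str, endings: Sequence[str], logprob_fn: LogProbFn,
                normalize: bool = True) -> int:
    """Return the index of the ending with the highest (optionally
    length-normalized) log-likelihood given the context."""
    scores = []
    for ending in endings:
        continuation = " " + ending  # endings follow the context after a space
        score = logprob_fn(ctx, continuation)
        if normalize:
            # Assumption: normalize by character length of the continuation.
            score /= max(len(continuation), 1)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


def toy_logprob(ctx: str, continuation: str) -> float:
    """Toy stand-in for a language model (NOT a real LM): words that already
    appear in the context get probability 0.5, all other words 0.05."""
    ctx_words = set(ctx.lower().split())
    return sum(math.log(0.5 if word in ctx_words else 0.05)
               for word in continuation.lower().split())


# Made-up example in the shape of a HellaSwag item (ctx + four endings).
ctx = "A man is playing a guitar on stage"
endings = [
    "he eats the stage lights",
    "the man keeps playing the guitar",
    "a dog flies away quickly",
    "she paints a fence",
]

pred_norm = pick_ending(ctx, endings, toy_logprob, normalize=True)
pred_raw = pick_ending(ctx, endings, toy_logprob, normalize=False)
print("normalized prediction:", pred_norm, "| raw prediction:", pred_raw)
```

With this toy scorer the raw score prefers the shortest ending while the length-normalized score prefers the longer, more plausible one, which is why both variants are worth checking. To use a real model, `toy_logprob` would be replaced by a function that runs ctx plus the ending through the model and sums the log-probabilities of the ending tokens only.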

clefourrier changed discussion status to closed
