m-ric
posted an update Mar 19
Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation

Evaluating LLM outputs is often hard, since many tasks call for open-ended answers that no deterministic metric can score: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is human evaluation, but it is very time-consuming and therefore costly.

🤔 Then why not ask another LLM to do the evaluation, by providing it with relevant rating criteria? 👉 This is the idea behind LLM-as-a-judge.
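
To make the idea concrete, here is a minimal sketch of a judge for the summarization example. It is not the code from the notebook: the prompt wording, the 1 to 4 scale, and the names `JUDGE_PROMPT`, `build_judge_prompt`, `parse_rating`, `judge` and `fake_llm` are all assumptions made for illustration. The judge model is passed in as a plain function, so any client can be plugged in.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template: the wording and the 1-4 scale are
# illustrative assumptions, not the prompt used in the cookbook.
JUDGE_PROMPT = """You will be given a text and a summary of it.
Rate the summary on a scale of 1 to 4 using these criteria:
1: misses the main point or contains errors
2: partially covers the main point
3: covers the main point with minor omissions
4: accurate, complete and concise

Give your reasoning first, then finish with a line of the form 'Rating: <number>'.

Text: {text}
Summary: {summary}
"""


def build_judge_prompt(text: str, summary: str) -> str:
    """Fill the rating criteria template with the item to grade."""
    return JUDGE_PROMPT.format(text=text, summary=summary)


def parse_rating(judge_output: str, low: int = 1, high: int = 4) -> Optional[int]:
    """Extract the integer after 'Rating:'; return None if absent or out of range."""
    match = re.search(r"Rating:\s*(\d+)", judge_output)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if low <= rating <= high else None


def judge(text: str, summary: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model (any function mapping a prompt to a reply) for a rating."""
    return parse_rating(call_llm(build_judge_prompt(text, summary)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the code."""
    return "The summary keeps the key idea but drops one detail.\nRating: 3"


rating = judge("Some long article...", "A short summary.", fake_llm)
```

Returning `None` for a missing or out-of-range rating keeps malformed judge replies from silently polluting the scores; in practice you would swap `fake_llm` for a call to the model of your choice.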

⚙️ To implement an LLM judge correctly, you need a few tricks.
✅ So I've just published a new notebook showing how to implement it properly in our Hugging Face Cookbook! (you can run it instantly in Google Colab)
➡️ LLM-as-a-judge cookbook: https://huggingface.co/learn/cookbook/llm_judge

The Cookbook is a great collection of notebooks demonstrating recipes (hence the name "cookbook") for common LLM use cases. I recommend taking a look!
➡️ All cookbooks: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!

LLM-as-a-judge is really helpful when creating a DPO dataset, since it lets us determine which of two responses is better.

Really cool project!