Synthetic evaluation hypothesis

by DmitriSS - opened

Q* and Microsoft's Orca 2 used synthetic data (generated by GPT-4) to build efficient LLMs that outcompete larger ones.

Given that, could we use an LLM for synthetic evaluation of other LLMs?

-Perhaps choose the most capable LLM (GPT-4) as the judge, or screen the generated challenge through several models.

  • Use GPT-4 (or a human) to generate a challenge or request for two or more LLMs.
  • Use GPT-4 to evaluate and rate the responses, choose the best ones, and build synthetic leaderboards.
  • Consider making the models critique each other and argue why a given choice was superior, and even evaluating that reasoning.
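The loop above can be sketched in a few lines. This is a minimal, hypothetical scaffold: `call_llm` is a placeholder for whatever chat-completion API you use (e.g. GPT-4), and the function names are my own, not from any real library.

```python
# Sketch of an LLM-as-judge pipeline; call_llm is a stub to be
# replaced with a real chat-completion API call (e.g. GPT-4).
from collections import defaultdict

def call_llm(prompt: str) -> str:
    # Placeholder: wire this up to an actual model endpoint.
    raise NotImplementedError

def judge(challenge, answers, ask=call_llm):
    """Ask the judge model to pick the best of several candidate answers.

    answers: dict mapping a model label to its response text.
    Returns the winning label, or None if the verdict is unparseable.
    """
    listing = "\n".join(f"[{label}] {text}" for label, text in answers.items())
    prompt = (
        f"Challenge: {challenge}\n\n"
        f"Candidate answers:\n{listing}\n\n"
        "Reply with only the label of the best answer."
    )
    verdict = ask(prompt).strip().strip("[]")
    return verdict if verdict in answers else None

def leaderboard(winners):
    """Tally a list of per-challenge winners into a synthetic leaderboard."""
    wins = defaultdict(int)
    for w in winners:
        if w is not None:
            wins[w] += 1
    return sorted(wins.items(), key=lambda kv: -kv[1])
```

In practice you would run `judge` over many generated challenges and feed the winners into `leaderboard`; swapping the answer order per trial helps control for the judge's position bias.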

-Consider fine-tuning a model for this purpose, or at least prompting it for specific judging functions.
-Could the collected evaluations then be used as data for fine-tuning new models?

Downsides: setup effort, experimentation time, token cost.

I apologise, I wrote this on a whim before reading the actual study...
