Synthetic evaluation hypothesis

Q* (reportedly) and MS Orca 2 used synthetic data (GPT-4) to build efficient LLMs that outcompete larger ones.
Given this,

Could we utilise an LLM for synthetic evaluation of other LLMs?

- Perhaps choose the most dominant LLM (GPT-4) as the judge, or screen the generated challenge message through several models.

  • Use GPT-4 or a human to generate a challenge or request for two or more LLMs.
  • Use GPT-4 to evaluate, rate and choose the best responses, and create synthetic leaderboards from the results.
  • Consider making the models critique each other and argue why choice X was superior, and even evaluate that reasoning.
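
The steps above could be sketched roughly like this. It is only a minimal sketch: `Model`, `Judge`, `build_leaderboard` and `length_judge` are names made up for illustration, the "models" are plain Python callables rather than real API calls, and the stand-in judge simply prefers the longer answer. In a real setup the judge would wrap a call to GPT-4 (or to several models). Each pair is judged in both orders, in case the judge favours whichever answer it sees first.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures, assumed for this sketch only:
#   model(prompt) returns the model's answer as text
#   judge(prompt, answer_a, answer_b) returns "A" or "B"
Model = Callable[[str], str]
Judge = Callable[[str, str, str], str]


def build_leaderboard(
    models: Dict[str, Model], prompts: List[str], judge: Judge
) -> List[Tuple[str, int]]:
    """Count pairwise wins per model and return them sorted, best first."""
    wins = {name: 0 for name in models}
    for prompt in prompts:
        # Collect one answer per model for this challenge.
        answers = {name: model(prompt) for name, model in models.items()}
        for a, b in combinations(models, 2):
            # Judge each pair twice, swapping which answer is shown first.
            for first, second in ((a, b), (b, a)):
                verdict = judge(prompt, answers[first], answers[second])
                winner = first if verdict == "A" else second
                wins[winner] += 1
    return sorted(wins.items(), key=lambda item: item[1], reverse=True)


def length_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Stand-in judge: prefers the longer answer. Replace with an LLM call."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


# Stand-in "models"; real ones would call an LLM.
models = {
    "terse": lambda p: "ok",
    "verbose": lambda p: "a much longer answer to: " + p,
}
board = build_leaderboard(models, ["Explain recursion."], length_judge)
print(board)  # [('verbose', 2), ('terse', 0)]
```

Raw win counts are the simplest possible ranking; a rating scheme could be swapped in later without changing the judging loop.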

- Consider fine-tuning the judge for the purpose, or at least pre-prompting it for specific functions.
- Use the collected data for fine-tuning new models?
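
If the judgements were to be reused for fine-tuning, each one could be stored as a preference record. A minimal sketch, assuming a JSON Lines file and the field names `prompt`, `chosen`, `rejected` and `rationale` (these names and the helpers `make_record` and `to_jsonl` are my own picks for illustration, not a fixed standard):

```python
import json
from typing import Dict, Iterable


def make_record(
    prompt: str, answer_a: str, answer_b: str, verdict: str, rationale: str = ""
) -> Dict[str, str]:
    """Turn one judge verdict ("A" or "B") into a preference record."""
    if verdict not in ("A", "B"):
        raise ValueError("verdict must be 'A' or 'B'")
    chosen, rejected = (answer_a, answer_b) if verdict == "A" else (answer_b, answer_a)
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "rationale": rationale,  # the judge's argument, if it gave one
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = make_record(
    "Explain recursion.",
    "ok",
    "A function that calls itself on a smaller input.",
    "B",
    rationale="Answer B actually explains the concept.",
)
jsonl_text = to_jsonl([record])
print(jsonl_text)
```

Keeping the rationale next to each verdict means the judge's reasoning can be reviewed or filtered before any of it is used as training data.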

Downsides: setup, experimentation time, token cost.
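
To get a feel for the token cost, here is a back-of-the-envelope sketch. With n models there are n(n-1)/2 pairs per challenge, doubled if each pair is judged in both orders. The function names are made up for illustration, and the tokens-per-call and price figures in the example are placeholders rather than real pricing.

```python
def judge_calls(n_models: int, n_prompts: int, both_orders: bool = True) -> int:
    """Number of judge calls for all-pairs comparison on every prompt."""
    pairs = n_models * (n_models - 1) // 2
    return pairs * n_prompts * (2 if both_orders else 1)


def estimated_cost(calls: int, tokens_per_call: int, price_per_1k_tokens: float) -> float:
    """Rough judging cost; ignores the cost of generating the answers."""
    return calls * tokens_per_call / 1000 * price_per_1k_tokens


calls = judge_calls(n_models=5, n_prompts=100)
print(calls)  # 2000

# Placeholder numbers, NOT real pricing: 1500 tokens per call, 0.03 per 1k tokens.
print(estimated_cost(calls, tokens_per_call=1500, price_per_1k_tokens=0.03))
```

The number of calls grows quadratically with the number of models, so large leaderboards would probably need sampled pairs rather than all pairs.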

I apologise, I wrote this on a whim before I had read the actual study...

