Limited Evaluation Capabilities
The model seems to be a good small LLM-as-a-Judge overall from what I can tell so far. Yet it only seems to perform well when evaluating question-answer pairs or comparing a generated answer to a ground truth answer, which is an important but very limited aspect of evaluation.
The model struggles when, for example, tasked to evaluate whether the retrieved context is relevant for answering the question, or whether the answer correctly incorporates the retrieved context. It fails completely when asked to evaluate whether the information contained in the retrieved chunks is similar to the information contained in the ground truth chunks.
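To make the failure mode concrete, here is a minimal sketch of the kind of chunk-vs-chunk judge prompt I mean; the template wording and helper names are my own illustration, not from Atla's docs, and the actual call to the model is left out since it depends on how you serve it.

```python
# Sketch of a chunk-vs-chunk similarity judge prompt (illustrative only).
# The template and function names are placeholders; how the prompt is sent
# to the judge model depends on your own serving setup.

JUDGE_TEMPLATE = """You are an impartial evaluator.

Compare the information in the RETRIEVED chunks with the information in the
GROUND TRUTH chunks. Judge only informational overlap, not wording.

RETRIEVED CHUNKS:
{retrieved}

GROUND TRUTH CHUNKS:
{ground_truth}

Return a score from 1 (no overlap) to 5 (same information) and a short
justification, formatted as:
Score: <1-5>
Reasoning: <one or two sentences>"""


def build_chunk_similarity_prompt(retrieved: list[str], ground_truth: list[str]) -> str:
    """Format retrieved and ground-truth chunks into a single judge prompt."""
    return JUDGE_TEMPLATE.format(
        retrieved="\n---\n".join(retrieved),
        ground_truth="\n---\n".join(ground_truth),
    )


if __name__ == "__main__":
    prompt = build_chunk_similarity_prompt(
        retrieved=["The Eiffel Tower is 330 m tall and located in Paris."],
        ground_truth=["The Eiffel Tower, in Paris, stands about 330 metres high."],
    )
    print(prompt)  # send this to the judge model via your own inference stack
```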
From what I can tell, the model wasn't trained on these types of evaluation tasks, so such prompts are out-of-distribution (OOD) for it, which is a pity, as it limits the model's capabilities and usefulness.
The model also fails when comparing and evaluating ground truth chunks against retrieved entities/relationships and their descriptions in GraphRAG scenarios.
All these aspects are very important: shipping a system that was evaluated only on the question and its answer is way too unreliable in practice.
Hi @h4rz3rk4s3, we appreciate the thoughtful feedback!
Would love to know more about the exact use case you tried our model with. We're looking to learn more about how to improve the next generation of our models. Would you be open to sharing more? Feel free to email me about this as well: maurice@atla-ai.com
You are correct that we have not trained the model for the retrieved chunks vs. ground truth chunks scenario you describe, so your assessment does not surprise me there. I am more puzzled by the struggles you found when evaluating whether the answer correctly incorporates the retrieved context. This is a task we do specifically train for, and we have seen users get really good results on it. Would love to understand better where we fell short for you there.
Excited to hear from you!
Maurice