---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---

## **JinaJudge: Proxy Judgement for Russian LLM Arena**

### **Description**

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. While the model focuses on Russian LLM evaluation, it can also be used for English-centric models.

---

### **Model Details**

This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:
- Increased the amount of training data (modestly, by approximately 1.5x).
- Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
- Updated the validation set as well, to exclude such errors.
- Left the test set unchanged (it contained no such erroneous judgements).

---

### **Evaluation**

Validation was based on **existing judgements** from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training (A wins, tie, B wins).

NOTE: values in parentheses show the change relative to the previous model.

**Models evaluated**:
- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**

**Validation Performance (old validation set)**:
- **Accuracy**: 79.97% (-0.78)
- **Precision**: 78.25% (-0.31)
- **Recall**: 78.25% (-1.23)
- **F1-score**: 78.25% (-0.75)

NOTE: the cause of this drop (the corrected subset of judgements or something else) will be reported later.

**Validation Performance (new validation set)**:
- **Accuracy**: 83.59% (+2.48)
- **Precision**: 80.97% (+2.14)
- **Recall**: 80.97% (+1.22)
- **F1-score**: 80.97% (+1.77)

For the **test** phase, new judgements were generated with GPT-4 for the `kolibri-mistral-0427-upd` model.

**Test Performance**:
- **Accuracy**: 85.09% (+2.37)
- **Precision**: 83.20% (+3.09)
- **Recall**: 83.20% (+0.78)
- **F1-score**: 83.20% (+2.02)

---

### **Usage Example**

```python
from transformers import AutoModel

# Load the judge (custom code on the Hub, hence trust_remote_code=True).
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

# The judge scores the user prompt followed by the two assistant responses.
prompt_template = """
{user_prompt}
{assistant_a}
{assistant_b}
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns three-class scores; argmax picks the verdict.
judgement = jina([example])[0].argmax()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
```
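For arena-style comparisons you will usually want to judge many prompt/response pairs rather than a single example. The sketch below continues from the usage example above (it reuses `jina` and `prompt_template`); the `pairs` list, the batch size, and the rate aggregation are illustrative assumptions, not part of the model's documented API.

```python
from collections import Counter

# Illustrative data: (user prompt, model A response, model B response) triples.
pairs = [
    ("your prompt 1", "response from model A", "response from model B"),
    ("your prompt 2", "response from model A", "response from model B"),
]

examples = [
    prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
    for p, a, b in pairs
]

# Judge in small batches, assuming the model accepts a list of strings
# as in the single-example call above.
batch_size = 8
verdicts = []
for i in range(0, len(examples), batch_size):
    outputs = jina(examples[i:i + batch_size])
    verdicts.extend(int(out.argmax()) for out in outputs)

# Aggregate the three verdict classes into win/tie rates.
counts = Counter(verdicts)
total = len(verdicts)
print(f"A wins: {counts[0] / total:.1%}")
print(f"Ties:   {counts[1] / total:.1%}")
print(f"B wins: {counts[2] / total:.1%}")
```

If order effects are a concern, a common arena practice is to judge each pair twice with the responses swapped and aggregate both verdicts.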
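The card does not state how the reported precision, recall, and F1 are averaged over the three classes. Purely as an illustration of how agreement with GPT-4 reference judgements can be measured, here is a sketch assuming macro averaging with scikit-learn; the label arrays are placeholders, not real evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder verdicts (0 = A wins, 1 = tie, 2 = B wins).
gpt4_labels = [0, 1, 2, 0, 2]   # reference verdicts from GPT-4
proxy_labels = [0, 1, 2, 1, 2]  # verdicts produced by this judge

accuracy = accuracy_score(gpt4_labels, proxy_labels)
# Macro averaging is an assumption here; the card does not specify it.
precision, recall, f1, _ = precision_recall_fscore_support(
    gpt4_labels, proxy_labels, average="macro", zero_division=0
)
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-score:  {f1:.2%}")
```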