---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---

## **JinaJudge: Proxy Judgement for Russian LLM Arena**

### **Description**

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), designed for faster and more cost-effective evaluation of language models. While the model's focus is on Russian LLM evaluation, it can also be used for English-centric models.

---

### **Model Details**

This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:

- Increased the amount of training data (modestly, by approximately 1.5x).
- Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
- The validation set was also updated to exclude such errors.
- The test set did not change (it contained no erroneous judgements of that kind).

---

### **Evaluation**

The validation process was based on **existing judgements** from the Russian LLM Arena, which were already available. These judgements were filtered and simplified to match the three-class structure used in training.

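
The filtering and simplification step can be pictured with a minimal sketch. The five verdict strings below are an assumption (the arena-hard style labels); this card does not list the raw verdict format. The class ids follow the `judgement_map` in the usage example (0: A is better, 1: tie, 2: B is better), and dropping unparseable verdicts is likewise an assumed filtering rule.

```python
# Assumed raw verdict labels (arena-hard style); not confirmed by this card.
VERDICT_TO_CLASS = {
    "A>>B": 0,  # A is better than B
    "A>B": 0,
    "A=B": 1,   # tie
    "B>A": 2,   # B is better than A
    "B>>A": 2,
}


def simplify_verdicts(verdicts):
    """Map raw verdicts to three classes, dropping anything unrecognised."""
    return [VERDICT_TO_CLASS[v] for v in verdicts if v in VERDICT_TO_CLASS]


print(simplify_verdicts(["A>>B", "A=B", None, "B>A", "error"]))
```

Collapsing the strong and weak preferences into a single class loses the margin information but matches the three-way output of the judge.
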

NOTE: values in parentheses show the change relative to the previous model.

**Models evaluated**:

- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**

**Validation Performance (old validation set)**:

- **Accuracy**: 79.97% (-0.78)
- **Precision**: 78.25% (-0.31)
- **Recall**: 78.25% (-1.23)
- **F1-score**: 78.25% (-0.75)

NOTE: the cause of the drop (the subset of fixed judgements or something else) will be reported later.

**Validation Performance (new validation set)**:

- **Accuracy**: 83.59% (+2.48)
- **Precision**: 80.97% (+2.14)
- **Recall**: 80.97% (+1.22)
- **F1-score**: 80.97% (+1.77)

For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.

**Test Performance**:

- **Accuracy**: 85.09% (+2.37)
- **Precision**: 83.20% (+3.09)
- **Recall**: 83.20% (+0.78)
- **F1-score**: 83.20% (+2.02)

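
For reference, the four metrics reported in this section can be computed for a three-class task roughly as follows. This is a sketch, not the evaluation script behind the numbers above: the averaging mode is not stated in this card, so macro averaging is an assumption, and `three_class_metrics` is a hypothetical helper name.

```python
def three_class_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: 0 = A is better, 1 = tie, 2 = B is better
print(three_class_metrics([0, 0, 1, 2], [0, 1, 1, 2]))
```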

---

### **Usage Example**

```python
from transformers import AutoModel

# trust_remote_code=True loads the custom model code shipped with the checkpoint
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# Take the highest-scoring class for the single input and convert it to a plain int
judgement = jina([example])[0].argmax().item()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A"
}

print(judgement_map[judgement])
```
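
When many pairs need to be judged, the template and label map from the usage example can be wrapped in small helpers. This is only a convenience sketch: `build_examples` and `decode` are hypothetical names, and the batched call shown in the trailing comment assumes the model returns one row of class scores per input, as the single-example call above suggests.

```python
PROMPT_TEMPLATE = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

JUDGEMENT_MAP = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}


def build_examples(pairs):
    """Format (user_prompt, assistant_a, assistant_b) triples for the judge."""
    return [
        PROMPT_TEMPLATE.format(user_prompt=p, assistant_a=a, assistant_b=b)
        for p, a, b in pairs
    ]


def decode(class_ids):
    """Turn predicted class ids into readable verdicts."""
    return [JUDGEMENT_MAP[int(i)] for i in class_ids]


pairs = [("your prompt", "assistant a response", "assistant b response")]
examples = build_examples(pairs)
print(examples[0])

# Assumed batched usage with the model loaded as in the usage example:
# verdicts = decode(jina(examples).argmax(dim=-1))
```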

---

### **Generated ranking**

The ranking was obtained using a modified [Russian LLM Arena code](https://github.com/VikhrModels/ru_llm_arena).
All judgements were regenerated using the jina-judge model. It takes about 16 minutes to regenerate the whole board (or 23 seconds per model) on an RTX 3090.

| Model                                            | Score | 95% CI               | Average #Tokens |
|--------------------------------------------------|-------|----------------------|-----------------|
| gpt-4-1106-preview                               | 82.8  | (-2.2, 2.3)          | 541             |
| gpt-4o-mini                                      | 75.3  | (-2.5, 2.9)          | 448             |
| qwen-2.5-72b-it                                  | 73.1  | (-3.4, 3.1)          | 557             |
| gemma-2-9b-it-sppo-iter3                         | 70.6  | (-3.9, 2.8)          | 509             |
| gemma-2-27b-it                                   | 68.7  | (-2.8, 3.8)          | 472             |
| t-lite-instruct-0.1                              | 67.5  | (-3.8, 3.8)          | 810             |
| gemma-2-9b-it                                    | 67.0  | (-3.7, 3.3)          | 459             |
| suzume-llama-3-8B-multilingual-orpo-borda-half   | 62.4  | (-3.5, 3.7)          | 682             |
| glm-4-9b-chat                                    | 61.5  | (-3.7, 3.0)          | 568             |
| phi-3-medium-4k-instruct                         | 60.4  | (-3.5, 3.7)          | 566             |
| sfr-iterative-dpo-llama-3-8b-r                   | 57.2  | (-3.9, 2.2)          | 516             |
| c4ai-command-r-v01                               | 55.0  | (-3.9, 3.1)          | 529             |
| suzume-llama-3-8b-multilingual                   | 51.9  | (-2.8, 3.7)          | 641             |
| mistral-nemo-instruct-2407                       | 51.9  | (-3.8, 3.7)          | 403             |
| yandex_gpt_pro                                   | 50.3  | (-3.4, 3.1)          | 345             |
| gpt-3.5-turbo-0125                               | 50.0  | (0.0, 0.0)           | 220             |
| hermes-2-theta-llama-3-8b                        | 49.3  | (-3.4, 3.9)          | 485             |
| starling-lm-7b-beta                              | 48.3  | (-3.8, 4.0)          | 629             |
| llama-3-8b-saiga-suzume-ties                     | 47.9  | (-3.9, 5.0)          | 763             |
| llama-3-smaug-8b                                 | 47.6  | (-3.6, 3.1)          | 524             |
| vikhr-it-5.4-fp16-orpo-v2                        | 46.8  | (-2.5, 2.7)          | 379             |
| aya-23-8b                                        | 46.1  | (-3.9, 3.9)          | 554             |
| saiga_llama3_8b_v6                               | 44.8  | (-3.4, 3.3)          | 471             |
| qwen2-7b-instruct                                | 43.6  | (-3.0, 2.7)          | 340             |
| vikhr-it-5.2-fp16-cp                             | 43.6  | (-4.1, 3.3)          | 543             |
| openchat-3.5-0106                                | 42.8  | (-3.9, 3.3)          | 492             |
| kolibri-mistral-0427-upd                         | 42.3  | (-4.2, 3.2)          | 551             |
| paralex-llama-3-8b-sft                           | 41.8  | (-3.2, 3.7)          | 688             |
| llama-3-instruct-8b-sppo-iter3                   | 41.7  | (-3.4, 3.3)          | 502             |
| gpt-3.5-turbo-1106                               | 41.5  | (-2.9, 2.1)          | 191             |
| mistral-7b-instruct-v0.3                         | 41.1  | (-4.3, 3.5)          | 469             |
| gigachat_pro                                     | 40.9  | (-3.4, 3.6)          | 294             |
| openchat-3.6-8b-20240522                         | 39.1  | (-3.2, 4.1)          | 428             |
| vikhr-it-5.3-fp16-32k                            | 38.8  | (-3.5, 3.3)          | 519             |
| hermes-2-pro-llama-3-8b                          | 38.4  | (-3.2, 3.1)          | 463             |
| kolibri-vikhr-mistral-0427                       | 34.5  | (-2.9, 3.5)          | 489             |
| vikhr-it-5.3-fp16                                | 33.5  | (-3.5, 3.8)          | 523             |
| llama-3-instruct-8b-simpo                        | 32.7  | (-3.9, 3.6)          | 417             |
| meta-llama-3-8b-instruct                         | 32.1  | (-3.4, 3.3)          | 450             |
| neural-chat-7b-v3-3                              | 25.9  | (-2.7, 3.6)          | 927             |
| gigachat_lite                                    | 25.4  | (-2.8, 2.5)          | 276             |
| snorkel-mistral-pairrm-dpo                       | 10.3  | (-2.0, 2.3)          | 773             |
| storm-7b                                         | 3.7   | (-1.3, 1.6)          | 419             |

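
This card does not describe how the Score and 95% CI columns are computed; that logic lives in the arena code and may differ from what follows. As a rough illustration only, the sketch below scores a model by its win rate against a baseline (a tie counts as half a win) and estimates a percentile bootstrap interval, reported as offsets from the point estimate like the table above. The function names, the `win`/`tie`/`loss` encoding, and the number of bootstrap rounds are all assumptions.

```python
import random


def win_rate(judgements):
    """Score in percent: a win counts 1, a tie 0.5, a loss 0."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(points[j] for j in judgements) / len(judgements)


def bootstrap_ci(judgements, n_rounds=1000, alpha=0.05, seed=0):
    """Return the point score and (lower, upper) offsets of a percentile bootstrap CI."""
    rng = random.Random(seed)
    # Resample the judgements with replacement and rescore each resample
    scores = sorted(
        win_rate(rng.choices(judgements, k=len(judgements)))
        for _ in range(n_rounds)
    )
    lower = scores[int(n_rounds * alpha / 2)]
    upper = scores[int(n_rounds * (1 - alpha / 2)) - 1]
    point = win_rate(judgements)
    return point, (lower - point, upper - point)


# Toy data: outcomes of one model against the baseline on 100 prompts
toy = ["win"] * 45 + ["tie"] * 20 + ["loss"] * 35
score, (low, high) = bootstrap_ci(toy)
print(f"{score:.1f} ({low:+.1f}, {high:+.1f})")
```

Under this scheme a model compared with itself would always score 50.0 with a zero-width interval if every comparison is a tie, which is one way to read the `gpt-3.5-turbo-0125` row.
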