	---
	license: apache-2.0
	language:
	- ru
	- en
	base_model:
	- jinaai/jina-embeddings-v3
	---

	## JinaJudge: Proxy Judgement for Russian LLM Arena

	### Description
	This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. Although the model focuses on Russian LLM evaluation, it can also be used for English-centric models.

	---

	### Model Details

	This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:
	- Increased the amount of training data (not by much, approximately 1.5x).
	- Updated data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
	- Validation set was updated as well to exclude such errors.
	- Test set did not change (no bad judgements in that regard).

	---

	### Evaluation
	Validation used existing judgements from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.
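
The card does not spell out the original label set or the filtering rules. As a minimal sketch, assuming the arena judgements use five Arena-Hard-style verdicts (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`) and that missing or unparsable verdicts are dropped, the simplification to three classes might look like the following. The name `to_three_class` and the label strings are hypothetical; the class indices follow the `judgement_map` in the usage example.

```python
# Hypothetical mapping from five-level verdicts to the three classes
# (0: A is better, 1: tie, 2: B is better). The label strings are assumed.
FIVE_TO_THREE = {
    "A>>B": 0,
    "A>B": 0,
    "A=B": 1,
    "B>A": 2,
    "B>>A": 2,
}


def to_three_class(verdicts):
    """Drop missing or unrecognised verdicts and map the rest to 0/1/2."""
    return [FIVE_TO_THREE[v] for v in verdicts if v in FIVE_TO_THREE]


raw = ["A>>B", "B>A", None, "A=B", "error", "A>B"]
print(to_three_class(raw))  # [0, 2, 1, 0]
```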

	NOTE: values in parentheses show the change compared to the previous model.

	Models evaluated:
	- gemma-2-9b-it-sppo-iter3
	- glm-4-9b-chat
	- gpt-3.5-turbo-1106
	- mistral-7b-instruct-v0.3
	- storm-7b

	Validation Performance (old validation set):
	- Accuracy: 79.97% (-0.78)
	- Precision: 78.25% (-0.31)
	- Recall: 78.25% (-1.23)
	- F1-score: 78.25% (-0.75)

	NOTE: the cause of the drop (the subset of fixed judgements or something else) will be reported later.

	Validation Performance (new validation set):
	- Accuracy: 83.59% (+2.48)
	- Precision: 80.97% (+2.14)
	- Recall: 80.97% (+1.22)
	- F1-score: 80.97% (+1.77)

	For the test phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.

	Test Performance:
	- Accuracy: 85.09% (+2.37)
	- Precision: 83.20% (+3.09)
	- Recall: 83.20% (+0.78)
	- F1-score: 83.20% (+2.02)
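
The card does not state how precision, recall and F1 are averaged over the three classes. The sketch below shows one common choice, macro averaging, in plain Python; the function name `three_class_metrics` and the toy labels are illustrative and are not the author's evaluation code.

```python
def three_class_metrics(y_true, y_pred, num_classes=3):
    """Return accuracy and macro-averaged precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Toy labels: 0 = A is better, 1 = tie, 2 = B is better.
reference = [0, 0, 1, 2, 2, 2]   # e.g. GPT-4 judgements
predicted = [0, 1, 1, 2, 2, 0]   # e.g. proxy judge output
print(three_class_metrics(reference, predicted))
```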

	---

	### Usage Example

	```python
	from transformers import AutoModel

	# Load the judge; trust_remote_code=True lets transformers run the model's custom code.
	jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

	prompt_template = """
	<user prompt>
	{user_prompt}
	<end>
	<assistant A answer>
	{assistant_a}
	<end>
	<assistant B answer>
	{assistant_b}
	<end>
	""".strip()

	user_prompt = "your prompt"
	assistant_a = "assistant a response"
	assistant_b = "assistant b response"

	# Fill the template with the prompt and the two responses to compare.
	example = prompt_template.format(
	    user_prompt=user_prompt,
	    assistant_a=assistant_a,
	    assistant_b=assistant_b,
	)

	# The model returns three class scores per example; take the best class
	# and convert it to a plain int so it can be used as a dictionary key.
	judgement = int(jina([example])[0].argmax())

	judgement_map = {
	    0: "A is better than B",
	    1: "A == B",
	    2: "B is better than A",
	}

	print(judgement_map[judgement])
	```
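
To judge many pairs at once, the same template can be applied to a list of triples. This is a minimal sketch rather than part of the model's API: `judge_batch` and `dummy_score_fn` are hypothetical names, and the scorer is assumed to be a callable that takes a list of strings and returns three class scores per string, as the loaded model does in the example above. A stand-in scorer is used so the snippet runs without downloading the model.

```python
PROMPT_TEMPLATE = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

JUDGEMENT_MAP = {0: "A is better than B", 1: "A == B", 2: "B is better than A"}


def judge_batch(score_fn, triples):
    """Format (prompt, answer A, answer B) triples and label each one.

    `score_fn` is assumed to map a list of strings to one sequence of
    three class scores per string (for example, the loaded judge model).
    """
    examples = [
        PROMPT_TEMPLATE.format(user_prompt=p, assistant_a=a, assistant_b=b)
        for p, a, b in triples
    ]
    labels = []
    for scores in score_fn(examples):
        scores = [float(s) for s in scores]
        labels.append(JUDGEMENT_MAP[scores.index(max(scores))])
    return labels


# Stand-in scorer for illustration only: always favours answer B.
def dummy_score_fn(examples):
    return [[0.1, 0.2, 0.7] for _ in examples]


pairs = [
    ("prompt 1", "response A1", "response B1"),
    ("prompt 2", "response A2", "response B2"),
]
print(judge_batch(dummy_score_fn, pairs))
```

With the real model, `jina` would be passed in place of `dummy_score_fn`.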

	---

	### Generated ranking

	The ranking was obtained using a modified [Russian LLM Arena code](https://github.com/oKatanaaa/ru_llm_arena).
	All judgements were regenerated using the jina-judge model. Regenerating the whole board takes about 16 minutes (about 23 seconds per model) on an RTX 3090.


	| Model | Score | 95% CI | Average #Tokens |
	|--------------------------------------------------|-------|----------------------|-----------------|
	| gpt-4-1106-preview | 82.8 | (-2.2, 2.3) | 541 |
	| gpt-4o-mini | 75.3 | (-2.5, 2.9) | 448 |
	| qwen-2.5-72b-it | 73.1 | (-3.4, 3.1) | 557 |
	| gemma-2-9b-it-sppo-iter3 | 70.6 | (-3.9, 2.8) | 509 |
	| gemma-2-27b-it | 68.7 | (-2.8, 3.8) | 472 |
	| t-lite-instruct-0.1 | 67.5 | (-3.8, 3.8) | 810 |
	| gemma-2-9b-it | 67.0 | (-3.7, 3.3) | 459 |
	| suzume-llama-3-8B-multilingual-orpo-borda-half | 62.4 | (-3.5, 3.7) | 682 |
	| glm-4-9b-chat | 61.5 | (-3.7, 3.0) | 568 |
	| phi-3-medium-4k-instruct | 60.4 | (-3.5, 3.7) | 566 |
	| sfr-iterative-dpo-llama-3-8b-r | 57.2 | (-3.9, 2.2) | 516 |
	| c4ai-command-r-v01 | 55.0 | (-3.9, 3.1) | 529 |
	| suzume-llama-3-8b-multilingual | 51.9 | (-2.8, 3.7) | 641 |
	| mistral-nemo-instruct-2407 | 51.9 | (-3.8, 3.7) | 403 |
	| yandex_gpt_pro | 50.3 | (-3.4, 3.1) | 345 |
	| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
	| hermes-2-theta-llama-3-8b | 49.3 | (-3.4, 3.9) | 485 |
	| starling-lm-7b-beta | 48.3 | (-3.8, 4.0) | 629 |
	| llama-3-8b-saiga-suzume-ties | 47.9 | (-3.9, 5.0) | 763 |
	| llama-3-smaug-8b | 47.6 | (-3.6, 3.1) | 524 |
	| vikhr-it-5.4-fp16-orpo-v2 | 46.8 | (-2.5, 2.7) | 379 |
	| aya-23-8b | 46.1 | (-3.9, 3.9) | 554 |
	| saiga_llama3_8b_v6 | 44.8 | (-3.4, 3.3) | 471 |
	| qwen2-7b-instruct | 43.6 | (-3.0, 2.7) | 340 |
	| vikhr-it-5.2-fp16-cp | 43.6 | (-4.1, 3.3) | 543 |
	| openchat-3.5-0106 | 42.8 | (-3.9, 3.3) | 492 |
	| kolibri-mistral-0427-upd | 42.3 | (-4.2, 3.2) | 551 |
	| paralex-llama-3-8b-sft | 41.8 | (-3.2, 3.7) | 688 |
	| llama-3-instruct-8b-sppo-iter3 | 41.7 | (-3.4, 3.3) | 502 |
	| gpt-3.5-turbo-1106 | 41.5 | (-2.9, 2.1) | 191 |
	| mistral-7b-instruct-v0.3 | 41.1 | (-4.3, 3.5) | 469 |
	| gigachat_pro | 40.9 | (-3.4, 3.6) | 294 |
	| openchat-3.6-8b-20240522 | 39.1 | (-3.2, 4.1) | 428 |
	| vikhr-it-5.3-fp16-32k | 38.8 | (-3.5, 3.3) | 519 |
	| hermes-2-pro-llama-3-8b | 38.4 | (-3.2, 3.1) | 463 |
	| kolibri-vikhr-mistral-0427 | 34.5 | (-2.9, 3.5) | 489 |
	| vikhr-it-5.3-fp16 | 33.5 | (-3.5, 3.8) | 523 |
	| llama-3-instruct-8b-simpo | 32.7 | (-3.9, 3.6) | 417 |
	| meta-llama-3-8b-instruct | 32.1 | (-3.4, 3.3) | 450 |
	| neural-chat-7b-v3-3 | 25.9 | (-2.7, 3.6) | 927 |
	| gigachat_lite | 25.4 | (-2.8, 2.5) | 276 |
	| snorkel-mistral-pairrm-dpo | 10.3 | (-2.0, 2.3) | 773 |
	| storm-7b | 3.7 | (-1.3, 1.6) | 419 |
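
The card does not describe how the scores and intervals are computed; the linked arena code is the authority on that. As a much-simplified illustration of how a score with a 95% interval of this shape can arise, the sketch below assumes each judgement is encoded as an outcome against a baseline (1.0 win, 0.5 tie, 0.0 loss), takes the mean outcome in percent as the score, and uses a percentile bootstrap for the interval. The function name `bootstrap_score` and the encoding are assumptions, and the real pipeline may use a different rating model.

```python
import random


def bootstrap_score(outcomes, n_rounds=1000, seed=0):
    """Mean outcome in percent with a percentile-bootstrap 95% interval.

    `outcomes` holds one value per judged prompt: 1.0 win, 0.5 tie,
    0.0 loss against the baseline (an assumed encoding).
    Returns (score, (lower_offset, upper_offset)), offsets relative to score.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    score = 100.0 * sum(outcomes) / n

    # Resample the judged prompts with replacement and rescore each sample.
    resampled = []
    for _ in range(n_rounds):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        resampled.append(100.0 * sum(sample) / n)
    resampled.sort()

    lower = resampled[int(0.025 * (n_rounds - 1))]
    upper = resampled[int(0.975 * (n_rounds - 1))]
    return round(score, 1), (round(lower - score, 1), round(upper - score, 1))


# A model that ties with the baseline on every prompt gets 50.0 with a
# zero-width interval, the same shape as the gpt-3.5-turbo-0125 row.
print(bootstrap_score([0.5] * 100))  # (50.0, (0.0, 0.0))

# A model with 60 wins and 40 losses out of 100 judged prompts.
print(bootstrap_score([1.0] * 60 + [0.0] * 40))
```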
