Kunoichi-DPO-7B / README.md

Adding Evaluation Results

bb24b30 verified 4 months ago

7.32 kB

	---
	license: cc-by-nc-4.0
	model-index:
	- name: Kunoichi-DPO-7B
	results:
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: AI2 Reasoning Challenge (25-Shot)
	type: ai2_arc
	config: ARC-Challenge
	split: test
	args:
	num_few_shot: 25
	metrics:
	- type: acc_norm
	value: 69.62
	name: normalized accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: HellaSwag (10-Shot)
	type: hellaswag
	split: validation
	args:
	num_few_shot: 10
	metrics:
	- type: acc_norm
	value: 87.14
	name: normalized accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: MMLU (5-Shot)
	type: cais/mmlu
	config: all
	split: test
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 64.79
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: TruthfulQA (0-shot)
	type: truthful_qa
	config: multiple_choice
	split: validation
	args:
	num_few_shot: 0
	metrics:
	- type: mc2
	value: 67.31
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: Winogrande (5-shot)
	type: winogrande
	config: winogrande_xl
	split: validation
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 80.58
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: GSM8k (5-shot)
	type: gsm8k
	config: main
	split: test
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 63.99
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
	name: Open LLM Leaderboard
	---

	![image/png](https://huggingface.co/SanjiWatsuki/Kunoichi-DPO-7B/resolve/main/assets/kunoichi2.png)

	<!-- description start -->
	## Description

	This repository hosts Kunoichi-DPO-7B, a DPO finetune using Intel's Orca pairs with the Alpaca template on Kunoichi-7B. This model is targeted at general use. In my testing, it has stronger reasoning and instruction following capabilities than Kunoichi-7B but it may be worse for roleplaying purposes due to the alignment from the Orca dataset.

	This model is undergoing benchmark testing and I will update the model page with the finalized results.

	\| Model \| MT Bench \| EQ Bench \| MMLU \| Logic Test \|
	\|----------------------\|----------\|----------\|---------\|-------------\|
	\| GPT-4-Turbo \| 9.32 \| - \| - \| - \|
	\| GPT-4 \| 8.99 \| 62.52 \| 86.4 \| 0.86 \|
	\| Kunoichi-DPO-7B \| 8.29 \| 41.60 \| - \| 0.59 \|
	\| Kunoichi-7B \| 8.14 \| 44.32 \| 64.9 \| 0.58 \|
	\| Starling-7B \| 8.09 \| - \| 63.9 \| 0.51 \|
	\| Claude-2 \| 8.06 \| 52.14 \| 78.5 \| - \|
	\| Silicon-Maid-7B \| 7.96 \| 40.44 \| 64.7 \| 0.54 \|
	\| Loyal-Macaroni-Maid-7B \| 7.95 \| 38.66 \| 64.9 \| 0.57 \|
	\| GPT-3.5-Turbo \| 7.94 \| 50.28 \| 70 \| 0.57 \|
	\| Claude-1 \| 7.9 \| - \| 77 \| - \|
	\| Openchat-3.5 \| 7.81 \| 37.08 \| 64.3 \| 0.39 \|
	\| Dolphin-2.6-DPO \| 7.74 \| 42.88 \| 61.9 \| 0.53 \|
	\| Zephyr-7B-beta \| 7.34 \| 38.71 \| 61.4 \| 0.30 \|
	\| Llama-2-70b-chat-hf \| 6.86 \| 51.56 \| 63 \| - \|
	\| Neural-chat-7b-v3-1 \| 6.84 \| 43.61 \| 62.4 \| 0.30 \|

	\| Model \| Average \| AGIEval \| GPT4All \| TruthfulQA \| Bigbench \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| Kunoichi-DPO-7B\|58.4\| 45.08 \| 74\| 66.99\| 47.52\|
	\| [Kunoichi-7B](https://huggingface.co/SanjiWatsuki/Kunoichi-7B)\|57.54\| 44.99\| 74.86\| 63.72\| 46.58\|
	\| [OpenPipe/mistral-ft-optimized-1218](https://huggingface.co/OpenPipe/mistral-ft-optimized-1218)\| 56.85 \| 44.74 \| 75.6 \| 59.89 \| 47.17 \|
	\| [Silicon-Maid-7B](https://huggingface.co/SanjiWatsuki/Silicon-Maid-7B) \| 56.45\| 44.74\| 74.26\| 61.5\| 45.32\|
	\| [mlabonne/NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B) \| 53.51 \| 43.67 \| 73.24 \| 55.37 \| 41.76 \|
	\| [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) \| 52.42 \| 42.75 \| 72.99 \| 52.99 \| 40.94 \|
	\| [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5) \| 51.34 \| 42.67 \| 72.92 \| 47.27 \| 42.51 \|
	\| [berkeley-nest/Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) \| 51.16 \| 42.06 \| 72.72 \| 47.33 \| 42.53 \|
	\| [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) \| 50.99 \| 37.33 \| 71.83 \| 55.1 \| 39.7 \|

	The model is intended to be used with up to an 8k context window. Using a NTK RoPE alpha of 2.6, the model can be used experimentally up to a 16k context window.

	<!-- description end -->
	<!-- prompt-template start -->
	## Prompt template: Custom format, or Alpaca

	### Alpaca:
	```
	Below is an instruction that describes a task. Write a response that appropriately completes the request.

	### Instruction:
	{prompt}

	### Response:
	```

	### SillyTavern format:
	I found the best SillyTavern results from using the Noromaid template.

	SillyTavern config files: [Context](https://files.catbox.moe/ifmhai.json), [Instruct](https://files.catbox.moe/ttw1l9.json).

	Additionally, here is my highly recommended [Text Completion preset](https://huggingface.co/SanjiWatsuki/Loyal-Macaroni-Maid-7B/blob/main/Characters/MinP.json). You can tweak this by adjusting temperature up or dropping min p to boost creativity or raise min p to increase stability. You shouldn't need to touch anything else!

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_SanjiWatsuki__Kunoichi-DPO-7B)

	\| Metric \|Value\|
	\|---------------------------------\|----:\|
	\|Avg. \|72.24\|
	\|AI2 Reasoning Challenge (25-Shot)\|69.62\|
	\|HellaSwag (10-Shot) \|87.14\|
	\|MMLU (5-Shot) \|64.79\|
	\|TruthfulQA (0-shot) \|67.31\|
	\|Winogrande (5-shot) \|80.58\|
	\|GSM8k (5-shot) \|63.99\|