---
license: cc-by-nc-4.0
model-index:
- name: Kunoichi-DPO-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 69.62
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 87.14
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.79
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 67.31
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 80.58
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.99
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
      name: Open LLM Leaderboard
---
## Description

This repository hosts Kunoichi-DPO-7B, a DPO finetune of Kunoichi-7B trained on Intel's Orca pairs with the Alpaca template. The model is targeted at general use. In my testing, it shows stronger reasoning and instruction-following capabilities than Kunoichi-7B, but it may be worse for roleplaying purposes due to the alignment introduced by the Orca dataset.

This model is still undergoing benchmark testing, and I will update the model page with the finalized results.
Model | MT Bench | EQ Bench | MMLU | Logic Test |
---|---|---|---|---|
GPT-4-Turbo | 9.32 | - | - | - |
GPT-4 | 8.99 | 62.52 | 86.4 | 0.86 |
Kunoichi-DPO-7B | 8.29 | 41.60 | - | 0.59 |
Kunoichi-7B | 8.14 | 44.32 | 64.9 | 0.58 |
Starling-7B | 8.09 | - | 63.9 | 0.51 |
Claude-2 | 8.06 | 52.14 | 78.5 | - |
Silicon-Maid-7B | 7.96 | 40.44 | 64.7 | 0.54 |
Loyal-Macaroni-Maid-7B | 7.95 | 38.66 | 64.9 | 0.57 |
GPT-3.5-Turbo | 7.94 | 50.28 | 70 | 0.57 |
Claude-1 | 7.9 | - | 77 | - |
Openchat-3.5 | 7.81 | 37.08 | 64.3 | 0.39 |
Dolphin-2.6-DPO | 7.74 | 42.88 | 61.9 | 0.53 |
Zephyr-7B-beta | 7.34 | 38.71 | 61.4 | 0.30 |
Llama-2-70b-chat-hf | 6.86 | 51.56 | 63 | - |
Neural-chat-7b-v3-1 | 6.84 | 43.61 | 62.4 | 0.30 |
Model | Average | AGIEval | GPT4All | TruthfulQA | Bigbench |
---|---|---|---|---|---|
Kunoichi-DPO-7B | 58.4 | 45.08 | 74 | 66.99 | 47.52 |
Kunoichi-7B | 57.54 | 44.99 | 74.86 | 63.72 | 46.58 |
OpenPipe/mistral-ft-optimized-1218 | 56.85 | 44.74 | 75.6 | 59.89 | 47.17 |
Silicon-Maid-7B | 56.45 | 44.74 | 74.26 | 61.5 | 45.32 |
mlabonne/NeuralHermes-2.5-Mistral-7B | 53.51 | 43.67 | 73.24 | 55.37 | 41.76 |
teknium/OpenHermes-2.5-Mistral-7B | 52.42 | 42.75 | 72.99 | 52.99 | 40.94 |
openchat/openchat_3.5 | 51.34 | 42.67 | 72.92 | 47.27 | 42.51 |
berkeley-nest/Starling-LM-7B-alpha | 51.16 | 42.06 | 72.72 | 47.33 | 42.53 |
HuggingFaceH4/zephyr-7b-beta | 50.99 | 37.33 | 71.83 | 55.1 | 39.7 |
The model is intended for use with up to an 8k context window. With an NTK RoPE alpha of 2.6, the model can be used experimentally up to a 16k context window.
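NTK-aware RoPE scaling of this kind is commonly implemented by raising the rotary embedding base. As a sketch (assuming the usual NTK-aware formula `base * alpha**(dim / (dim - 2))` and Mistral's head dimension of 128; exact handling varies by inference backend):

```python
def ntk_scaled_rope_base(base: float = 10000.0, alpha: float = 2.6,
                         head_dim: int = 128) -> float:
    """Adjusted RoPE base under the common NTK-aware scaling formula.

    alpha=1.0 leaves the base unchanged; larger alpha stretches the
    usable context without retraining (at some quality cost).
    """
    return base * alpha ** (head_dim / (head_dim - 2))

# With alpha=2.6 this lands around 26,400 instead of the default 10,000.
print(ntk_scaled_rope_base())
```

Backends that expose the rotary base directly (for example a rope frequency-base option) could be fed this adjusted value; check your loader's documentation for how it expects NTK alpha versus raw base.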
## Prompt template: Custom format, or Alpaca
Alpaca:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
```
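The Alpaca template above can be filled in programmatically. A minimal helper (the constant and function names here are my own, not part of the model's tooling):

```python
# Alpaca prompt template as shown in this card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n### Response:\n"
)

def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca template."""
    return ALPACA_TEMPLATE.format(prompt=instruction)

print(build_alpaca_prompt("Summarize RoPE scaling in one sentence."))
```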
### SillyTavern format

In SillyTavern, I found the best results using the Noromaid template.
SillyTavern config files: Context, Instruct.
Additionally, here is my highly recommended Text Completion preset. You can tweak it by raising the temperature or lowering Min P to boost creativity, or by raising Min P to increase stability. You shouldn't need to touch anything else!
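For reference, Min P sampling keeps only tokens whose probability is at least `min_p` times the most likely token's probability, then renormalizes; raising `min_p` prunes more of the tail, which is why it stabilizes output. A standalone sketch (my own illustration, not the preset itself):

```python
def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.4, tokens below 0.4 * 0.5 = 0.2 are dropped and the
# remaining mass is renormalized over the survivors.
print(min_p_filter([0.5, 0.3, 0.15, 0.05], 0.4))
```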
## Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 72.24 |
AI2 Reasoning Challenge (25-Shot) | 69.62 |
HellaSwag (10-Shot) | 87.14 |
MMLU (5-Shot) | 64.79 |
TruthfulQA (0-shot) | 67.31 |
Winogrande (5-shot) | 80.58 |
GSM8k (5-shot) | 63.99 |