metadata

license: cc-by-nc-4.0
model-index:
  - name: Kunoichi-DPO-7B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 69.62
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.14
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.79
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 67.31
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 80.58
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.99
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-7B
          name: Open LLM Leaderboard

Description

This repository hosts Kunoichi-DPO-7B, a DPO finetune of Kunoichi-7B trained on Intel's Orca pairs using the Alpaca template. This model is targeted at general use. In my testing, it has stronger reasoning and instruction-following capabilities than Kunoichi-7B, but it may be worse for roleplaying because of the alignment introduced by the Orca dataset.

This model is undergoing benchmark testing and I will update the model page with the finalized results.

Model	MT Bench	EQ Bench	MMLU	Logic Test
GPT-4-Turbo	9.32	-	-	-
GPT-4	8.99	62.52	86.4	0.86
Kunoichi-DPO-7B	8.29	41.60	-	0.59
Kunoichi-7B	8.14	44.32	64.9	0.58
Starling-7B	8.09	-	63.9	0.51
Claude-2	8.06	52.14	78.5	-
Silicon-Maid-7B	7.96	40.44	64.7	0.54
Loyal-Macaroni-Maid-7B	7.95	38.66	64.9	0.57
GPT-3.5-Turbo	7.94	50.28	70	0.57
Claude-1	7.9	-	77	-
Openchat-3.5	7.81	37.08	64.3	0.39
Dolphin-2.6-DPO	7.74	42.88	61.9	0.53
Zephyr-7B-beta	7.34	38.71	61.4	0.30
Llama-2-70b-chat-hf	6.86	51.56	63	-
Neural-chat-7b-v3-1	6.84	43.61	62.4	0.30

Model	Average	AGIEval	GPT4All	TruthfulQA	Bigbench
Kunoichi-DPO-7B	58.4	45.08	74	66.99	47.52
Kunoichi-7B	57.54	44.99	74.86	63.72	46.58
OpenPipe/mistral-ft-optimized-1218	56.85	44.74	75.6	59.89	47.17
Silicon-Maid-7B	56.45	44.74	74.26	61.5	45.32
mlabonne/NeuralHermes-2.5-Mistral-7B	53.51	43.67	73.24	55.37	41.76
teknium/OpenHermes-2.5-Mistral-7B	52.42	42.75	72.99	52.99	40.94
openchat/openchat_3.5	51.34	42.67	72.92	47.27	42.51
berkeley-nest/Starling-LM-7B-alpha	51.16	42.06	72.72	47.33	42.53
HuggingFaceH4/zephyr-7b-beta	50.99	37.33	71.83	55.1	39.7

The model is intended for use with a context window of up to 8k tokens. With an NTK RoPE alpha of 2.6, it can be used experimentally with a context window of up to 16k.
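
As a rough illustration of what an NTK RoPE alpha does, the sketch below computes the rotary base that a loader would use for a given alpha. It assumes the common NTK-aware convention `base * alpha ** (d / (d - 2))`, a default rotary base of 10000, and a head dimension of 128; none of these values come from this model card, so check how your backend defines alpha before relying on the numbers.

```python
def ntk_scaled_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the rotary base after NTK-aware alpha scaling.

    Assumptions (not stated in the model card): the backend uses the
    base * alpha ** (d / (d - 2)) convention, the unscaled base is 10000,
    and each attention head has 128 dimensions.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    return base * alpha ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    # alpha = 2.6 is the value suggested above for experimental 16k context.
    print(f"scaled rotary base: {ntk_scaled_rope_base(2.6):.0f}")
```

An alpha of 1.0 leaves the base unchanged, which corresponds to running the model at its native context length.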

Prompt template: Custom format, or Alpaca

Alpaca:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
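
The Alpaca template above can be filled in programmatically. This is a minimal sketch: the helper name `build_alpaca_prompt` is hypothetical, and the trailing newline after `### Response:` is an assumption about what the model expects.

```python
# Alpaca template as shown above; {prompt} is the only placeholder.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "{prompt}\n\n"
    "### Response:\n"  # trailing newline is an assumption
)


def build_alpaca_prompt(instruction: str) -> str:
    """Insert a user instruction into the Alpaca template (hypothetical helper)."""
    return ALPACA_TEMPLATE.format(prompt=instruction.strip())


if __name__ == "__main__":
    print(build_alpaca_prompt("Summarize the plot of Hamlet in two sentences."))
```

The returned string is what you would pass to your inference backend as the raw prompt; generation should continue from the text after `### Response:`.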

SillyTavern format:

I got the best SillyTavern results with the Noromaid template.

SillyTavern config files: Context, Instruct.

Additionally, here is my highly recommended Text Completion preset. You can tweak it by raising temperature or lowering min p to boost creativity, or by raising min p to increase stability. You shouldn't need to touch anything else!
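
To make the effect of these two settings concrete, here is a small sketch of temperature scaling followed by min p filtering on a toy set of logits. It is not the preset itself or any backend's implementation: the function name `min_p_filter` is hypothetical, and applying temperature before min p is an assumption, since sampler order varies between backends.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return renormalized token probabilities after temperature and min p.

    Assumption: temperature is applied first, then tokens whose probability
    is below min_p * (probability of the most likely token) are dropped.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, -3.0]
    print(min_p_filter(toy_logits, temperature=1.0, min_p=0.1))
    print(min_p_filter(toy_logits, temperature=1.0, min_p=0.001))
```

A higher min p removes more of the low-probability tail, which makes output more stable; a lower min p or a higher temperature leaves more unlikely tokens in play.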

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	72.24
AI2 Reasoning Challenge (25-Shot)	69.62
HellaSwag (10-Shot)	87.14
MMLU (5-Shot)	64.79
TruthfulQA (0-shot)	67.31
Winogrande (5-shot)	80.58
GSM8k (5-shot)	63.99
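
As a quick check, the `Avg.` row is the unweighted mean of the six benchmark scores in the table. The snippet below recomputes it from the values listed above; the dictionary layout is only for illustration.

```python
# Scores copied from the table above.
leaderboard_scores = {
    "AI2 Reasoning Challenge (25-Shot)": 69.62,
    "HellaSwag (10-Shot)": 87.14,
    "MMLU (5-Shot)": 64.79,
    "TruthfulQA (0-shot)": 67.31,
    "Winogrande (5-shot)": 80.58,
    "GSM8k (5-shot)": 63.99,
}

# Unweighted mean across the six benchmarks.
average = sum(leaderboard_scores.values()) / len(leaderboard_scores)

if __name__ == "__main__":
    print(round(average, 2))  # prints 72.24
```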