language:
  - de
  - en
license: cc-by-nc-4.0
tags:
  - merge
  - mergekit
  - lazymergekit
base_model:
  - abideen/AlphaMonarch-dora
  - mayflowergmbh/Wiedervereinigung-7b-dpo
  - flemmingmiguel/NeuDist-Ro-7B
  - ResplendentAI/Flora_DPO_7B
  - yleo/EmertonMonarch-7B
  - occiglot/occiglot-7b-de-en-instruct
  - OpenPipe/mistral-ft-optimized-1227
  - DiscoResearch/DiscoLM_German_7b_v1
  - LeoLM/leo-mistral-hessianai-7b
  - DRXD1000/Phoenix
  - VAGOsolutions/SauerkrautLM-7b-v1-mistral
  - malteos/hermeo-7b
  - FelixChao/WestSeverus-7B-DPO-v2
  - cognitivecomputations/openchat-3.5-0106-laser
model-index:
  - name: Spaetzle-v69-7b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 69.54
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 86.77
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.63
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 65.61
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 81.93
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 68.76
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cstr/Spaetzle-v69-7b
          name: Open LLM Leaderboard

Spaetzle-v69-7b

This is a progressive merge (mostly dare-ties, with some slerp steps) intended as a suitable compromise for English and German local tasks.

There is also a q4_k_m quantized GGUF.

It should work sufficiently well with the ChatML prompt template, since all merged models should have seen ChatML prompts at least in the DPO stage.
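For readers unfamiliar with ChatML, the layout can be sketched as below. This is only an illustration: `to_chatml` is a hypothetical helper, not part of this model or of any library, and the chat template shipped with the model's tokenizer may differ in details such as a default system prompt.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the common ChatML layout.

    Hypothetical helper for illustration; in practice, prefer the chat
    template that ships with the tokenizer.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Was ist ein großes Sprachmodell?"},
]
prompt = to_chatml(messages)
print(prompt)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the prompt ends with an open assistant turn.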

Evaluation

The benchmark scores are not the best achievable, because the model trades them off against several other criteria: German language performance, instruction following, reasoning capabilities, robustness (so far, for example, I have not encountered inserted tokens), and model licensing, among others. Nevertheless, they are not too bad:

Running quantized, it achieves:

  • German EQ Bench: Score (v2_de): 62.59 (Parseable: 171.0).
  • English EQ Bench: Score (v2): 76.43 (Parseable: 171.0).

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Spaetzle-v69-7b | 44.48 | 75.84 | 66.15 | 46.59 | 58.27 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| agieval_aqua_rat | 0 | acc | 25.98 | ± 2.76 |
| | | acc_norm | 23.62 | ± 2.67 |
| agieval_logiqa_en | 0 | acc | 39.78 | ± 1.92 |
| | | acc_norm | 39.48 | ± 1.92 |
| agieval_lsat_ar | 0 | acc | 23.48 | ± 2.80 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 50.00 | ± 2.22 |
| | | acc_norm | 51.76 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 63.94 | ± 2.93 |
| | | acc_norm | 64.31 | ± 2.93 |
| agieval_sat_en | 0 | acc | 76.70 | ± 2.95 |
| | | acc_norm | 77.67 | ± 2.91 |
| agieval_sat_en_without_passage | 0 | acc | 46.12 | ± 3.48 |
| | | acc_norm | 44.17 | ± 3.47 |
| agieval_sat_math | 0 | acc | 34.09 | ± 3.20 |
| | | acc_norm | 30.91 | ± 3.12 |

Average: 44.48%

GPT4All

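| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| arc_challenge | 0 | acc | 63.23 | ± 1.41 |
| | | acc_norm | 64.16 | ± 1.40 |
| arc_easy | 0 | acc | 85.90 | ± 0.71 |
| | | acc_norm | 82.49 | ± 0.78 |
| boolq | 1 | acc | 87.80 | ± 0.57 |
| hellaswag | 0 | acc | 67.05 | ± 0.47 |
| | | acc_norm | 85.19 | ± 0.35 |
| openbookqa | 0 | acc | 38.40 | ± 2.18 |
| | | acc_norm | 48.40 | ± 2.24 |
| piqa | 0 | acc | 82.75 | ± 0.88 |
| | | acc_norm | 84.28 | ± 0.85 |
| winogrande | 0 | acc | 78.53 | ± 1.15 |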

Average: 75.84%
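The suite average appears to be the unweighted mean over tasks, taking acc_norm where it is reported and acc otherwise. That convention is an assumption, not something stated here, but it reproduces the figure above:

```python
# Per-task scores copied from the GPT4All table above.
# Assumption: acc_norm is used where reported, acc otherwise.
gpt4all_scores = {
    "arc_challenge": 64.16,  # acc_norm
    "arc_easy": 82.49,       # acc_norm
    "boolq": 87.80,          # acc (no acc_norm reported)
    "hellaswag": 85.19,      # acc_norm
    "openbookqa": 48.40,     # acc_norm
    "piqa": 84.28,           # acc_norm
    "winogrande": 78.53,     # acc (no acc_norm reported)
}

# Unweighted mean over the seven tasks.
gpt4all_avg = sum(gpt4all_scores.values()) / len(gpt4all_scores)
print(f"GPT4All average: {gpt4all_avg:.2f}%")
```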

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| truthfulqa_mc | 1 | mc1 | 50.67 | ± 1.75 |
| | | mc2 | 66.15 | ± 1.48 |

Average: 66.15%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.67 | ± 2.46 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 40.70 | ± 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 24.79 | ± 2.28 |
| | | exact_str_match | 10.58 | ± 1.63 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.00 | ± 2.07 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.00 | ± 1.59 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 58.00 | ± 2.85 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 45.80 | ± 2.23 |
| bigbench_navigate | 0 | multiple_choice_grade | 52.10 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.55 | ± 1.03 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 48.88 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 30.96 | ± 1.46 |
| bigbench_snarks | 0 | multiple_choice_grade | 73.48 | ± 3.29 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 74.14 | ± 1.40 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 42.70 | ± 1.56 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.60 | ± 1.20 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 18.40 | ± 0.93 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 58.00 | ± 2.85 |

Average: 46.59%

Average score: 58.27%

🧩 Merge Configuration

Spaetzle-v69-7b is a merge of the following models using LazyMergekit:

  • cstr/Spaetzle-v68-7b
  • abideen/AlphaMonarch-dora

The merge tree in total involves the following original models:

For this last merge:

models:
  - model: cstr/Spaetzle-v68-7b
    # no parameters necessary for base model
  - model: abideen/AlphaMonarch-dora
    parameters:
      density: 0.60
      weight: 0.30
merge_method: dare_ties
base_model: cstr/Spaetzle-v68-7b
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
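To give a rough intuition for what `density: 0.60` and `weight: 0.30` mean in a dare_ties merge, here is a toy sketch of the DARE drop-and-rescale step on plain Python lists. It is not mergekit's implementation: `dare_merge` is a hypothetical helper, the TIES sign-election step and any weight normalization are left out, and the `seed` argument does not reproduce mergekit's random number stream.

```python
import random


def dare_merge(base, tuned, density, weight, seed=0):
    """Toy DARE step on flat lists of floats (illustration only).

    Each entry of the delta (tuned - base) is kept with probability
    `density`; kept entries are rescaled by 1 / density so the expected
    delta is unchanged, then added to the base scaled by `weight`.
    """
    rng = random.Random(seed)
    merged = []
    for b, t in zip(base, tuned):
        delta = t - b                  # task-vector entry
        keep = rng.random() < density  # random drop mask
        sparse_delta = delta / density if keep else 0.0
        merged.append(b + weight * sparse_delta)
    return merged


# Tiny made-up "weights", just to show the call.
base = [0.0, 1.0, 2.0, 3.0]
tuned = [1.0, 1.5, 2.0, 2.0]
print(dare_merge(base, tuned, density=0.60, weight=0.30))
```

With `density=1.0` nothing is dropped and the result reduces to `base + weight * (tuned - base)`. Lower densities zero out more of the delta and scale up what remains.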

💻 Usage

!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v69-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---:|
| Avg. | 72.87 |
| AI2 Reasoning Challenge (25-Shot) | 69.54 |
| HellaSwag (10-Shot) | 86.77 |
| MMLU (5-Shot) | 64.63 |
| TruthfulQA (0-shot) | 65.61 |
| Winogrande (5-shot) | 81.93 |
| GSM8k (5-shot) | 68.76 |
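As a quick sanity check, the "Avg." row is consistent with a plain unweighted mean of the six benchmark scores. That reading is an assumption about how the leaderboard aggregates:

```python
# Scores copied from the Open LLM Leaderboard table above.
leaderboard_scores = {
    "AI2 Reasoning Challenge (25-Shot)": 69.54,
    "HellaSwag (10-Shot)": 86.77,
    "MMLU (5-Shot)": 64.63,
    "TruthfulQA (0-shot)": 65.61,
    "Winogrande (5-shot)": 81.93,
    "GSM8k (5-shot)": 68.76,
}

# Unweighted mean over the six benchmarks.
leaderboard_avg = sum(leaderboard_scores.values()) / len(leaderboard_scores)
print(f"Avg.: {leaderboard_avg:.2f}")
```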