tags:
  - merge
  - mergekit
  - lazymergekit
  - flemmingmiguel/NeuDist-Ro-7B
  - johannhartmann/Brezn3
  - ResplendentAI/Flora_DPO_7B
base_model:
  - flemmingmiguel/NeuDist-Ro-7B
  - johannhartmann/Brezn3
  - ResplendentAI/Flora_DPO_7B
language:
  - de
  - en

Spaetzle-v8-7b

This model is intended to perform adequately in German and English across a range of tasks while mostly behaving well: it should not ramble on, intermix tokens from the different chat templates seen during training and adaptation, and so on.

It is mostly a quick test and is considerably weaker in German grammar and orthography than, for example, DiscoLM. For use cases where that matters less than instruction following, reasoning, and similar abilities, it might actually be slightly preferable.

It is a merge of the following models using LazyMergekit:

- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B

All credits are due to the creators of those original models and the training datasets involved.

For a suitable quantized version, try cstr/Spaetzle-v8-7b-GGUF

Evaluation

Open LLM Leaderboard Evaluation Results

Detailed results are published on the Open LLM Leaderboard.

| Metric | Value |
|---|---|
| Avg. | 72.27 |
| AI2 Reasoning Challenge (25-Shot) | 68.69 |
| HellaSwag (10-Shot) | 86.68 |
| MMLU (5-Shot) | 64.60 |
| TruthfulQA (0-shot) | 64.05 |
| Winogrande (5-shot) | 81.45 |
| GSM8k (5-shot) | 68.16 |

EQ-Bench (v2_de): 61.04 / English (v2): 78.3

ScandEval 12.5.2 scores

| Benchmark | Spaetzle-v8-7b Value |
|---|---|
| Model ID | cstr/Spaetzle-v8-7b (few-shot, val) |
| Parameters (millions) | 7242 |
| Vocabulary Size (thousands) | 32 |
| Context | 32768 |
| Commercial | False |
| Speed | 5,980 ± 1,031 / 1,714 ± 552 |
| Rank | 1.85 |
| GermEval | 58.90 ± 2.30 / 45.55 ± 3.30 |
| SB10k | 61.34 ± 1.90 / 72.98 ± 1.30 |
| ScaLA-De | 31.58 ± 4.39 / 65.51 ± 2.23 |
| GermanQuAD | 24.91 ± 3.98 / 60.88 ± 3.31 |
| MLSum | 67.25 ± 1.06 / 22.95 ± 2.64 |
| MMLU-De | 34.62 ± 2.20 / 50.43 ± 1.52 |
| HellaSwag-De | 48.70 ± 2.47 / 61.05 ± 1.79 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Spaetzle-v8-7b | 45.31 | 75.69 | 63.94 | 45.57 | 57.63 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± 2.74 |
| | | acc_norm | 24.80 | ± 2.72 |
| agieval_logiqa_en | 0 | acc | 39.63 | ± 1.92 |
| | | acc_norm | 39.78 | ± 1.92 |
| agieval_lsat_ar | 0 | acc | 23.48 | ± 2.80 |
| | | acc_norm | 24.35 | ± 2.84 |
| agieval_lsat_lr | 0 | acc | 50.98 | ± 2.22 |
| | | acc_norm | 51.96 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 62.08 | ± 2.96 |
| | | acc_norm | 62.83 | ± 2.95 |
| agieval_sat_en | 0 | acc | 78.64 | ± 2.86 |
| | | acc_norm | 79.13 | ± 2.84 |
| agieval_sat_en_without_passage | 0 | acc | 44.66 | ± 3.47 |
| | | acc_norm | 44.66 | ± 3.47 |
| agieval_sat_math | 0 | acc | 37.27 | ± 3.27 |
| | | acc_norm | 35.00 | ± 3.22 |

Average: 45.31%

GPT4All

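| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 63.14 | ± 1.41 |
| | | acc_norm | 64.51 | ± 1.40 |
| arc_easy | 0 | acc | 85.98 | ± 0.71 |
| | | acc_norm | 82.49 | ± 0.78 |
| boolq | 1 | acc | 88.10 | ± 0.57 |
| hellaswag | 0 | acc | 66.31 | ± 0.47 |
| | | acc_norm | 85.17 | ± 0.35 |
| openbookqa | 0 | acc | 38.00 | ± 2.17 |
| | | acc_norm | 47.20 | ± 2.23 |
| piqa | 0 | acc | 83.35 | ± 0.87 |
| | | acc_norm | 84.17 | ± 0.85 |
| winogrande | 0 | acc | 78.22 | ± 1.16 |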

Average: 75.69%

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 47.74 | ± 1.75 |
| | | mc2 | 63.94 | ± 1.53 |

Average: 63.94%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.12 | ± 2.47 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 41.47 | ± 3.07 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.01 | ± 2.19 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.40 | ± 2.08 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.14 | ± 1.60 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 56.00 | ± 2.87 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 45.00 | ± 2.23 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.70 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 70.05 | ± 1.02 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 45.54 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 26.05 | ± 1.39 |
| bigbench_snarks | 0 | multiple_choice_grade | 71.82 | ± 3.35 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.92 | ± 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 44.20 | ± 1.57 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.80 | ± 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 18.23 | ± 0.92 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 56.00 | ± 2.87 |

Average: 45.57%

Average score: 57.63%
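
The overall score appears to be the unweighted mean of the four suite averages reported above (AGIEval, GPT4All, TruthfulQA, Bigbench). A quick check under that assumption, with the numbers copied from this page:

```python
# Suite averages as reported in the sections above (percent).
suite_averages = {
    "AGIEval": 45.31,
    "GPT4All": 75.69,
    "TruthfulQA": 63.94,
    "Bigbench": 45.57,
}

# Assumption: the overall score is a plain (unweighted) mean of the suites.
overall = sum(suite_averages.values()) / len(suite_averages)
print(f"Average score: {overall:.2f}%")
```

Because the mean is unweighted, a suite with a single task (TruthfulQA) counts as much as one with many tasks (Bigbench).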

💻 Usage

# Notebook cell: install dependencies (from a shell, drop the leading "!").
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the conversation with the tokenizer's chat template.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and place it on the available device(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens and print the prompt plus completion.
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

🧩 Configuration

The model uses the ChatML prompt format and should work well with it, since most of the merged models saw ChatML templates during training.
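
For readers who build prompts by hand, here is a minimal sketch of what a ChatML prompt looks like. The helper `to_chatml` and the German system prompt are hypothetical and only illustrate the general ChatML layout; the chat template shipped with the model's tokenizer remains the authoritative source and may differ in detail.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration; prefer tokenizer.apply_chat_template.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Was ist ein großes Sprachmodell?"},
]
print(to_chatml(messages))
```

When using Transformers, `tokenizer.apply_chat_template` is usually the safer route, since it follows whatever template the repository actually defines.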

models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
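
To give a feel for what `density` means in the configuration above, the following is a simplified sketch of the drop-and-rescale (DARE) step on a toy list of numbers. It is not mergekit's implementation: the function name `dare_sparsify` is hypothetical, the TIES sign-election step and the per-model `weight` are left out, and real merges operate on tensors rather than Python lists.

```python
import random


def dare_sparsify(delta, density, rng):
    """Keep each delta entry with probability `density`, rescale kept ones by 1/density.

    `delta` stands for (fine-tuned weight - base weight); dropped entries become 0.
    The rescaling keeps the expected value of each entry unchanged.
    """
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(0)  # fixed seed, analogous in spirit to random_seed in the config
delta = [1.0] * 10_000  # toy delta: every parameter moved by +1.0
sparse = dare_sparsify(delta, density=0.60, rng=rng)

kept_fraction = sum(1 for s in sparse if s != 0.0) / len(sparse)
mean_value = sum(sparse) / len(sparse)
print(f"kept fraction: {kept_fraction:.3f}")       # close to the density
print(f"mean after rescale: {mean_value:.3f}")     # close to the original mean of 1.0
```

With `density: 0.60`, roughly 40% of each model's differences from the base are dropped before the weighted combination; the `weight` values in the configuration (0.30, 0.40, 0.30) sum to 1.0.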