Spaetzle-v58-7b

This model exists only for experimenting with merges that involve the somewhat cumbersome Occiglot. It performs reasonably well, with an EQ Bench score (v2_de) of 61.52 and an English EQ Bench score (v2) of 75.69. It still produces some unwanted tokens, however, and better benchmark results appear achievable, though so far only at the cost of perceived German language quality.

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
Spaetzle-v58-7b	44.03	75.5	60.77	45.78	56.52

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	22.83	±	2.64
		acc_norm	22.83	±	2.64
agieval_logiqa_en	0	acc	37.94	±	1.90
		acc_norm	39.78	±	1.92
agieval_lsat_ar	0	acc	23.48	±	2.80
		acc_norm	21.74	±	2.73
agieval_lsat_lr	0	acc	48.63	±	2.22
		acc_norm	50.78	±	2.22
agieval_lsat_rc	0	acc	62.45	±	2.96
		acc_norm	61.71	±	2.97
agieval_sat_en	0	acc	77.18	±	2.93
		acc_norm	75.73	±	2.99
agieval_sat_en_without_passage	0	acc	46.12	±	3.48
		acc_norm	45.15	±	3.48
agieval_sat_math	0	acc	37.27	±	3.27
		acc_norm	34.55	±	3.21

Average: 44.03%
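
The reported AGIEval average matches the plain mean of the eight acc_norm values in the table above. The short check below reproduces it; the variable names are illustrative only and are not part of the evaluation harness.

```python
# acc_norm values copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 22.83,
    "agieval_logiqa_en": 39.78,
    "agieval_lsat_ar": 21.74,
    "agieval_lsat_lr": 50.78,
    "agieval_lsat_rc": 61.71,
    "agieval_sat_en": 75.73,
    "agieval_sat_en_without_passage": 45.15,
    "agieval_sat_math": 34.55,
}

# Unweighted mean over the eight tasks.
agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"{agieval_average:.2f}")  # prints 44.03
```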

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	61.86	±	1.42
		acc_norm	62.80	±	1.41
arc_easy	0	acc	85.31	±	0.73
		acc_norm	82.58	±	0.78
boolq	1	acc	87.80	±	0.57
hellaswag	0	acc	66.07	±	0.47
		acc_norm	84.37	±	0.36
openbookqa	0	acc	38.20	±	2.18
		acc_norm	49.00	±	2.24
piqa	0	acc	82.54	±	0.89
		acc_norm	84.44	±	0.85
winogrande	0	acc	77.51	±	1.17

Average: 75.5%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	44.55	±	1.74
		mc2	60.77	±	1.54

Average: 60.77%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	56.84	±	3.60
bigbench_date_understanding	0	multiple_choice_grade	66.40	±	2.46
bigbench_disambiguation_qa	0	multiple_choice_grade	35.27	±	2.98
bigbench_geometric_shapes	0	multiple_choice_grade	36.21	±	2.54
		exact_str_match	18.11	±	2.04
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	32.20	±	2.09
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.00	±	1.59
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	56.33	±	2.87
bigbench_movie_recommendation	0	multiple_choice_grade	42.40	±	2.21
bigbench_navigate	0	multiple_choice_grade	50.10	±	1.58
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	70.50	±	1.02
bigbench_ruin_names	0	multiple_choice_grade	45.09	±	2.35
bigbench_salient_translation_error_detection	0	multiple_choice_grade	36.97	±	1.53
bigbench_snarks	0	multiple_choice_grade	71.82	±	3.35
bigbench_sports_understanding	0	multiple_choice_grade	69.78	±	1.46
bigbench_temporal_sequences	0	multiple_choice_grade	35.50	±	1.51
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	21.52	±	1.16
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.83	±	0.92
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	56.33	±	2.87

Average: 45.78%

Average score: 56.52%
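
The overall score is consistent with an unweighted mean of the four suite averages. In turn, the suite averages above appear to use acc_norm where it is reported and acc otherwise (GPT4All), mc2 (TruthfulQA), and multiple_choice_grade (Bigbench). This is an observation that reproduces the reported figures, not a statement about the harness internals. A minimal check with illustrative names:

```python
# Suite averages copied from the summary table at the top of this card.
suite_scores = {
    "AGIEval": 44.03,
    "GPT4All": 75.5,
    "TruthfulQA": 60.77,
    "Bigbench": 45.78,
}

# Unweighted mean over the four suites.
average_score = sum(suite_scores.values()) / len(suite_scores)
print(f"{average_score:.2f}")  # prints 56.52
```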

Elapsed time: 02:03:03

Spaetzle-v58-7b is a merge of the following models using LazyMergekit:

cstr/Spaetzle-v31-7b
cstr/Spaetzle-v12-7b

🧩 Configuration

models:
  - model: cstr/Spaetzle-v57-7b
    # no parameters necessary for base model
  - model: cstr/Spaetzle-v31-7b
    parameters:
      density: 0.60
      weight: 0.30
  - model: cstr/Spaetzle-v12-7b
    parameters:
      density: 0.65
      weight: 0.30
merge_method: dare_ties
base_model: cstr/Spaetzle-v57-7b
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
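
To give a rough intuition for what density and weight control in a dare_ties merge, here is a simplified pure-Python sketch on flat lists of numbers. It is not mergekit's implementation. The function names (dare_prune, dare_ties_merge) are hypothetical, real merges operate per tensor, and details such as normalization are left out. The idea shown: each model's difference from the base is randomly thinned to roughly `density` of its entries and rescaled, then scaled by `weight`, and only contributions that agree with the elected sign are added to the base.

```python
import random


def dare_prune(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def dare_ties_merge(base, finetuned, densities, weights, seed=0):
    """Simplified DARE + TIES-style merge on flat parameter lists (illustrative only)."""
    rng = random.Random(seed)
    # Task vectors: difference between each fine-tuned model and the base.
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]
    # DARE step (random drop and rescale), followed by the per-model weight.
    weighted = [
        [w * d for d in dare_prune(delta, density, rng)]
        for delta, density, w in zip(deltas, densities, weights)
    ]
    merged = []
    for i, b in enumerate(base):
        column = [wd[i] for wd in weighted]
        # TIES-style sign election: use the sign of the summed weighted deltas.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Only contributions that agree with the elected sign are kept.
        agreed = sum(v for v in column if v * sign > 0)
        merged.append(b + agreed)
    return merged


# Toy example using the density and weight values from the configuration above.
base = [0.0, 1.0, -1.0, 0.5]
model_a = [0.2, 1.1, -1.4, 0.5]
model_b = [-0.1, 1.3, -0.9, 0.7]
merged = dare_ties_merge(base, [model_a, model_b], [0.60, 0.65], [0.30, 0.30], seed=0)
print(merged)
```

Because the drop step is random, the result depends on the seed, which is presumably why the configuration pins random_seed.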

💻 Usage

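# Notebook cell; in a regular shell, run the same command without the leading "!"
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v58-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the chat messages into a single prompt string using the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Build a text-generation pipeline; device_map="auto" places the weights on the available device(s)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens and print the generated text
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])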
