---
tags:
- merge
- mergekit
- lazymergekit
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
base_model:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
language:
- de
- en
---

# Spaetzle-v8-7b

This model is intended to show adequate performance in German and English on a number of tasks while mostly behaving well, that is, without rambling or intermixing tokens from the different chat templates seen during training and adaptation.

It is mostly a quick test and is considerably weaker in German grammar and orthography than, for example, DiscoLM. For use cases where that matters less than instruction following, reasoning, and similar abilities, it might actually be slightly preferable.

It is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
* [flemmingmiguel/NeuDist-Ro-7B](https://huggingface.co/flemmingmiguel/NeuDist-Ro-7B)
* [johannhartmann/Brezn3](https://huggingface.co/johannhartmann/Brezn3)
* [ResplendentAI/Flora_DPO_7B](https://huggingface.co/ResplendentAI/Flora_DPO_7B)
* on the basis of [mayflowergmbh/Wiedervereinigung-7b-dpo-laser](https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo-laser)

All credits are due to the creators of those original models and the training datasets involved.

For a suitable quantized version, try [cstr/Spaetzle-v8-7b-GGUF](https://huggingface.co/cstr/Spaetzle-v8-7b-GGUF).


## Evaluation
[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__Spaetzle-v8-7b).

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |72.27|
|AI2 Reasoning Challenge (25-Shot)|68.69|
|HellaSwag (10-Shot)              |86.68|
|MMLU (5-Shot)                    |64.60|
|TruthfulQA (0-shot)              |64.05|
|Winogrande (5-shot)              |81.45|
|GSM8k (5-shot)                   |68.16|

EQ-Bench: 61.04 (German, v2_de) / 78.3 (English, v2)

[ScandEval](https://scandeval.com/german-nlg/) 12.5.2 scores:

| Benchmark             | Spaetzle-v8-7b Value                               |
|-----------------------|----------------------------------------------------|
| Model ID              | cstr/Spaetzle-v8-7b (few-shot, val)                |
| Parameters            | 7242 (millions)                                    |
| Vocabulary Size       | 32 (thousands)                                     |
| Context               | 32768                                              |
| Commercial            | False                                              |
| Speed                 | 5,980 ± 1,031 / 1,714 ± 552                        |
| Rank                  | 1.85                                               |
| GermEval              | 58.90 ± 2.30 / 45.55 ± 3.30                        |
| SB10k                 | 61.34 ± 1.90 / 72.98 ± 1.30                        |
| ScaLA-De              | 31.58 ± 4.39 / 65.51 ± 2.23                        |
| GermanQuAD            | 24.91 ± 3.98 / 60.88 ± 3.31                        |
| MLSum                 | 67.25 ± 1.06 / 22.95 ± 2.64                        |
| MMLU-De               | 34.62 ± 2.20 / 50.43 ± 1.52                        |
| HellaSwag-De          | 48.70 ± 2.47 / 61.05 ± 1.79                        |


|                           Model                            |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[Spaetzle-v8-7b](https://huggingface.co/cstr/Spaetzle-v8-7b)|  45.31|  75.69|     63.94|   45.57|  57.63|

### AGIEval
|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |25.59|±  |  2.74|
|                              |       |acc_norm|24.80|±  |  2.72|
|agieval_logiqa_en             |      0|acc     |39.63|±  |  1.92|
|                              |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar               |      0|acc     |23.48|±  |  2.80|
|                              |       |acc_norm|24.35|±  |  2.84|
|agieval_lsat_lr               |      0|acc     |50.98|±  |  2.22|
|                              |       |acc_norm|51.96|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |62.08|±  |  2.96|
|                              |       |acc_norm|62.83|±  |  2.95|
|agieval_sat_en                |      0|acc     |78.64|±  |  2.86|
|                              |       |acc_norm|79.13|±  |  2.84|
|agieval_sat_en_without_passage|      0|acc     |44.66|±  |  3.47|
|                              |       |acc_norm|44.66|±  |  3.47|
|agieval_sat_math              |      0|acc     |37.27|±  |  3.27|
|                              |       |acc_norm|35.00|±  |  3.22|

Average: 45.31%

### GPT4All
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |63.14|±  |  1.41|
|             |       |acc_norm|64.51|±  |  1.40|
|arc_easy     |      0|acc     |85.98|±  |  0.71|
|             |       |acc_norm|82.49|±  |  0.78|
|boolq        |      1|acc     |88.10|±  |  0.57|
|hellaswag    |      0|acc     |66.31|±  |  0.47|
|             |       |acc_norm|85.17|±  |  0.35|
|openbookqa   |      0|acc     |38.00|±  |  2.17|
|             |       |acc_norm|47.20|±  |  2.23|
|piqa         |      0|acc     |83.35|±  |  0.87|
|             |       |acc_norm|84.17|±  |  0.85|
|winogrande   |      0|acc     |78.22|±  |  1.16|

Average: 75.69%

### TruthfulQA
|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |47.74|±  |  1.75|
|             |       |mc2   |63.94|±  |  1.53|

Average: 63.94%

### Bigbench
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|56.84|±  |  3.60|
|bigbench_date_understanding                     |      0|multiple_choice_grade|66.12|±  |  2.47|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|41.47|±  |  3.07|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|22.01|±  |  2.19|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|31.40|±  |  2.08|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|23.14|±  |  1.60|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|45.00|±  |  2.23|
|bigbench_navigate                               |      0|multiple_choice_grade|50.70|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|70.05|±  |  1.02|
|bigbench_ruin_names                             |      0|multiple_choice_grade|45.54|±  |  2.36|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|26.05|±  |  1.39|
|bigbench_snarks                                 |      0|multiple_choice_grade|71.82|±  |  3.35|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|72.92|±  |  1.42|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|44.20|±  |  1.57|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.80|±  |  1.19|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|18.23|±  |  0.92|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 45.57%

Average score: 57.63%
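
The overall score above appears to be the plain mean of the four suite averages. A minimal sketch that recomputes it from the values in the summary table; the variable names are illustrative, and the unweighted mean is an assumption about how the figure was derived:

```python
# Suite averages copied from the summary table in this card
suite_averages = {
    "AGIEval": 45.31,
    "GPT4All": 75.69,
    "TruthfulQA": 63.94,
    "Bigbench": 45.57,
}
reported_overall = 57.63

# Assumption: the overall score is the unweighted mean of the suite averages
overall = sum(suite_averages.values()) / len(suite_averages)

# Reported values are rounded to two decimals, so allow half a unit in the last place
consistent = abs(overall - reported_overall) <= 0.005
print(f"recomputed={overall:.4f} reported={reported_overall} consistent={consistent}")
```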

## 💻 Usage

```python
# Install dependencies first (in a notebook cell: !pip install -qU transformers accelerate)

import torch
import transformers
from transformers import AutoTokenizer

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the conversation with the chat template shipped with the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and let accelerate place it on the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
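
The snippet above relies on `apply_chat_template` to render the prompt, and this card states that the model uses ChatML. As a rough illustration of what such a prompt looks like, here is a minimal sketch that assumes the common ChatML layout with `<|im_start|>` and `<|im_end|>` markers. The helper `to_chatml` is hypothetical, and the template shipped with the tokenizer may differ in details such as special tokens or a default system prompt:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the common ChatML layout (hypothetical helper)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is a large language model?"}]
print(to_chatml(messages))
```

In practice, prefer `tokenizer.apply_chat_template`, which uses whatever template the tokenizer actually ships with.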


## 🧩 Configuration

The model uses the ChatML prompt format and should work well with it, since most of the merged models saw ChatML templates during training. The merge was produced with the following mergekit configuration:

```yaml
models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
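
To give a feel for what `density` and `weight` do under `merge_method: dare_ties`, here is a toy sketch on plain Python lists. It is not mergekit's implementation. It assumes the usual description of the method: each model's difference from the base is randomly pruned so that roughly a `density` fraction of entries survives, survivors are rescaled by `1 / density`, the result is scaled by `weight`, and only contributions that agree with the dominant sign at each position are added back to the base. The function names are hypothetical, and options such as `int8_mask` or weight normalization are ignored:

```python
import random


def dare_prune(delta, density, rng):
    """Keep each entry with probability `density` and rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def dare_ties_merge(base, finetuned, densities, weights, seed=0):
    """Toy DARE + sign-consensus merge of flat parameter lists (illustrative only)."""
    rng = random.Random(seed)
    # Task vectors: how each fine-tuned model differs from the base
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]
    # DARE drop-and-rescale, then apply the per-model weight
    scaled = [
        [w * d for d in dare_prune(delta, density, rng)]
        for delta, density, w in zip(deltas, densities, weights)
    ]
    merged = []
    for i, b in enumerate(base):
        column = [s[i] for s in scaled]
        # Sign consensus: the sign of the summed contributions wins
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreed = sum(c for c in column if c * sign > 0)
        merged.append(b + agreed)
    return merged


# Toy example with the densities and weights from the config above
base = [0.0, 0.5, -1.0, 2.0]
finetuned = [
    [0.2, 0.4, -1.1, 2.5],
    [0.1, 0.9, -0.7, 1.8],
    [-0.3, 0.6, -1.2, 2.2],
]
print(dare_ties_merge(base, finetuned, [0.60, 0.65, 0.60], [0.30, 0.40, 0.30]))
```

With a density of 1.0 nothing is dropped, and the merge reduces to a sign-filtered weighted sum of the task vectors.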