
	---
	tags:
	- merge
	- mergekit
	- lazymergekit
	- flemmingmiguel/NeuDist-Ro-7B
	- johannhartmann/Brezn3
	- ResplendentAI/Flora_DPO_7B
	base_model:
	- flemmingmiguel/NeuDist-Ro-7B
	- johannhartmann/Brezn3
	- ResplendentAI/Flora_DPO_7B
	language:
	- de
	- en
	---

	# Spaetzle-v8-7b

	This model is intended to perform adequately in German and English across a range of tasks while mostly behaving well, that is, without rambling or intermixing tokens from the different chat templates seen during training and adaptation.

	It is mostly a quick test and is considerably weaker in German grammar and orthography than, e.g., DiscoLM. For use cases where this matters less than instruction following, reasoning, and similar abilities, it might actually be slightly preferable.

	It is a merge of the following models, made with [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing) on the basis of [mayflowergmbh/Wiedervereinigung-7b-dpo-laser](https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo-laser):
	* [flemmingmiguel/NeuDist-Ro-7B](https://huggingface.co/flemmingmiguel/NeuDist-Ro-7B)
	* [johannhartmann/Brezn3](https://huggingface.co/johannhartmann/Brezn3)
	* [ResplendentAI/Flora_DPO_7B](https://huggingface.co/ResplendentAI/Flora_DPO_7B)

	All credits are due to the creators of those original models and the training datasets involved.

	For a suitable quantized version, try [cstr/Spaetzle-v8-7b-GGUF](https://huggingface.co/cstr/Spaetzle-v8-7b-GGUF)


	## Evaluation
	[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__Spaetzle-v8-7b)

	\| Metric \|Value\|
	\|---------------------------------\|----:\|
	\|Avg. \|72.27\|
	\|AI2 Reasoning Challenge (25-Shot)\|68.69\|
	\|HellaSwag (10-Shot) \|86.68\|
	\|MMLU (5-Shot) \|64.60\|
	\|TruthfulQA (0-shot) \|64.05\|
	\|Winogrande (5-shot) \|81.45\|
	\|GSM8k (5-shot) \|68.16\|

	EQ-Bench: 61.04 (German, v2_de) / 78.3 (English, v2)

	[ScandEval](https://scandeval.com/german-nlg/) 12.5.2 scores

	\| Benchmark \| Spaetzle-v8-7b Value \|
	\|-----------------------\|----------------------------------------------------\|
	\| Model ID \| cstr/Spaetzle-v8-7b (few-shot, val) \|
	\| Parameters (millions) \| 7242 \|
	\| Vocabulary Size (thousands) \| 32 \|
	\| Context (tokens) \| 32768 \|
	\| Commercial \| False \|
	\| Speed \| 5,980 ± 1,031 / 1,714 ± 552 \|
	\| Rank \| 1.85 \|
	\| GermEval \| 58.90 ± 2.30 / 45.55 ± 3.30 \|
	\| SB10k \| 61.34 ± 1.90 / 72.98 ± 1.30 \|
	\| ScaLA-De \| 31.58 ± 4.39 / 65.51 ± 2.23 \|
	\| GermanQuAD \| 24.91 ± 3.98 / 60.88 ± 3.31 \|
	\| MLSum \| 67.25 ± 1.06 / 22.95 ± 2.64 \|
	\| MMLU-De \| 34.62 ± 2.20 / 50.43 ± 1.52 \|
	\| HellaSwag-De \| 48.70 ± 2.47 / 61.05 ± 1.79 \|


	\| Model \|AGIEval\|GPT4All\|TruthfulQA\|Bigbench\|Average\|
	\|------------------------------------------------------------\|------:\|------:\|---------:\|-------:\|------:\|
	\|[Spaetzle-v8-7b](https://huggingface.co/cstr/Spaetzle-v8-7b)\| 45.31\| 75.69\| 63.94\| 45.57\| 57.63\|

	### AGIEval
	\| Task \|Version\| Metric \|Value\| \|Stderr\|
	\|------------------------------\|------:\|--------\|----:\|---\|-----:\|
	\|agieval_aqua_rat \| 0\|acc \|25.59\|± \| 2.74\|
	\| \| \|acc_norm\|24.80\|± \| 2.72\|
	\|agieval_logiqa_en \| 0\|acc \|39.63\|± \| 1.92\|
	\| \| \|acc_norm\|39.78\|± \| 1.92\|
	\|agieval_lsat_ar \| 0\|acc \|23.48\|± \| 2.80\|
	\| \| \|acc_norm\|24.35\|± \| 2.84\|
	\|agieval_lsat_lr \| 0\|acc \|50.98\|± \| 2.22\|
	\| \| \|acc_norm\|51.96\|± \| 2.21\|
	\|agieval_lsat_rc \| 0\|acc \|62.08\|± \| 2.96\|
	\| \| \|acc_norm\|62.83\|± \| 2.95\|
	\|agieval_sat_en \| 0\|acc \|78.64\|± \| 2.86\|
	\| \| \|acc_norm\|79.13\|± \| 2.84\|
	\|agieval_sat_en_without_passage\| 0\|acc \|44.66\|± \| 3.47\|
	\| \| \|acc_norm\|44.66\|± \| 3.47\|
	\|agieval_sat_math \| 0\|acc \|37.27\|± \| 3.27\|
	\| \| \|acc_norm\|35.00\|± \| 3.22\|

	Average: 45.31%

	### GPT4All
	\| Task \|Version\| Metric \|Value\| \|Stderr\|
	\|-------------\|------:\|--------\|----:\|---\|-----:\|
	\|arc_challenge\| 0\|acc \|63.14\|± \| 1.41\|
	\| \| \|acc_norm\|64.51\|± \| 1.40\|
	\|arc_easy \| 0\|acc \|85.98\|± \| 0.71\|
	\| \| \|acc_norm\|82.49\|± \| 0.78\|
	\|boolq \| 1\|acc \|88.10\|± \| 0.57\|
	\|hellaswag \| 0\|acc \|66.31\|± \| 0.47\|
	\| \| \|acc_norm\|85.17\|± \| 0.35\|
	\|openbookqa \| 0\|acc \|38.00\|± \| 2.17\|
	\| \| \|acc_norm\|47.20\|± \| 2.23\|
	\|piqa \| 0\|acc \|83.35\|± \| 0.87\|
	\| \| \|acc_norm\|84.17\|± \| 0.85\|
	\|winogrande \| 0\|acc \|78.22\|± \| 1.16\|

	Average: 75.69%

	### TruthfulQA
	\| Task \|Version\|Metric\|Value\| \|Stderr\|
	\|-------------\|------:\|------\|----:\|---\|-----:\|
	\|truthfulqa_mc\| 1\|mc1 \|47.74\|± \| 1.75\|
	\| \| \|mc2 \|63.94\|± \| 1.53\|

	Average: 63.94%

	### Bigbench
	\| Task \|Version\| Metric \|Value\| \|Stderr\|
	\|------------------------------------------------\|------:\|---------------------\|----:\|---\|-----:\|
	\|bigbench_causal_judgement \| 0\|multiple_choice_grade\|56.84\|± \| 3.60\|
	\|bigbench_date_understanding \| 0\|multiple_choice_grade\|66.12\|± \| 2.47\|
	\|bigbench_disambiguation_qa \| 0\|multiple_choice_grade\|41.47\|± \| 3.07\|
	\|bigbench_geometric_shapes \| 0\|multiple_choice_grade\|22.01\|± \| 2.19\|
	\| \| \|exact_str_match \| 0.00\|± \| 0.00\|
	\|bigbench_logical_deduction_five_objects \| 0\|multiple_choice_grade\|31.40\|± \| 2.08\|
	\|bigbench_logical_deduction_seven_objects \| 0\|multiple_choice_grade\|23.14\|± \| 1.60\|
	\|bigbench_logical_deduction_three_objects \| 0\|multiple_choice_grade\|56.00\|± \| 2.87\|
	\|bigbench_movie_recommendation \| 0\|multiple_choice_grade\|45.00\|± \| 2.23\|
	\|bigbench_navigate \| 0\|multiple_choice_grade\|50.70\|± \| 1.58\|
	\|bigbench_reasoning_about_colored_objects \| 0\|multiple_choice_grade\|70.05\|± \| 1.02\|
	\|bigbench_ruin_names \| 0\|multiple_choice_grade\|45.54\|± \| 2.36\|
	\|bigbench_salient_translation_error_detection \| 0\|multiple_choice_grade\|26.05\|± \| 1.39\|
	\|bigbench_snarks \| 0\|multiple_choice_grade\|71.82\|± \| 3.35\|
	\|bigbench_sports_understanding \| 0\|multiple_choice_grade\|72.92\|± \| 1.42\|
	\|bigbench_temporal_sequences \| 0\|multiple_choice_grade\|44.20\|± \| 1.57\|
	\|bigbench_tracking_shuffled_objects_five_objects \| 0\|multiple_choice_grade\|22.80\|± \| 1.19\|
	\|bigbench_tracking_shuffled_objects_seven_objects\| 0\|multiple_choice_grade\|18.23\|± \| 0.92\|
	\|bigbench_tracking_shuffled_objects_three_objects\| 0\|multiple_choice_grade\|56.00\|± \| 2.87\|

	Average: 45.57%

	Average score: 57.63%
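
The reported averages appear to be plain arithmetic means. The following sketch recomputes three of them from the values in the tables above. It assumes that a suite average uses `acc_norm` where it is reported and `acc` otherwise; that rule is inferred from the numbers, not stated by the evaluation harness.

```python
from statistics import fmean

# Per-suite averages, copied from the summary table above.
suite_averages = {"AGIEval": 45.31, "GPT4All": 75.69, "TruthfulQA": 63.94, "Bigbench": 45.57}

# AGIEval: acc_norm for each of the eight tasks.
agieval_acc_norm = [24.80, 39.78, 24.35, 51.96, 62.83, 79.13, 44.66, 35.00]

# GPT4All: acc_norm where reported, otherwise acc (boolq and winogrande).
gpt4all_scores = [64.51, 82.49, 88.10, 85.17, 47.20, 84.17, 78.22]

agieval_mean = fmean(agieval_acc_norm)
gpt4all_mean = fmean(gpt4all_scores)
overall_mean = fmean(list(suite_averages.values()))

print(f"AGIEval mean: {agieval_mean:.4f}")
print(f"GPT4All mean: {gpt4all_mean:.4f}")
print(f"Overall mean: {overall_mean:.4f}")
```

Each recomputed mean should agree with the corresponding reported figure to within 0.01; small differences come from averaging already-rounded values.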

	## 💻 Usage

	```python
	# In a notebook, install the dependencies first:
	#   !pip install -qU transformers accelerate

	import torch
	import transformers
	from transformers import AutoTokenizer

	model = "cstr/Spaetzle-v8-7b"
	messages = [{"role": "user", "content": "What is a large language model?"}]

	# Format the conversation with the model's chat template.
	tokenizer = AutoTokenizer.from_pretrained(model)
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	# Load the model in half precision and let accelerate place it on the available devices.
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Sample up to 256 new tokens; the printed text contains the prompt followed by the completion.
	outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
	print(outputs[0]["generated_text"])
	```


	## 🧩 Configuration

	The model uses the ChatML prompt format and should work well with it, since it is merged from models that (mostly) saw ChatML templates during training.
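
For orientation, here is a minimal sketch of what a ChatML prompt looks like as a string. The helper `to_chatml` is hypothetical and written only for illustration; the template shipped with the model's tokenizer may differ in details such as whitespace or system-prompt handling, so `tokenizer.apply_chat_template` remains the safer choice in practice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration; not part of transformers.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn tells the model to start answering.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is a large language model?"}]
print(to_chatml(messages))
```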

	```yaml
	models:
	  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
	    # no parameters necessary for base model
	  - model: flemmingmiguel/NeuDist-Ro-7B
	    parameters:
	      density: 0.60
	      weight: 0.30
	  - model: johannhartmann/Brezn3
	    parameters:
	      density: 0.65
	      weight: 0.40
	  - model: ResplendentAI/Flora_DPO_7B
	    parameters:
	      density: 0.6
	      weight: 0.3
	merge_method: dare_ties
	base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
	parameters:
	  int8_mask: true
	dtype: bfloat16
	random_seed: 0
	tokenizer_source: base
	```
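
To give a rough intuition for what `density` and `weight` control in a `dare_ties` merge, here is a toy sketch over flat lists of floats. It is a simplified illustration and not mergekit's implementation: the function name `dare_ties_merge` is hypothetical, real merges operate per tensor, and details such as weight normalization and `int8_mask` are left out.

```python
import random


def dare_ties_merge(base, finetuned, densities, weights, seed=0):
    """Toy DARE-TIES merge of flat parameter lists (illustration only)."""
    rng = random.Random(seed)
    n = len(base)

    # 1. Task vectors: how each fine-tuned model differs from the base.
    deltas = [[ft[i] - base[i] for i in range(n)] for ft in finetuned]

    # 2. DARE: keep each delta entry with probability `density`,
    #    and rescale the survivors by 1 / density.
    pruned = [
        [d / density if rng.random() < density else 0.0 for d in delta]
        for delta, density in zip(deltas, densities)
    ]

    merged = list(base)
    for i in range(n):
        contributions = [w * p[i] for w, p in zip(weights, pruned)]
        # 3. TIES sign election: the sign of the weighted sum wins.
        sign = 1.0 if sum(contributions) >= 0 else -1.0
        # 4. Add only the contributions that agree with the elected sign.
        merged[i] += sum(c for c in contributions if c * sign > 0)
    return merged


# Toy usage with the densities, weights, and seed from the configuration above.
base = [0.10, -0.20, 0.30, 0.00]
finetuned = [
    [0.15, -0.25, 0.30, 0.05],
    [0.05, -0.10, 0.40, 0.02],
    [0.12, -0.30, 0.25, -0.01],
]
print(dare_ties_merge(base, finetuned, [0.60, 0.65, 0.6], [0.30, 0.40, 0.3], seed=0))
```

In this reading, a lower `density` drops more of each model's changes before merging, while `weight` scales how strongly the surviving changes pull the result away from the base model.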