badger-l3-instruct-32k / README.md

maldv

	---
	license: cc-by-nc-4.0
	library_name: transformers
	tags:
	- llama-3
	model-index:
	- name: badger-l3-instruct-32k
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 63.65
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 81.4
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 67.13
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 55.02
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 77.35
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 72.4
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maldv/badger-l3-instruct-32k
	      name: Open LLM Leaderboard
	---

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b19c1b098c85365af5a83e/5dq0evzBjVulEOjYHW68O.png)

	# Badger/δ Llama 3 Instruct 32k

	I haven't released any of my base merges until now, but this one seems worthy.

	Badger is a recursive maximally disjoint pairwise normalized Fourier interpolation of the following models:

	```python
	models = [
	    'Einstein-v6.1-Llama3-8B',
	    'L3-TheSpice-8b-v0.8.3',
	    'dolphin-2.9-llama3-8b',
	    'Configurable-Hermes-2-Pro-Llama-3-8B',
	    'MAmmoTH2-8B-Plus',
	    'Pantheon-RP-1.0-8b-Llama-3',
	    'Tiamat-8b-1.2-Llama-3-DPO',
	    'Buzz-8b-Large-v0.5',
	    'Kei_Llama3_8B',
	    'Llama-3-Lumimaid-8B-v0.1',
	    'llama-3-cat-8b-instruct-pytorch',
	    'Llama-3SOME-8B-v1',
	    'Roleplay-Llama-3-8B',
	    'Llama-3-LewdPlay-8B-evo',
	    'opus-v1.2-llama-3-8b-instruct-run3.5-epoch2.5',
	    'meta-llama-3-8b-instruct-hf-ortho-baukit-5fail-3000total-bf16',
	    'Poppy_Porpoise-0.72-L3-8B',
	    'Llama-3-8B-Instruct-norefusal',
	    'Meta-Llama-3-8B-Instruct-DPO',
	    'badger',
	    'Llama-3-Refueled',
	    'Llama-3-8B-Instruct-DPO-v0.4',
	    'Llama-3-8B-Instruct-Gradient-1048k',
	    'Mahou-1.0-llama3-8B',
	    'Llama-3-SauerkrautLM-8b-Instruct',
	    'Llama-3-Soliloquy-8B-v2'
	]
	```
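
The notebook in this repository is the authoritative description of the merge. As a rough illustration only, the sketch below shows one way a pairwise Fourier-domain blend of two same-shaped weight tensors could look, and how such a blend could be applied recursively over a list. The names `fourier_blend` and `merge_pairwise`, the blend weight `t`, and the choice to interpolate spectral magnitudes while taking the phase from the blended spectrum are all assumptions made for this example, not the author's method.

```python
import numpy as np


def fourier_blend(a, b, t=0.5):
    """Blend two same-shaped real tensors in the frequency domain (illustrative only)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("tensors must have the same shape")
    fa, fb = np.fft.fftn(a), np.fft.fftn(b)
    # Assumption: interpolate the magnitudes linearly ...
    magnitude = (1.0 - t) * np.abs(fa) + t * np.abs(fb)
    # ... and take the phase of the linearly blended spectrum.
    phase = np.angle((1.0 - t) * fa + t * fb)
    # Inputs are real, so the imaginary part is only rounding noise.
    return np.fft.ifftn(magnitude * np.exp(1j * phase)).real


def merge_pairwise(tensors, t=0.5):
    """Hypothetical recursion: blend adjacent pairs until one tensor remains."""
    layer = list(tensors)
    if not layer:
        raise ValueError("need at least one tensor")
    while len(layer) > 1:
        merged = [fourier_blend(layer[i], layer[i + 1], t)
                  for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:  # carry an unpaired tensor to the next round
            merged.append(layer[-1])
        layer = merged
    return layer[0]
```

With `t=0` the blend returns the first tensor and with `t=1` the second, which makes a convenient sanity check before applying it to real weights.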

	I have included the notebook code I used to generate the model, for anyone who is curious. I set the RoPE scaling factor to 4 in the config, and output seems coherent at both 16k and 32k context.
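
To make the rope scale arithmetic concrete: Llama 3 8B has a native context of 8192 tokens, so a factor of 4 corresponds to 32768 positions. The sketch below builds the matching config fields. The `rope_scaling` dictionary with `type` and `factor` keys follows the convention used by Llama configs in `transformers`; the `linear` scaling type and the helper name `rope_scaled_config` are assumptions for this example, so check the `config.json` in this repository for the values actually used.

```python
def rope_scaled_config(base_context=8192, factor=4.0, scaling_type="linear"):
    """Return config fields for a RoPE-scaled context window (illustrative only)."""
    return {
        # Assumption: linear scaling; the shipped config.json is authoritative.
        "rope_scaling": {"type": scaling_type, "factor": factor},
        "max_position_embeddings": int(base_context * factor),
    }


config_fields = rope_scaled_config()
print(config_fields["max_position_embeddings"])  # 32768
```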

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_maldv__badger-l3-instruct-32k).

	| Metric                            | Value |
	|-----------------------------------|------:|
	| Avg.                              | 69.49 |
	| AI2 Reasoning Challenge (25-Shot) | 63.65 |
	| HellaSwag (10-Shot)               | 81.40 |
	| MMLU (5-Shot)                     | 67.13 |
	| TruthfulQA (0-shot)               | 55.02 |
	| Winogrande (5-shot)               | 77.35 |
	| GSM8k (5-shot)                    | 72.40 |
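
The `Avg.` row is the unweighted mean of the six benchmark scores. A short check, using only the values from the table (the variable names are just for this example):

```python
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 63.65,
    "HellaSwag (10-Shot)": 81.40,
    "MMLU (5-Shot)": 67.13,
    "TruthfulQA (0-shot)": 55.02,
    "Winogrande (5-shot)": 77.35,
    "GSM8k (5-shot)": 72.40,
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 69.49
```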