	---
	language:
	- en
	license: mit
	tags:
	- merge
	- slerp
	datasets:
	- Open-Orca/OpenOrca
	- conceptofmind/cot_submix_original
	- conceptofmind/t0_submix_original
	- conceptofmind/niv2_submix_original
	- conceptofmind/flan2021_submix_original
	- ehartford/dolphin
	metrics:
	- accuracy
	- bleu
	inference: false
	model-index:
	- name: Dorflan
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 54.44
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 75.78
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 51.36
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 51.17
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 72.61
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 0.38
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=formulae/Dorflan
	      name: Open LLM Leaderboard
	---
	<h1 style="text-align: center">Dorflan</h1>
	<h2 style="text-align: center">An experimental model</h2>
	<hr>


	| Model | Average ⬆️ | ARC | HellaSwag | MMLU | TruthfulQA |
	|:------------:|:------------:|:-------:|:---------:|:-------:|:----------:|
	| formulae/Dorflan 📑 | 58.19 | 54.44 | 75.78 | 51.36 | 51.17 |



	## Model Details
	Dorflan is an experimental merged model created from the following three foundation models:

	- stabilityai/StableBeluga-7B
	- ehartford/dolphin-llama2-7b
	- AIDC-ai-business/Marcoroni-7B

	Dorflan was created by merging the weights and architectures of these three models using a custom merging technique. No further fine-tuning was performed after the merge.

	Once the model obtains its evaluation scores, we'll know whether it works.
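
The card does not spell out the custom merging technique, but its metadata tags include `slerp`. As a rough illustration only, here is a minimal sketch of spherical linear interpolation between two flattened weight vectors. The function name `slerp`, the `eps` threshold, and the fallback to linear interpolation are assumptions made for this example, not the author's actual procedure, and the card does not say how three models were combined.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two equal-length weight vectors.

    t = 0 returns v0, t = 1 returns v1. Hypothetical helper for illustration;
    it is not taken from the Dorflan merge itself.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))

    # A zero vector has no direction, so fall back to plain linear interpolation.
    if norm0 < eps or norm1 < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # Angle between the two vectors, with the cosine clamped for numerical safety.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)

    # Nearly parallel (or anti-parallel) vectors: the slerp weights are
    # ill-conditioned, so use linear interpolation instead.
    if sin_theta < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]
```

For example, `slerp(0.5, [1.0, 0.0], [0.0, 1.0])` returns a vector halfway along the arc between the two inputs, with both components close to 0.7071. In a real merge this would be applied per parameter tensor, which is beyond what the card documents.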

	## Intended Use
	As an experimental model, Dorflan is intended for testing and research purposes only. It should not be used for production systems or to generate content for public use.

	## Training Data
	Dorflan inherits training data from its three foundation models:

	- StableBeluga-7B: COT, Niv2, t0, & FLAN2021
	- dolphin-llama2-7b: Dolphin
	- Marcoroni-7B: OpenOrca

	## Limitations
	As an untested merged model, Dorflan has unknown capabilities and limitations. Potential issues include:

	- Instability due to merged architectures
	- Compounded bias and issues from all three foundation models
	- Decreased performance on some tasks compared to the foundation models

	Extensive testing is required to characterize Dorflan's capabilities and limitations.

	## Ethical Considerations
	- Dorflan may exhibit harmful biases inherited from its training data
	- Output may be unreliable or manipulated due to instability
	- Experimental nature increases potential for misuse

	Use this model ethically and do not deploy it for sensitive applications.

	## Contact Information
	Please report issues or concerns with this model to the creator for further investigation.
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_formulae__Dorflan)

	| Metric | Value |
	|-----------------------|---------------------------|
	| Avg. | 47.44 |
	| ARC (25-shot) | 54.44 |
	| HellaSwag (10-shot) | 75.78 |
	| MMLU (5-shot) | 51.36 |
	| TruthfulQA (0-shot) | 51.17 |
	| Winogrande (5-shot) | 72.61 |
	| GSM8K (5-shot) | 0.38 |
	| DROP (3-shot) | 26.37 |

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_formulae__Dorflan)

	| Metric |Value|
	|---------------------------------|----:|
	|Avg. |50.96|
	|AI2 Reasoning Challenge (25-Shot)|54.44|
	|HellaSwag (10-Shot) |75.78|
	|MMLU (5-Shot) |51.36|
	|TruthfulQA (0-shot) |51.17|
	|Winogrande (5-shot) |72.61|
	|GSM8k (5-shot) | 0.38|
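
The three averages reported in this card (58.19, 47.44, and 50.96) differ because each one covers a different set of benchmarks. A short check that reproduces them from the scores listed in the tables; the names `scores` and `average` are illustrative and not part of any leaderboard tooling:

```python
# Benchmark scores copied from the tables in this card.
scores = {
    "ARC": 54.44,
    "HellaSwag": 75.78,
    "MMLU": 51.36,
    "TruthfulQA": 51.17,
    "Winogrande": 72.61,
    "GSM8K": 0.38,
    "DROP": 26.37,
}


def average(names):
    """Unweighted mean of the selected benchmarks, rounded to two decimals."""
    return round(sum(scores[n] for n in names) / len(names), 2)


four = ["ARC", "HellaSwag", "MMLU", "TruthfulQA"]   # table at the top of the card
six = four + ["Winogrande", "GSM8K"]                # second leaderboard table
seven = six + ["DROP"]                              # first leaderboard table

print(average(four), average(six), average(seven))
```

The near-zero GSM8K score is what pulls the six- and seven-benchmark averages well below the four-benchmark figure.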