G-reen
/

EXPERIMENT-DPO-m7b2-3-merged

Text Generation

Inference Endpoints

text-generation-inference

4-bit precision

Model card Files Files and versions Community

EXPERIMENT-DPO-m7b2-3-merged / README.md

G-reen's picture

Update README.md

e89c954 verified 2 months ago

|

raw history blame contribute delete

No virus

1.3 kB

	---
	license: "apache-2.0"
	---

	This model was trained as part of a series of experiments testing the performance of pure DPO vs SFT vs ORPO, all supported by Unsloth/Huggingface TRL.

	Note: This model failed to train because the LR was too high (stopped early at 300 steps). Do not use!

	Benchmarks

	Average 29.55

	ARC 29.52

	HellaSwag 25.9

	MMLU 23.12

	TruthfulQA 48.27

	Winogrande 50.51

	GSM8K 0

	Training Details

	Duration: ~3 hours on one Kaggle T4 with Unsloth

	Model: https://huggingface.co/unsloth/mistral-7b-v0.2-bnb-4bit

	Dataset: https://huggingface.co/datasets/argilla/dpo-mix-7k

	Rank: 8

	Alpha: 16

	Learning rate: 5e-4

	Beta: 0.1

	Batch size: 8

	Epochs: 1

	Learning rate scheduler: Linear

	Prompt Format: ChatML
	```
	<\|im_start\|>system
	You are a helpful assistant.<\|im_end\|>
	<\|im_start\|>user
	Why is the sky blue?<\|im_end\|>
	<\|im_start\|>assistant
	```


	WanDB Reports

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65a5c0e82823ba72ed2cee7d/hICdKQjk2mkRODyhyUfVV.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65a5c0e82823ba72ed2cee7d/CZobUkPWirguxOoE_hll0.png)

	[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)