	---
	base_model: slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8
	datasets:
	- slm-research-vn/dpo-format-function-calling-v4
	- slm-research-vn/dpo-format-glaive-code-assistant-v3-with-mistral-large-slm-iter4
	- argilla/dpo-mix-7k
	library_name: peft
	tags:
	- alignment-handbook
	- trl
	- dpo
	- generated_from_trainer
	model-index:
	- name: Qwen2-7B-Instruct-SPPO-Function-call-v2.12
	  results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# Qwen2-7B-Instruct-SPPO-Function-call-v2.12

	This model is a fine-tuned version of [slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8](https://huggingface.co/slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8), trained on three datasets: slm-research-vn/dpo-format-function-calling-v4, slm-research-vn/dpo-format-glaive-code-assistant-v3-with-mistral-large-slm-iter4, and argilla/dpo-mix-7k.
	It achieves the following results on the evaluation set:
	- Loss: 0.3322
	- Rewards/chosen: 0.5523
	- Rewards/rejected: -0.7005
	- Rewards/accuracies: 0.9017
	- Rewards/margins: 1.2528
	- Logps/rejected: -278.7327
	- Logps/chosen: -129.0717
	- Logits/rejected: -0.5984
	- Logits/chosen: -0.7738
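The reward figures above are related to one another: the reported margin equals the mean chosen reward minus the mean rejected reward (0.5523 - (-0.7005) = 1.2528). The sketch below shows how such summary metrics could be computed from per-example rewards. It assumes the usual DPO-style definitions (accuracy is the fraction of pairs where the chosen reward exceeds the rejected one); the function name `summarize_rewards` and the toy reward values are hypothetical and are not taken from the training code.

```python
def summarize_rewards(chosen, rejected):
    """Summarize per-example rewards in the style of the metrics listed above.

    `chosen` and `rejected` are equal-length lists of per-example rewards.
    """
    n = len(chosen)
    margins = [c - r for c, r in zip(chosen, rejected)]
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # Fraction of pairs in which the chosen answer is rewarded more.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        # Mean margin; equals mean(chosen) - mean(rejected).
        "rewards/margins": sum(margins) / n,
    }


# Consistency check on the values reported in this card.
reported_chosen = 0.5523
reported_rejected = -0.7005
reported_margin = 1.2528
computed_margin = reported_chosen - reported_rejected

# Toy per-example rewards (hypothetical, for illustration only).
toy = summarize_rewards(
    chosen=[0.9, 0.4, 0.2, 0.7],
    rejected=[-0.5, 0.6, -1.0, -0.3],
)
```

Note that the evaluation loss is averaged per example, so it cannot in general be recovered from the mean margin alone.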

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	More information needed

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-06
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 8
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 32
	- total_eval_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 1
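Two of the values above follow from the others: the total train batch size is the per-device batch size times the number of devices times the gradient accumulation steps (1 × 8 × 4 = 32), and the learning rate at any step follows from the cosine schedule with a 10% warmup. The following is a minimal sketch of both calculations. The schedule is a simplified linear-warmup-plus-cosine-decay formula, not the Trainer's actual implementation, and the step count of 1000 used in the loop is a hypothetical figure chosen for illustration.

```python
import math

# Values copied from the hyperparameter list above.
PER_DEVICE_TRAIN_BATCH_SIZE = 1
NUM_DEVICES = 8
GRADIENT_ACCUMULATION_STEPS = 4
PEAK_LEARNING_RATE = 1e-6
WARMUP_RATIO = 0.1


def total_train_batch_size():
    """Number of examples contributing to one optimizer step."""
    return PER_DEVICE_TRAIN_BATCH_SIZE * NUM_DEVICES * GRADIENT_ACCUMULATION_STEPS


def cosine_lr(step, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LEARNING_RATE * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("total_train_batch_size:", total_train_batch_size())
    # Hypothetical run length, for illustration only.
    for step in (0, 50, 100, 550, 1000):
        print(step, cosine_lr(step, total_steps=1000))
```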

	### Training results

	| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
	|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
	| 0.6806        | 0.0916 | 100  | 0.6816          | 0.0303         | 0.0099           | 0.6445             | 0.0205          | -264.5260      | -139.5110    | -0.5879         | -0.7638       |
	| 0.5704        | 0.1832 | 200  | 0.5993          | 0.3495         | 0.1473           | 0.8237             | 0.2023          | -261.7780      | -133.1277    | -0.5881         | -0.7638       |
	| 0.5032        | 0.2749 | 300  | 0.5313          | 0.5795         | 0.1792           | 0.8526             | 0.4003          | -261.1383      | -128.5271    | -0.5893         | -0.7651       |
	| 0.4548        | 0.3665 | 400  | 0.4727          | 0.6406         | 0.0523           | 0.8844             | 0.5884          | -263.6780      | -127.3051    | -0.5901         | -0.7660       |
	| 0.3823        | 0.4581 | 500  | 0.4235          | 0.6412         | -0.1314          | 0.8931             | 0.7726          | -267.3507      | -127.2934    | -0.5914         | -0.7672       |
	| 0.3513        | 0.5497 | 600  | 0.3843          | 0.6087         | -0.3415          | 0.9133             | 0.9502          | -271.5532      | -127.9448    | -0.5936         | -0.7693       |
	| 0.3444        | 0.6413 | 700  | 0.3571          | 0.5871         | -0.5028          | 0.9104             | 1.0898          | -274.7784      | -128.3763    | -0.5965         | -0.7721       |
	| 0.3486        | 0.7329 | 800  | 0.3427          | 0.5681         | -0.6155          | 0.9104             | 1.1836          | -277.0341      | -128.7559    | -0.5971         | -0.7725       |
	| 0.3317        | 0.8246 | 900  | 0.3349          | 0.5586         | -0.6739          | 0.9133             | 1.2326          | -278.2013      | -128.9451    | -0.5993         | -0.7748       |
	| 0.3077        | 0.9162 | 1000 | 0.3328          | 0.5530         | -0.6974          | 0.9075             | 1.2504          | -278.6715      | -129.0585    | -0.5998         | -0.7754       |
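The Epoch and Step columns allow a rough estimate of the run length, which the card does not state directly. The sketch below divides step by epoch for the first and last logged rows and multiplies by the total train batch size of 32 from the hyperparameter list. The result is only an approximation: the epoch column is rounded to four decimals, and a final partial batch is not accounted for. The sketch also checks that the validation loss in the table decreases at every logged step.

```python
# (epoch, step) pairs copied from the first and last rows of the table.
log_points = [(0.0916, 100), (0.9162, 1000)]

# Validation losses copied from the table, in step order.
validation_losses = [
    0.6816, 0.5993, 0.5313, 0.4727, 0.4235,
    0.3843, 0.3571, 0.3427, 0.3349, 0.3328,
]

TOTAL_TRAIN_BATCH_SIZE = 32  # from the hyperparameter list

# Each row gives its own estimate of optimizer steps per epoch.
steps_per_epoch_estimates = [step / epoch for epoch, step in log_points]
steps_per_epoch = sum(steps_per_epoch_estimates) / len(steps_per_epoch_estimates)

# Approximate number of preference pairs seen in one epoch.
approx_train_examples = steps_per_epoch * TOTAL_TRAIN_BATCH_SIZE

# True if every logged validation loss is lower than the previous one.
loss_strictly_decreasing = all(
    later < earlier
    for earlier, later in zip(validation_losses, validation_losses[1:])
)

if __name__ == "__main__":
    print(f"estimated steps per epoch: {steps_per_epoch:.1f}")
    print(f"approximate training pairs: {approx_train_examples:.0f}")
    print(f"validation loss strictly decreasing: {loss_strictly_decreasing}")
```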


	### Framework versions

	- PEFT 0.12.0
	- Transformers 4.44.0
	- Pytorch 2.3.1+cu121
	- Datasets 2.20.0
	- Tokenizers 0.19.1
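
Since the card lists `peft` as its library, the weights are presumably published as an adapter to be applied on top of the base model. The following is a minimal loading sketch under that assumption. The adapter repository id is a guess built from the model name in this card and the base model's organization; it is not confirmed by the card. The sketch also assumes the base checkpoint loads with `AutoModelForCausalLM` and that `accelerate` is installed for `device_map="auto"`.

```python
# Base model id, taken from the card's metadata.
BASE_MODEL_ID = "slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8"

# Assumption: the adapter lives under the same organization with the model
# name from this card. Replace with the actual repository id if it differs.
ADAPTER_ID = "slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.12"


def load_model(base_model_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model and attach the PEFT adapter on top of it."""
    # Imported lazily so the heavy dependencies are only needed when loading.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # requires the accelerate package
    )
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `tokenizer, model = load_model()` downloads both repositories, so it needs network access and enough memory for a 7B-parameter model.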