---
language:
- en
license: mit
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
model-index:
- name: yi6B_Vicuna
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 46.16
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 69.3
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.43
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 48.11
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 18.42
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
---


Bug: There is a known issue with the tokenizer that I am still investigating. In the meantime, you can use the original Yi tokenizer configuration.
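
As a workaround for the tokenizer issue, a minimal sketch of loading the fine-tuned weights together with the base Yi tokenizer is given here. It assumes the `transformers` library, that the weights are published as `lorinma/yi6B_Vicuna` (the id that appears in the leaderboard links), and that the original tokenizer lives at `01-ai/Yi-6B` on the Hub. Both ids and the helper name `load_model_and_tokenizer` are assumptions, not something this repository ships.

```python
# Assumed Hub ids: the fine-tuned weights and the original Yi tokenizer.
MODEL_ID = "lorinma/yi6B_Vicuna"
TOKENIZER_ID = "01-ai/Yi-6B"


def load_model_and_tokenizer(model_id=MODEL_ID, tokenizer_id=TOKENIZER_ID):
    """Load the fine-tuned weights, but take the tokenizer from the base Yi repo."""
    # Imported lazily so this file can be read or imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return model, tokenizer
```

Calling `load_model_and_tokenizer()` downloads both repositories, so it needs network access and enough memory for a 6B-parameter model. The `yi` prompt template used by MedicalGPT is not reproduced in this sketch.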


This model reproduces Vicuna, but is based on Yi-6B. The training data was ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json.

The training framework was https://github.com/shibing624/MedicalGPT, launched with the following shell command:
	```
	CUDA_VISIBLE_DEVICES=0,1,2,3,5 torchrun --nproc_per_node 5 ../supervised_finetuning.py \
	--model_type auto \
	--model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
	--tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
	--train_file_dir ../data/finetune/vicuna/ \
	--per_device_train_batch_size 2 \
	--do_train \
	--max_train_samples -1 \
	--num_train_epochs 3 \
	--learning_rate 2e-5 \
	--weight_decay 0. \
	--bf16 \
	--use_peft False \
	--logging_strategy steps \
	--logging_steps 10 \
	--save_strategy epoch \
	--save_total_limit 5 \
	--gradient_accumulation_steps 1 \
	--preprocessing_num_workers 8 \
	--output_dir ../outputs/20240106_yi6B_vicuna \
	--overwrite_output_dir \
	--ddp_timeout 30000 \
	--logging_first_step True \
	--torch_dtype bfloat16 \
	--device_map auto \
	--report_to tensorboard \
	--ddp_find_unused_parameters False \
	--gradient_checkpointing True \
	--cache_dir ./cache \
	--model_max_length 4096 \
	--deepspeed ../deepspeed_zero_stage2_config_no16.json \
	--template_name yi
	```

Training ran on 5 × A800 GPUs for 3 epochs:
	```
	*** train metrics ***
	epoch = 3.0
	train_loss = 0.3785
	train_runtime = 1 day, 10:01:13.95
	train_samples = 93204
	train_samples_per_second = 2.24
	train_steps_per_second = 0.224
	```
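
The throughput figures above can be cross-checked against the training command. The sketch below is only a back-of-the-envelope estimate: it assumes the effective batch size is per-device batch size × number of GPUs × gradient accumulation steps, and that the last partial batch of each epoch is kept (hence the `ceil`). The variable names are illustrative.

```python
import math

# Values copied from the training command and the train metrics.
per_device_batch_size = 2   # --per_device_train_batch_size
num_gpus = 5                # --nproc_per_node
grad_accum_steps = 1        # --gradient_accumulation_steps
train_samples = 93204
num_epochs = 3

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = per_device_batch_size * num_gpus * grad_accum_steps

# Estimated optimizer steps, assuming the last partial batch is not dropped.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# The reported throughputs should differ by the effective batch size.
throughput_ratio = 2.24 / 0.224

print(effective_batch_size, steps_per_epoch, total_steps, round(throughput_ratio, 2))
```

The ratio of `train_samples_per_second` to `train_steps_per_second` comes out at the same effective batch size of 10, which is consistent with the flags in the command.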

Post-training inference also uses this repository:
	```
	CUDA_VISIBLE_DEVICES=4 python gradio_demo.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --tokenizer_path /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 4
	CUDA_VISIBLE_DEVICES=6 python inference.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 6 --interactive --tokenizer_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B
	```

Some preliminary results show that the conversation is natural and informative (unsurprisingly).

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/WfQYyyLxtXA2KlePmIPQJ.png)

We also observe that the unfiltering seems to be working! Heads up: some examples are unsafe and inappropriate. This is entirely for research purposes, to test how alignment-filtered SFT data affects an LLM's final output.

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/pklSsljCRN34QuL2ZF2zU.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/22pTSVkBCVlQ5N8A8JBkF.png)

Update: evaluation on the Open LLM Leaderboard:

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/Xp11HLQqwh0HMSJgpr19n.png)
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_lorinma__yi6B_Vicuna)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |51.02|
|AI2 Reasoning Challenge (25-Shot)|46.16|
|HellaSwag (10-Shot)              |69.30|
|MMLU (5-Shot)                    |58.43|
|TruthfulQA (0-shot)              |48.11|
|Winogrande (5-shot)              |65.67|
|GSM8k (5-shot)                   |18.42|
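
The `Avg.` row appears to be the unweighted mean of the six benchmark scores. A small sketch to check this, with the scores copied from the table (the dictionary name `scores` is illustrative):

```python
# Benchmark scores copied from the leaderboard table.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 46.16,
    "HellaSwag (10-Shot)": 69.30,
    "MMLU (5-Shot)": 58.43,
    "TruthfulQA (0-shot)": 48.11,
    "Winogrande (5-shot)": 65.67,
    "GSM8k (5-shot)": 18.42,
}

# Unweighted mean over the six tasks.
average = sum(scores.values()) / len(scores)
print(average)
```

The mean agrees with the reported 51.02 up to rounding in the second decimal place.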