---
language:
- en
license: mit
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
model-index:
- name: yi6B_Vicuna
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 46.16
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 69.3
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.43
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 48.11
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 18.42
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
---
**Bug**: There is an unresolved issue with the tokenizer that I am still investigating. As a workaround, you can use the original Yi tokenizer configuration.
This model reproduces Vicuna on top of Yi-6B. The training data was ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json.
Training used the MedicalGPT framework (https://github.com/shibing624/MedicalGPT) with the following shell command:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,5 torchrun --nproc_per_node 5 ../supervised_finetuning.py \
--model_type auto \
--model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
--tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
--train_file_dir ../data/finetune/vicuna/ \
--per_device_train_batch_size 2 \
--do_train \
--max_train_samples -1 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--bf16 \
--use_peft False \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy epoch \
--save_total_limit 5 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 8 \
--output_dir ../outputs/20240106_yi6B_vicuna \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--torch_dtype bfloat16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--cache_dir ./cache \
--model_max_length 4096 \
--deepspeed ../deepspeed_zero_stage2_config_no16.json \
--template_name yi
```
Training ran on 5 A800 GPUs for 3 epochs:
```
***** train metrics *****
epoch = 3.0
train_loss = 0.3785
train_runtime = 1 day, 10:01:13.95
train_samples = 93204
train_samples_per_second = 2.24
train_steps_per_second = 0.224
```
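The throughput figures in this log are consistent with the launch configuration. A minimal sketch of the arithmetic, assuming that `train_samples` counts examples per epoch and that the last partial batch is kept (neither is stated in the log):

```python
import math

# Values copied from the launch command and the training log above.
per_device_batch_size = 2          # --per_device_train_batch_size 2
num_gpus = 5                       # --nproc_per_node 5
gradient_accumulation_steps = 1    # --gradient_accumulation_steps 1
train_samples = 93204
num_epochs = 3

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps

# Assumption: the last partial batch is not dropped, so round up.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Logged samples/s divided by logged steps/s should equal the effective batch size.
logged_ratio = 2.24 / 0.224

print("effective batch size:", effective_batch_size)
print("estimated steps per epoch:", steps_per_epoch)
print("estimated total steps:", total_steps)
print("logged samples per step:", round(logged_ratio))
```

The step counts are estimates derived from the flags, not values reported by the trainer.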
Inference after training also uses the MedicalGPT repository:
```
CUDA_VISIBLE_DEVICES=4 python gradio_demo.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --tokenizer_path /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 4
CUDA_VISIBLE_DEVICES=6 python inference.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 6 --interactive --tokenizer_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B
```
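Outside MedicalGPT, the checkpoint can presumably also be loaded with Hugging Face `transformers`. The sketch below is an illustration under stated assumptions, not the author's method: it pairs the fine-tuned weights with the original Yi tokenizer, as the bug note suggests, and assumes the Hub IDs `lorinma/yi6B_Vicuna` and `01-ai/Yi-6B`. The helper name `generate` is hypothetical. The `yi` conversation template used by MedicalGPT is not reproduced here, so the prompt is passed as plain text and outputs may differ from the demo.

```python
MODEL_ID = "lorinma/yi6B_Vicuna"   # assumed Hub ID of this checkpoint
TOKENIZER_ID = "01-ai/Yi-6B"       # assumed Hub ID of the original Yi tokenizer


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a completion with the fine-tuned weights and the base Yi tokenizer."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Calling `generate("...")` downloads the weights on first use and runs on CPU unless the model and inputs are moved to a GPU.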
Preliminary results show that the conversation is natural and informative, as expected.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/WfQYyyLxtXA2KlePmIPQJ.png)
The unfiltered training data also appears to have the intended effect. **Heads up:** some examples are unsafe and inappropriate. This model is intended entirely for research purposes, to test how alignment-filtered SFT data affects an LLM's final output.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/pklSsljCRN34QuL2ZF2zU.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/22pTSVkBCVlQ5N8A8JBkF.png)
**Update:** Evaluation results on the Open LLM Leaderboard:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/Xp11HLQqwh0HMSJgpr19n.png)
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_lorinma__yi6B_Vicuna)
| Metric |Value|
|---------------------------------|----:|
|Avg. |51.02|
|AI2 Reasoning Challenge (25-Shot)|46.16|
|HellaSwag (10-Shot) |69.30|
|MMLU (5-Shot) |58.43|
|TruthfulQA (0-shot) |48.11|
|Winogrande (5-shot) |65.67|
|GSM8k (5-shot) |18.42|
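
The `Avg.` row appears to be the unweighted mean of the six benchmark scores. That is an inference from the numbers in the table, not something stated in the results themselves. A quick check:

```python
# Scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 46.16,
    "HellaSwag (10-Shot)": 69.30,
    "MMLU (5-Shot)": 58.43,
    "TruthfulQA (0-shot)": 48.11,
    "Winogrande (5-shot)": 65.67,
    "GSM8k (5-shot)": 18.42,
}

reported_average = 51.02

# Unweighted mean across all benchmarks.
average = sum(scores.values()) / len(scores)

print(f"computed mean: {average:.3f}")
print(f"matches reported Avg. within 0.01: {abs(average - reported_average) < 0.01}")
```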