---
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
language:
- zh
- en
---

* TODO: Upload pending. Training is finished; still testing.
* Update: Having a bit of an issue with the tokenizer; still figuring things out.


This model reproduces Vicuna, but is based on Yi-6B. The training data I used was ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json.
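
The training file follows the usual ShareGPT conversation schema. As a quick sanity check, here is a minimal inspection sketch (the file path under `train_file_dir` and the field names are assumptions based on the public ShareGPT releases, not something stated in this card):

```
import json

# Assumed location under the --train_file_dir used below
path = "../data/finetune/vicuna/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json"
with open(path, encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} conversations")

# Each record is typically shaped like:
# {"id": "...", "conversations": [{"from": "human", "value": "..."},
#                                 {"from": "gpt",   "value": "..."}, ...]}
for turn in data[0]["conversations"][:4]:
    print(turn["from"], ":", turn["value"][:80])
```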

Hyperparameters:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,5 torchrun --nproc_per_node 5 ../supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --train_file_dir ../data/finetune/vicuna/ \
    --per_device_train_batch_size 2 \
    --do_train \
    --max_train_samples -1 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --bf16 \
    --use_peft False \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy epoch \
    --save_total_limit 5 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --output_dir ../outputs/20240106_yi6B_vicuna \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache \
    --model_max_length 4096 \
    --deepspeed ../deepspeed_zero_stage2_config_no16.json \
    --template_name yi   
```
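
With these settings, the effective global batch size is 2 (per device) × 5 (GPUs) × 1 (gradient accumulation) = 10 sequences per optimizer step, at a model_max_length of 4096 tokens.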

The training used 5×A800 GPUs for 3 epochs:
```
***** train metrics *****
  epoch                    =                3.0
  train_loss               =             0.3785
  train_runtime            = 1 day, 10:01:13.95
  train_samples            =              93204
  train_samples_per_second =               2.24
  train_steps_per_second   =              0.224
```
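
To try the resulting checkpoint, here is a minimal generation sketch with Hugging Face Transformers (the model path is the `output_dir` from the command above; the ChatML-style prompt is an assumption, so swap in whatever format the `yi` template actually produces):

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# output_dir from the training command above; adjust to wherever the checkpoint lives
model_dir = "../outputs/20240106_yi6B_vicuna"

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# ChatML-style prompt (an assumption); replace with the exact template used in training
prompt = "<|im_start|>user\nWhat is instruction tuning?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```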

From some preliminary results, we can see that the conversations are natural and informative (unsurprisingly).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/WfQYyyLxtXA2KlePmIPQJ.png)

We also observe that the unfiltering seems to be working! **Heads up:** some examples are unsafe and inappropriate; this is entirely for research purposes, to test how removing alignment filtering from SFT data affects the LLM's final output.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/pklSsljCRN34QuL2ZF2zU.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/22pTSVkBCVlQ5N8A8JBkF.png)