How to save fine-tuned model and load it again?

by gabriead - opened May 5, 2023

May 5, 2023

•

edited May 5, 2023

Hi @TheBloke, I have used your model "TheBloke/wizardLM-7B-HF" and fine-tuned it on a custom dataset with the train_freeform.py script from the WizardLM GitHub repo. The training works fine, but the resulting pytorch_model.bin is only a few kilobytes, so something is off with saving the model. I am using "safe_save_model_for_hf_trainer(..)" in their script for saving. Did you experience something similar? Any suggestions for what I could try?

TheBloke

Owner May 5, 2023

•

edited May 5, 2023

I'm not an expert on training/fine-tuning, but I have tested re-creating the WizardLM fine-tuning method, using their original dataset on top of Llama 7B. If you were to follow that method, you'd take their original dataset, add yours to it, and re-run the whole thing. But that takes ~15 hours on 8 x A6000 GPUs.

Instead, what you most likely want to do is train a LoRA. That 'freezes' the weights in the already-trained WizardLM model and then trains new weights on top. The resulting file will be very small, maybe 50-100MB. But you can then load WizardLM + your LoRA and get the results of both.
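To see why a LoRA file ends up so small, it helps to count the weights it adds. The sketch below is only back-of-the-envelope arithmetic, not code from the WizardLM or Alpaca Lora repos. It assumes the LLaMA-7B shape (hidden size 4096, 32 layers) and that LoRA is applied to square attention projections; the rank and the number of adapted modules per layer are illustrative choices, and the function names are made up for this example.

```python
def lora_adapter_params(hidden_size, num_layers, rank, modules_per_layer):
    """Count trainable weights LoRA adds to square (hidden x hidden) projections.

    Each adapted weight matrix gets two small matrices, A (rank x hidden) and
    B (hidden x rank), so it gains 2 * rank * hidden new weights.
    """
    per_module = 2 * rank * hidden_size
    return num_layers * modules_per_layer * per_module


def size_mb(num_params, bytes_per_param):
    """Size in megabytes (10^6 bytes) for a given storage precision."""
    return num_params * bytes_per_param / 1e6


# Assumed LLaMA-7B shape: hidden size 4096, 32 transformer layers.
# "small": rank 8 on two projections per layer (an illustrative setting).
small = lora_adapter_params(4096, 32, rank=8, modules_per_layer=2)
# "large": rank 16 on four projections per layer (another illustrative setting).
large = lora_adapter_params(4096, 32, rank=16, modules_per_layer=4)

print(f"small adapter: {small:,} params, {size_mb(small, 4):.1f} MB in fp32")
print(f"large adapter: {large:,} params, {size_mb(large, 4):.1f} MB in fp32")
```

Under these assumptions the adapter holds a few million to a few tens of millions of weights, compared with roughly 6.7 billion in the frozen base model, which is consistent with a file measured in tens of megabytes.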

Here's a short introductory video, with code, on LoRA fine tuning: https://youtu.be/Us5ZFp16PaU

If you'd like a place to talk about this in more detail, with people who have done this already, join us on the Alpaca Lora Discord: https://discord.gg/ZMHkCGy9

gabriead

May 5, 2023

Hi Tom,
thanks a lot for your response. Just to better understand what I might be overlooking: shouldn't I be able to take your model and further fine-tune it on my custom data using their training script? What am I missing here? Or is it because I didn't recover their original model weights and use those for fine-tuning? Can I use your model as the base model with the Alpaca training script that includes the LoRA technique?

TheBloke

Owner May 5, 2023

•

edited May 5, 2023

You certainly can use the original WizardLM code; just expect it to take a long time and require multiple big GPUs.

The scripts provided by WizardLM are for a full training. They load the original model (they assume you load base Llama 7B, but you could load WizardLM 7B instead) and then train the specified dataset on top of it. It's arguably the best method for training, but it's also slow and expensive.

The LoRA method is much quicker and doesn't require nearly as much hardware. This is because it freezes the base weights and then trains new weights on top of them. Whereas a full fine tuning might need 8 x A6000 or 4 x A100 GPUs, a LoRA fine tune could be done on a single 3090 or 4090 (and other GPUs). The result may not be quite as good, but it still works and it's much more affordable and accessible.
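A rough memory estimate shows where the hardware gap comes from. This is a simplified sketch, not a measurement: it assumes mixed-precision training with Adam, which commonly keeps about 16 bytes of state per trainable parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), and it ignores activations entirely. The 6.74B parameter count is the figure DeepSpeed reports in the training log in this thread; the adapter size and function names are illustrative.

```python
GB = 1e9  # decimal gigabytes


def full_finetune_state_gb(num_params):
    """Weights + gradients + optimizer state when every parameter is trained."""
    # fp16 weights (2) + fp16 grads (2) + fp32 master copy (4)
    # + Adam first moment (4) + Adam second moment (4)
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param / GB


def lora_state_gb(base_params, adapter_params):
    """Frozen fp16 base model plus full training state for the adapter only."""
    return (base_params * 2 + adapter_params * 16) / GB


base = 6.74e9          # parameter count reported in the DeepSpeed log
adapter = 16_777_216   # illustrative adapter size (assumption)

print(f"full fine-tune: ~{full_finetune_state_gb(base):.0f} GB before activations")
print(f"LoRA:           ~{lora_state_gb(base, adapter):.1f} GB before activations")
```

Under these assumptions a full fine-tune needs on the order of 100 GB just for model and optimizer state, which has to be spread across several GPUs, while the LoRA case stays within reach of a single 24 GB card.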

Yes, you can use the Alpaca Lora code on top of this model. That will produce a LoRA fine tuning that starts with WizardLM 7B and then fine tunes your own data on top. That's probably the best place to start.

gabriead

May 5, 2023

Thanks again for your feedback! Since I have already fine-tuned your model (wizardLM-7B-HF) with their training script, do you have any idea why the weights/checkpoints it produces are not usable? This approach results in a pytorch_model.bin that cannot be used for inference. When I load the checkpoint, it throws this error:
File "/anaconda/envs/llamax/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

TheBloke

Owner May 5, 2023

I don't know, but if the file is only a few KB it definitely didn't run properly. Do you have the output log from when it ran?

How long did it take? It should have been in the region of 12 - 100 hours, depending on your hardware.

gabriead

May 5, 2023

If I understand your model card correctly, wizardLM-7B-HF is the result of applying the delta weights, right? So fine-tuning it should produce a usable model again?

TheBloke

Owner May 5, 2023

Yes, you should be able to run a fine-tuning on top of this HF format model. I've not tried it specifically, but it should work.

The resulting output files would then be the same size as the base model, i.e. around 13GB.
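The 13GB figure can be turned into a quick sanity check on a saved checkpoint. The sketch below is a hypothetical helper, not part of the WizardLM scripts: it assumes fp16 weights (2 bytes per parameter), the 6.74B parameter count that DeepSpeed reports in the training log in this thread, and weight files named pytorch_model*.bin in the output directory.

```python
import os


def expected_checkpoint_bytes(num_params, bytes_per_param=2):
    """Approximate size of the saved weights (fp16 = 2 bytes per parameter)."""
    return int(num_params * bytes_per_param)


def weight_bytes_on_disk(output_dir):
    """Sum the sizes of all pytorch_model*.bin files in output_dir."""
    total = 0
    for name in os.listdir(output_dir):
        if name.startswith("pytorch_model") and name.endswith(".bin"):
            total += os.path.getsize(os.path.join(output_dir, name))
    return total


def looks_complete(output_dir, num_params, bytes_per_param=2, tolerance=0.1):
    """True if the weight files are within `tolerance` of the expected size."""
    expected = expected_checkpoint_bytes(num_params, bytes_per_param)
    actual = weight_bytes_on_disk(output_dir)
    return actual >= expected * (1 - tolerance)


if __name__ == "__main__":
    # 6.74e9 parameters in fp16 should come to roughly 13.48 GB.
    print(expected_checkpoint_bytes(6.74e9) / 1e9, "GB expected")
    if os.path.isdir("finetuned_model"):
        print("complete:", looks_complete("finetuned_model", 6.74e9))
```

A pytorch_model.bin of a few kilobytes or megabytes fails this check by several orders of magnitude, which points at the save step rather than at the training itself.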

gabriead

May 5, 2023

•

edited May 5, 2023

I am using an A100 and it took 3 hours with the original data. For the log below I used only 1 sample from their dataset and 1 epoch, to make it as fast as possible to show the training process. From my understanding the training seems to be running just fine, but maybe I am overlooking something.
Original command on terminal:
deepspeed train_freeform.py \
--model_name_or_path "TheBloke/wizardLM-7B-HF" \
--data_path alpaca_evol_instruct_1.json \
--output_dir finetuned_model \
--num_train_epochs 1 \
--model_max_length 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 800 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--warmup_steps 2 \
--logging_steps 2 \
--lr_scheduler_type "cosine" \
--report_to "tensorboard" \
--gradient_checkpointing True \
--deepspeed deepspeed_config.json \
--fp16 True
[2023-05-05 08:10:06,103] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-05 08:10:06,103] [INFO] [runner.py:550:main] cmd = /anaconda/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_freeform.py --model_name_or_path TheBloke/wizardLM-7B-HF --data_path alpaca_evol_instruct_1.json --output_dir finetuned_model --num_train_epochs 1 --model_max_length 512 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 800 --save_total_limit 3 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed deepspeed_config.json --fp16 True
[2023-05-05 08:10:07,676] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-05 08:10:07,676] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-05 08:10:07,676] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-05 08:10:07,677] [INFO] [launch.py:162:main] dist_world_size=4
[2023-05-05 08:10:07,677] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-05-05 08:10:37,866] [INFO] [partition_parameters.py:415:exit] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.26s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.26s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.26s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.26s/it]
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[2023-05-05 08:10:55,895] [WARNING] [cpu_adam.py:85:init] FP16 params for CPUAdam may not work on AMD CPUs
[2023-05-05 08:10:55,896] [WARNING] [cpu_adam.py:85:init] FP16 params for CPUAdam may not work on AMD CPUs
[2023-05-05 08:10:55,920] [WARNING] [cpu_adam.py:85:init] FP16 params for CPUAdam may not work on AMD CPUs
[2023-05-05 08:10:55,920] [WARNING] [cpu_adam.py:85:init] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/azureuser/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.449831962585449 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4798665046691895 seconds
Time to load cpu_adam op: 2.4798264503479004 seconds
Time to load cpu_adam op: 2.4801440238952637 seconds
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/azureuser/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.08765268325805664 seconds
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/azureuser/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.059262752532958984 seconds
Loading extension module utils...
Time to load utils op: 0.10180521011352539 seconds
Loading extension module utils...
Time to load utils op: 0.10201025009155273 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006754398345947266 seconds
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006668567657470703 seconds
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0018012523651123047 seconds
Using /home/azureuser/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003707408905029297 seconds
{'train_runtime': 18.7022, 'train_samples_per_second': 0.053, 'train_steps_per_second': 0.053, 'train_loss': 0.024322509765625, 'epoch': 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.50s/it]
[2023-05-05 08:13:25,962] [INFO] [launch.py:350:main] Process 27755 exits successfully.
[2023-05-05 08:13:27,965] [INFO] [launch.py:350:main] Process 27754 exits successfully.
[2023-05-05 08:13:27,965] [INFO] [launch.py:350:main] Process 27753 exits successfully.
[2023-05-05 08:13:28,966] [INFO] [launch.py:350:main] Process 27756 exits successfully.

TheBloke

Owner May 5, 2023

Based on the progress bar at the end it only trained one row of data, which is why it took 18 seconds instead of 12+ hours for 1 epoch :) When I did a quick test on 4 x A100 80GB the ETA was 35 hours for 3 epochs on their original 70k WizardLM dataset

Can you show me the output of ls -al alpaca_evol_instruct_1.json?

gabriead

May 5, 2023

Yeah, exactly: there is just one sample in the dataset used for the log above. I don't have the original log from when I was training on my full dataset, which is why I quickly reproduced it to show the output.
For the full-fledged "alpaca_evol_instruct_70k.json" I can confirm what you said earlier: it will take about 35 hours using your model (see log below). So, long story short: I think that in theory the training is running correctly. The question that remains is why it is not working for my specific dataset. I currently have no idea how to debug this.

This is the log using your model and the alpaca_evol_instruct_70k.json.
deepspeed train_freeform.py \
--model_name_or_path "TheBloke/wizardLM-7B-HF" \
--data_path alpaca_evol_instruct_70k.json \
--output_dir finetuned_model \
--num_train_epochs 3 \
--model_max_length 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 800 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--warmup_steps 2 \
--logging_steps 2 \
--lr_scheduler_type "cosine" \
--report_to "tensorboard" \
--gradient_checkpointing True \
--deepspeed deepspeed_config.json \
--fp16 True
[2023-05-05 10:48:06,422] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-05 10:48:06,422] [INFO] [runner.py:550:main] cmd = /anaconda/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_freeform.py --model_name_or_path TheBloke/wizardLM-7B-HF --data_path alpaca_evol_instruct_70k.json --output_dir finetuned_model --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 800 --save_total_limit 3 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed deepspeed_config.json --fp16 True
[2023-05-05 10:48:07,879] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-05 10:48:07,879] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-05 10:48:07,879] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-05 10:48:07,879] [INFO] [launch.py:162:main] dist_world_size=4
[2023-05-05 10:48:07,879] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 555/555 [00:00<00:00, 1.32MB/s]
Downloading (…)model.bin.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 56.3MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [00:18<00:00, 548MB/s]
Downloading (…)l-00002-of-00002.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:06<00:00, 537MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.53s/it]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.52s/it]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.52s/it]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.54s/it]
[2023-05-05 10:49:10,190] [INFO] [partition_parameters.py:415:exit] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.48s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.48s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.49s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.49s/it]
Downloading (…)neration_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 227kB/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 727/727 [00:00<00:00, 1.45MB/s]
Downloading tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 91.5MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 57.6kB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 435/435 [00:00<00:00, 857kB/s]
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Loading data...
.......................

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004246234893798828 seconds
{'loss': 0.1972, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.2099, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.2171, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.2163, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.2404, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.2244, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.1907, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.2223, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.2312, 'learning_rate': 0.0, 'epoch': 0.01}
0%|▍ | 19/6564 [05:01<29:32:06, 16.25s/it]
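The step counts in the progress bars can be reproduced from the command-line flags. The sketch below is plain arithmetic, under the assumptions that the "70k" dataset holds exactly 70,000 samples and that one optimizer step consumes per-device batch size x number of GPUs x gradient accumulation steps; the function name is made up for this example.

```python
import math


def total_optimizer_steps(num_samples, per_device_batch, num_gpus, grad_accum, epochs):
    """Approximate number of optimizer steps for a data-parallel run."""
    global_batch = per_device_batch * num_gpus * grad_accum
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return steps_per_epoch * epochs


# Flags from the 70k run: batch size 8 per device, 4 GPUs, no accumulation, 3 epochs.
steps = total_optimizer_steps(70_000, per_device_batch=8, num_gpus=4, grad_accum=1, epochs=3)
# 16.25 s/it is the rate shown in the progress bar.
hours = steps * 16.25 / 3600

print(steps, "steps, roughly", round(hours, 1), "hours")

# The one-sample test run: a single sample still costs one full step.
print(total_optimizer_steps(1, per_device_batch=8, num_gpus=4, grad_accum=1, epochs=1))
```

With these inputs the result agrees with the 6564 total steps in the progress bar, and the one-sample run comes out as the single step shown in the earlier log.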

gabriead

May 5, 2023

If I use a very small fraction of the alpaca_evol_instruct_70k.json dataset (say five samples) and fine-tune, it should produce a pytorch_model.bin that I can load again and run inference with, correct?

TheBloke

Owner May 5, 2023

•

edited May 5, 2023

OK I see. And yes I think that should work. Although I've never actually gone as far as the full training, I just started it and let it run for 30-60 mins to get an idea of how it worked.

I know a couple of people who have completed the full training, such as @ehartford, and I know they tweaked the code a bit in certain places. Maybe the final model saving was one of those.

Why don't you come to the Alpaca Lora Discord? It'll be easier to get support there: https://discord.gg/eSdptpkm

gabriead

May 5, 2023

Thanks a lot for your effort so far! Yes, then I will join you guys in Discord :-)

CR2022

May 16, 2023

How can we join the Alpaca Lora Discord? Because the link is no longer valid.

TheBloke

Owner May 16, 2023

Here you go https://discord.gg/Rh9e8MfH

waytohou

Jul 4, 2023

Hi gabriead, I am running into the same problem here: the pytorch_model.bin is only a few MB and can't be loaded correctly. It always fails with the "'weight' must be 2-D" error, and I don't know how to deal with it. Did you solve it in the end? If so, can you tell me how?
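One possible cause, which nobody in this thread confirmed: the training logs above show DeepSpeed partitioning the parameters (the partition_parameters.py and "Parameter Offload" lines), which is what ZeRO stage 3 does. Under ZeRO-3 each process holds only a shard of every weight, so a plain state_dict save can write empty placeholder tensors unless the config tells DeepSpeed to gather the full weights at save time. The relevant DeepSpeed option is stage3_gather_16bit_weights_on_model_save (older releases call it stage3_gather_fp16_weights_on_model_save). The sketch below checks a config for that setting. The function name and the example config are hypothetical, and the actual deepspeed_config.json used here was never posted, so treat this as something to check rather than a diagnosis.

```python
import json

# Option names DeepSpeed has used for gathering ZeRO-3 weights when saving.
GATHER_KEYS = (
    "stage3_gather_16bit_weights_on_model_save",
    "stage3_gather_fp16_weights_on_model_save",  # name in older DeepSpeed releases
)


def zero3_save_warning(config):
    """Return a warning if a ZeRO-3 config does not gather weights on save, else None."""
    zero = config.get("zero_optimization", {})
    if zero.get("stage") != 3:
        return None  # the partitioned-weights issue is specific to stage 3
    if any(zero.get(key) is True for key in GATHER_KEYS):
        return None
    return (
        "ZeRO stage 3 is enabled without gather-on-save; "
        "a saved pytorch_model.bin may contain only placeholder tensors."
    )


# Hypothetical config for illustration; load your own file with json.load(open(path)).
example = json.loads('{"zero_optimization": {"stage": 3}}')
print(zero3_save_warning(example))
```

If the check fires, two things worth trying are setting that option to true and re-saving, or rebuilding full weights from a saved checkpoint-* directory with the zero_to_fp32.py script that DeepSpeed writes next to its checkpoints.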
