Fine-tuning toolkit for Mixtral 8x7B MoE model

#10
by hiyouga - opened

It only requires 28GB of GPU memory to fine-tune the 8x7B model with LLaMA Factory.

We adopt 4-bit quantization, LoRA adapters, and FlashAttention-2 to save GPU memory.

Try out https://github.com/hiyouga/LLaMA-Factory
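For reference, here is a minimal sketch of the same recipe (4-bit quantization, LoRA adapters, FlashAttention-2) using transformers, peft, and bitsandbytes directly. This is not the LLaMA-Factory code path itself, and the LoRA rank/alpha values are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bf16 compute (QLoRA-style setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA on the attention q/v projections, the same targets as the
# training command shared later in this thread.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=16,                        # illustrative scaling
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```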

[screenshots]

Mixtral-8x7B fine-tuned on the Alpaca dataset, preliminary results:

[screenshot]

Remarkable reasoning abilities:

[screenshot]

This sounds great! Could you kindly provide your command-line parameters and a DeepSpeed config to run it on multiple H100s?

This is great, @hiyouga. I wonder how efficient the training will be, especially with sparse models, and how issues like token dropping will be addressed.

This is great! Thanks for sharing, but I have an issue when adopting 8-bit quantization, LoRA adapters, and FlashAttention-2 with the Mixtral 8x7B MoE model.
There's an error: RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0
Could you help me?

[screenshot]

@noforit It looks like there are some issues with 8-bit quantization; we recommend using 4-bit quantization instead.

As you said, thanks. I used 4-bit instead and it works.

I saw some comments suggesting that quantization is an issue when working with Mixtral MoE.

Mixtral routes each token to a subset of experts. Quantization can distort the routing probabilities for each token, so that tokens end up being routed to only a small portion of the experts.
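To make that concrete, here is a small illustrative sketch of top-2 routing (not Mixtral's actual implementation): a softmax over the gate logits picks two experts per token, and a small perturbation of those logits, standing in for quantization error, can change which experts are selected.

```python
import torch

num_experts, hidden = 8, 16
gate = torch.nn.Linear(hidden, num_experts, bias=False)   # toy router

x = torch.randn(1, hidden)                                 # one token's hidden state
logits = gate(x)
probs = torch.softmax(logits, dim=-1)
weights, experts = torch.topk(probs, k=2, dim=-1)          # top-2 experts for this token
weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize their weights

# Perturb the logits slightly (a stand-in for quantization error):
# the top-2 set can flip to different experts.
noisy_probs = torch.softmax(logits + 0.05 * torch.randn_like(logits), dim=-1)
_, noisy_experts = torch.topk(noisy_probs, k=2, dim=-1)
print(experts.tolist(), noisy_experts.tolist())
```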

Can you share a mini-guide on the steps needed to perform the training, or share the commands and configs used? Thanks!

@giannigi

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x7B-v0.1 \
    --dataset alpaca_en \
    --template mistral \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir mixtral \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --bf16

I'm following the same setup and doing 4-bit LoRA fine-tuning on a custom dataset. I tried switching templates between Alpaca and Mistral, but my training loss diverges after 1k steps or so. Any ideas?

Reference Notebook - https://colab.research.google.com/drive/1VDa0lIfqiwm16hBlIlEaabGVTNB3dN1A?usp=sharing

cc - @hiyouga

What are the minimum compute resources required to train the model?

@hiyouga

Can you please explain why you only targeted the "q_proj,v_proj" layers?

I have come across some arguments opposing the idea of fine-tuning all linear layers/gates/routers. I would greatly appreciate it if someone could provide a more detailed explanation on this matter.

Thank you.

@aigeek0x0
We used the q_proj,v_proj modules just to estimate the minimum resource usage. It is recommended to apply LoRA adapters to all linear layers for better fitting.

@aigeek0x0
With LoRA you can choose which linear layers of any LLM to fine-tune. If you print(model) you will see all the layers; some people target only the attention layers (q, k, v, o in the case of Mistral), while others target all linear layers. I'm not exactly sure how this affects performance, but targeting fewer layers reduces the memory footprint of the PEFT model, and the adapter itself is very small either way.
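For reference, here is a minimal sketch of what targeting all linear layers could look like with peft's LoraConfig. The module names assume the Hugging Face Mixtral implementation (q_proj/k_proj/v_proj/o_proj for attention, w1/w2/w3 for the expert MLPs), the rank/alpha values are illustrative, and the router ("gate") is deliberately left out:

```python
from peft import LoraConfig

lora_all_linear = LoraConfig(
    r=16,                       # illustrative rank
    lora_alpha=32,              # illustrative scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "w1", "w2", "w3",                        # per-expert MLP projections
    ],
    task_type="CAUSAL_LM",
)
```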

It's still not clear to me whether one should also fine-tune the routers. Any resources discussing this?

I was also wondering about the need to fine-tune the routers. Intuitively it does not make much sense to fine-tune the routers together with the proj layers, because it can make training unstable: you'd be fine-tuning both the representations and the routers, and changes in one can affect the other and vice versa.

Another way to see it: if you change the expert(s) for a given token, you lose very valuable information from the base model, and changing the routing decision at a given layer, e.g. from experts (1, 3) to experts (4, 6), would have a much bigger, more sudden impact than a small, gradual update of the proj matrices at every update step.

But all of this is just speculation, and can be task-specific.
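One practical note: with a plain LoRA setup the routers stay frozen anyway, since peft freezes all base-model weights and the gate is not among the target modules. If you train more of the model and want to be explicit about it, here is a hedged sketch of freezing the routers by hand; the module name assumes the Hugging Face Mixtral implementation, and `model` is a loaded Mixtral model as in the sketch near the top of this thread:

```python
# Freeze every router so only the remaining parameters receive gradients.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name:    # Mixtral router linear layers
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```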

Is it possible to run the fine-tuning of Mixtral with LLaMA-Factory on CPU only, or on both GPU and CPU (my GPU has 16 GB of VRAM)?
Thanks a lot!
