Fine-tuning toolkit for Mixtral 8x7B MoE model

#10
by hiyouga - opened

It only requires 28GB of GPU memory to fine-tune the 8x7B model with LLaMA Factory.

We adopt 4-bit quantization, LoRA adapters, and FlashAttention-2 to save GPU memory.

Try out https://github.com/hiyouga/LLaMA-Factory
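For reference, here is a minimal sketch of the same recipe (4-bit quantization, LoRA adapters, FlashAttention-2) using transformers, peft, and bitsandbytes directly. This is not the LLaMA-Factory code path itself, and the LoRA rank/alpha values are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bf16 compute (QLoRA-style setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA on the attention q/v projections, the same targets as the
# training command shared later in this thread.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=16,                        # illustrative scaling
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```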

[screenshots]

Mixtral-8x7B fine-tuned on the Alpaca dataset, preliminary results:

[screenshot]

Remarkable reasoning abilities:

[screenshot]

This sounds great! Could you kindly provide your command-line parameters and a DeepSpeed config to run it on multiple H100s?

This is great, @hiyouga. I wonder how efficient the training will be, especially with sparse models, and how issues like token dropping will be addressed.

This is great! Thanks for sharing, but I have an issue when adopting 8-bit quantization, LoRA adapters, and FlashAttention-2 with the Mixtral 8x7B MoE model.
There's an error: RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0
Could you help me?

[screenshot]

@noforit It looks like there are some issues with 8-bit quantization; we recommend using 4-bit quantization instead.

As you said, thanks. I used 4-bit instead and it works.

I saw some comments suggesting that quantization is an issue when working with Mixtral MoE.

Mixtral routes each token to a subset of experts. Quantization can distort the routing probabilities for each token, so that tokens end up being routed to only a small portion of the experts.
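To make that concrete, here is a small illustrative sketch of top-2 routing (not Mixtral's actual implementation): a softmax over the gate logits picks two experts per token, and a small perturbation of those logits, standing in for quantization error, can change which experts are selected.

```python
import torch

num_experts, hidden = 8, 16
gate = torch.nn.Linear(hidden, num_experts, bias=False)   # toy router

x = torch.randn(1, hidden)                                 # one token's hidden state
logits = gate(x)
probs = torch.softmax(logits, dim=-1)
weights, experts = torch.topk(probs, k=2, dim=-1)          # top-2 experts for this token
weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize their weights

# Perturb the logits slightly (a stand-in for quantization error):
# the top-2 set can flip to different experts.
noisy_probs = torch.softmax(logits + 0.05 * torch.randn_like(logits), dim=-1)
_, noisy_experts = torch.topk(noisy_probs, k=2, dim=-1)
print(experts.tolist(), noisy_experts.tolist())
```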

Can you share a mini-guide on the steps needed to perform the training, or share the commands and configs used? Thanks!

@giannigi

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x7B-v0.1 \
    --dataset alpaca_en \
    --template mistral \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir mixtral \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --bf16

I'm following the same setup and doing 4-bit LoRA fine-tuning on a custom dataset. I tried switching templates between Alpaca and Mistral, but my training loss diverges after 1k steps or so. Any ideas?

Reference Notebook - https://colab.research.google.com/drive/1VDa0lIfqiwm16hBlIlEaabGVTNB3dN1A?usp=sharing

cc - @hiyouga

What are the minimum compute resources required to train the model?

@hiyouga

Can you please explain why you only targeted the "q_proj,v_proj" layers?

I have come across some arguments opposing the idea of fine-tuning all linear layers/gates/routers. I would greatly appreciate it if someone could provide a more detailed explanation on this matter.

Thank you.

@aigeek0x0
We used the q_proj,v_proj modules just to estimate the minimum resource usage. It is recommended to apply LoRA adapters to all linear layers for better fitting.

@aigeek0x0
With LoRA you can choose which linear layers of any LLM to fine-tune. If you print(model) you will see all the layers; some people target only the attention layers (q, k, v, o in the case of Mistral), while others target all linear layers. I'm not exactly sure how this affects performance, but targeting fewer layers reduces the memory footprint of the PEFT model, and the adapter itself is very small either way.
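For reference, here is a minimal sketch of what targeting all linear layers could look like with peft's LoraConfig. The module names assume the Hugging Face Mixtral implementation (q_proj/k_proj/v_proj/o_proj for attention, w1/w2/w3 for the expert MLPs), the rank/alpha values are illustrative, and the router ("gate") is deliberately left out:

```python
from peft import LoraConfig

lora_all_linear = LoraConfig(
    r=16,                       # illustrative rank
    lora_alpha=32,              # illustrative scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "w1", "w2", "w3",                        # per-expert MLP projections
    ],
    task_type="CAUSAL_LM",
)
```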

It's still not clear to me whether one should also fine-tune the routers. Any resources discussing this?

I was also wondering about the need to fine-tune the routers. Intuitively it does not make much sense to fine-tune the routers together with the proj layers, because it can make training unstable: you'd be fine-tuning both the representations and the routers, and changes in one can affect the other and vice versa.

Another way to see it: if you change the expert(s) for a given token, you lose very valuable information from the base model, and changing the routing decision at a given layer, e.g. from experts (1, 3) to experts (4, 6), would have a much bigger, more sudden impact than a small, gradual update of the proj matrices at every update step.

But all of this is just speculation, and can be task-specific.
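One practical note: with a plain LoRA setup the routers stay frozen anyway, since peft freezes all base-model weights and the gate is not among the target modules. If you train more of the model and want to be explicit about it, here is a hedged sketch of freezing the routers by hand; the module name assumes the Hugging Face Mixtral implementation, and `model` is a loaded Mixtral model as in the sketch near the top of this thread:

```python
# Freeze every router so only the remaining parameters receive gradients.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name:    # Mixtral router linear layers
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```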

Is it possible to run the fine-tuning of Mixtral with LLaMA-Factory on CPU only, or on both GPU and CPU (my GPU has 16 GB of VRAM)?
Thanks a lot!
