Finetune Mixtral 8x7B with AutoTrain

Community Article Published April 1, 2024

In this blog, I'll show you how you can fine-tune Mixtral 8x7B on your own dataset using AutoTrain. The amount of coding involved is quite small: we will be writing zero lines of code!

Since Mixtral is quite a large model, it requires fairly large hardware to finetune. For this post, we will be using Hugging Face's latest offering: Train on DGX Cloud. However, note that you can follow the same process and train on your own hardware (or other cloud providers) too! Steps to train locally or on custom hardware are also provided in this blog.

NOTE: Train on DGX Cloud is only available to enterprise customers

To finetune mixtral-8x7b-instruct on your custom dataset, you can click here and then click on the "Train" button. You will be shown a few options; select "NVIDIA DGX Cloud".

[Screenshot: selecting "NVIDIA DGX Cloud" as the training option]

Once done, an AutoTrain space will be created for you, where you can upload your data, select parameters and start training.

If running locally, all you have to do is install AutoTrain and start the app:

$ pip install -U autotrain-advanced
$ export HF_TOKEN=your_huggingface_write_token
$ autotrain app --host 127.0.0.1 --port 8080

Once done, point your browser to 127.0.0.1:8080 and you are ready to finetune locally.

If running on DGX Cloud, you will see the option to choose 8xH100 in the hardware dropdown. This dropdown will be disabled if running locally:

[Screenshot: the hardware dropdown showing the 8xH100 option]

As you can see, the AutoTrain UI offers a lot of options for different types of tasks, datasets and parameters. One can train almost any kind of model on their own dataset using AutoTrain 💥 If you are an advanced user and want to tune more parameters, all you have to do is click on "Full" under the training parameters!

The more parameters there are, the more confusing it gets for end-users, so today we are limiting ourselves to the basic parameters. 99% of the time, the basic parameters are all you need to adjust to get a model that performs amazingly well in the end 😉

Today, we are using the no_robots dataset from the Hugging Face H4 team. You can see the dataset here. This is what one of the samples looks like:

[ { "content": "Please summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. “We can’t ask restoration ecologists to plant nonnative species or to just take their best guess and throw things out there,” says Rinkert.", "role": "user" }, { "content": "Scientists are studying nests hoping to learn about transitional habitats that could help restore the shoreline of San Francisco Bay.", "role": "assistant" } ]

This dataset is pretty much the standard for SFT training. If you want to train your own conversational bot on your custom dataset, this would be the format to follow! 🤗 Thanks to the H4 team!
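
If you want to bring your own data, here is a minimal sketch (not part of the original no-code workflow) of building a dataset in the same "messages" format with the datasets library and pushing it to the Hub. The repo id below is a placeholder:

from datasets import Dataset

# Each sample holds a "messages" column: a list of role/content turns,
# just like no_robots. These example conversations are purely illustrative.
samples = [
    {
        "messages": [
            {"role": "user", "content": "What is AutoTrain?"},
            {"role": "assistant", "content": "AutoTrain is Hugging Face's no-code model training tool."},
        ]
    },
    # ... add more conversations in the same shape
]

ds = Dataset.from_list(samples)
# Requires an HF token with write access; the repo id is a placeholder.
ds.push_to_hub("your-username/my-sft-dataset", private=True)

Once the dataset is on the Hub, you can point AutoTrain at it exactly the same way we point it at no_robots below.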

Now we have the dataset and the AutoTrain UI up and running. All we need to do is point the UI to the dataset, adjust the parameters and click the "Start" button. Here's a view of the UI right before starting the training:

[Screenshot: the AutoTrain UI configured and ready to start training]

We chose the Hugging Face Hub dataset option and changed the following:

  • dataset name: HuggingFaceH4/no_robots
  • train split: train_sft, which is how the split is named in this specific dataset
  • column mapping: {"text": "messages"}. This maps AutoTrain's text column to the corresponding column in the dataset, which is messages in our case (see the quick check below)
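
To sanity-check the split name and the column you are mapping before launching a job, a quick check with the datasets library (my own addition, not required by AutoTrain) could look like this:

from datasets import load_dataset

# Load the exact split used in the UI and confirm the column we mapped
ds = load_dataset("HuggingFaceH4/no_robots", split="train_sft")
print(ds.column_names)       # should include "messages"
print(ds[0]["messages"][0])  # first turn of the first conversation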

For parameters, the following worked well:

{
  "block_size": 1024,
  "model_max_length": 2048,
  "mixed_precision": "bf16",
  "lr": 0.00003,
  "epochs": 3,
  "batch_size": 2,
  "gradient_accumulation": 4,
  "optimizer": "adamw_bnb_8bit",
  "scheduler": "linear",
  "chat_template": "zephyr",
  "target_modules": "all-linear",
  "peft": false
}

Here we are using the adamw_bnb_8bit optimizer and the zephyr chat template. Depending on your dataset, you can use zephyr, chatml or the tokenizer's own chat template. Or you can set it to none and format the data the way you like before uploading to AutoTrain: the possibilities are endless.
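
If you do set the chat template to none, you are responsible for flattening each conversation into a single text string yourself. A minimal sketch of doing that with a tokenizer's built-in template (an illustration of one possible approach, not part of the UI flow above) might be:

from transformers import AutoTokenizer

# The tokenizer choice here is an assumption for illustration purposes.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Summarize this article for me, please."},
    {"role": "assistant", "content": "Sure, here is a short summary."},
]

# Render the conversation into a single training string using the model's own template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)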

Note that we are not using quantization for this specific model, and PEFT has been disabled 💥

Once done, click on the "Start" button, grab a coffee and relax.

When I tried this with the same parameters and dataset on 8xH100, the training took ~45 minutes (3 epochs) and my model was pushed to the Hub as a private model for me to try instantly 🚀 If you want, you can take a look at the trained model here.

Great! So, we finetuned the Mixtral 8x7B Instruct model on our own custom dataset, and the model is ready to be deployed using Hugging Face's Inference Endpoints.
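
Once the model is deployed on an Inference Endpoint, a minimal sketch of querying it with the huggingface_hub client (the endpoint URL below is a placeholder for your own deployment) could look like this:

from huggingface_hub import InferenceClient

# Replace the placeholder URL with your own Inference Endpoint
client = InferenceClient("https://your-endpoint-url.endpoints.huggingface.cloud")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of AutoTrain."}],
    max_tokens=128,
)
print(response.choices[0].message.content)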

BONUS: if you prefer the CLI, here is the command to run:

autotrain llm \
--train \
--trainer sft \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--data-path HuggingFaceH4/no_robots \
--train-split train_sft \
--text-column messages \
--chat-template zephyr \
--mixed-precision bf16 \
--lr 2e-5 \
--optimizer adamw_bnb_8bit \
--scheduler linear \
--batch-size 2 \
--epochs 3 \
--gradient-accumulation 4 \
--block-size 1024 \
--max-length 2048 \
--padding right \
--project-name autotrain-xva0j-mixtral8x7b \
--username abhishek \
--push-to-hub

In case of questions, reach out at autotrain@hf.co or on Twitter: @abhi1thakur