How to format a custom dataset to fine-tune Mixtral with the TRL SFT script?

#132
by icpro - opened

I am trying to fine-tune Mixtral on a custom dataset.

I adapted https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py to my use case.

Here is how I formatted my dataset. Each row of my CSV has a text column which looks like this:

[INST] User text hello [/INST] Model answer hi

Is this the right format, and if not, how should I change it? I wonder whether I should add EOS/BOS tokens.
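To make the question concrete, here is a minimal sketch of the variant with explicit tokens. The strings `<s>` and `</s>` are assumed to be the BOS/EOS tokens and should be checked against `tokenizer.bos_token` and `tokenizer.eos_token`; `format_example` and its `add_bos` flag are hypothetical names used only for illustration. BOS is off by default because the tokenizer may already prepend it at tokenization time, which would double it.

```python
# Assumed special-token strings; verify against tokenizer.bos_token and
# tokenizer.eos_token for the checkpoint actually being fine-tuned.
BOS = "<s>"
EOS = "</s>"


def format_example(user_text: str, answer: str, add_bos: bool = False) -> str:
    """Build one single-turn training string in the [INST] ... [/INST] layout.

    EOS is always appended so the model can learn where the answer ends.
    BOS is only prepended when add_bos is True, since the tokenizer may
    already add it on its own.
    """
    prompt = f"[INST] {user_text.strip()} [/INST]"
    text = f"{prompt} {answer.strip()}{EOS}"
    return f"{BOS}{text}" if add_bos else text


print(format_example("User text hello", "Model answer hi"))
# [INST] User text hello [/INST] Model answer hi</s>
```

The resulting strings would replace the current contents of the text column.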

What also confuses me is that in this article: https://huggingface.co/blog/mixtral#fine-tuning-with-%F0%9F%A4%97-trl, the example dataset is https://huggingface.co/datasets/trl-lib/ultrachat_200k_chatml, which is in ChatML format rather than the [INST] format. Is there a mistake in the article?
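For comparison, this is roughly how the same exchange looks when written as ChatML text. It is only a sketch of the commonly used ChatML layout with `<|im_start|>` and `<|im_end|>` markers; the helper `to_chatml` is a hypothetical name, and the linked dataset may store its conversations differently (for example as a list of messages rather than a single string).

```python
def to_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as ChatML text.

    Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers,
    which is the usual ChatML convention (assumed here, not taken
    from the dataset itself).
    """
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


conversation = [
    {"role": "user", "content": "User text hello"},
    {"role": "assistant", "content": "Model answer hi"},
]
print(to_chatml(conversation))
```

The two layouts mark turns with different tokens, which is why I am unsure which one the SFT script expects for Mixtral.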

Another question: is a test dataset needed when fine-tuning a language model like Mixtral?
