How was this fine-tuned?

#8
by Locutusque - opened

What code was used to fine-tune this? What was the learning rate? The performance is amazing.

Thank you, @Locutusque!

This one is a bit of a mutant. I tried everything I wanted to test about fine-tuning LLMs on it, and I also restarted the training several times from checkpoints, so I can't say it was trained with one specific set of parameters.
I used TinyMistral-248M-SFT-v3 as the base (which is itself based on TinyMistral-248M@90b89d18fdf27937dc04ab8a9b543c5af2991c7f).
The learning rates I tested ranged from 5e-4 to 7e-6, but I feel the sweet spot for this model is learning_rate=2e-5.
I also tried several batch sizes, from 1 to 512, but the final training run used an effective batch size of 32 for several hours (per_device_train_batch_size=2 & gradient_accumulation_steps=16).
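
Roughly, those values map onto `TrainingArguments` like the sketch below (the output directory is just a placeholder; only the numbers come from what I described above):

```python
from transformers import TrainingArguments

# Minimal sketch: output_dir is a placeholder, the numbers are the ones mentioned above.
training_args = TrainingArguments(
    output_dir="./tinymistral-sft",   # placeholder path
    learning_rate=2e-5,               # the "sweet spot" learning rate
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # effective batch size: 2 * 16 = 32
)
```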
I'm training the full model in float32 using only SFTTrainer. Note: TinyMistral-248M-SFT-v3 was trained with AutoTrain Advanced, but I dropped it for v4 because I wanted to make use of NEFTune and it didn't support that yet.
I set neftune_noise_alpha=5, as recommended on the SFTTrainer page, and didn't tweak it. Fortunately, it seems to have helped the learning.
I used max_seq_length=2048 this time, which seems to have given it a boost (v3 was trained with 1024).
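
The SFTTrainer side looked roughly like the sketch below. This assumes a trl version where `max_seq_length` and `neftune_noise_alpha` are still constructor arguments; the repo id, dataset name, and `dataset_text_field` column are illustrative, not exact:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# Assumed repo id for the base model mentioned above.
base = "Felladrin/TinyMistral-248M-SFT-v3"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float32)  # full model in fp32
tokenizer = AutoTokenizer.from_pretrained(base)

# Placeholder dataset; any dataset with a text column works via dataset_text_field.
dataset = load_dataset("your/sft-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,         # TrainingArguments as sketched above
    train_dataset=dataset,
    dataset_text_field="text",  # assumed column name
    max_seq_length=2048,
    neftune_noise_alpha=5,      # NEFTune noise, as recommended in the trl docs
)
trainer.train()
```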
It's also worth mentioning that I used weight_decay=0.01 (it was 0 in v3) and lr_scheduler_type="cosine" (it was "constant" in v3).
I used evaluation_strategy="steps", set eval_steps to the same value as save_steps, and kept that value low so I could get progress feedback every ~10 minutes.
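
Folding those into the same `TrainingArguments` sketch (the step counts below are placeholders; the point is only that eval_steps matches save_steps and stays small):

```python
# Extending the earlier sketch; step counts are illustrative, not the real values.
training_args = TrainingArguments(
    output_dir="./tinymistral-sft",   # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    weight_decay=0.01,                # was 0 in v3
    lr_scheduler_type="cosine",       # was "constant" in v3
    evaluation_strategy="steps",
    eval_steps=250,                   # placeholder; kept equal to save_steps
    save_steps=250,                   # checkpoints also served as restart points
)
```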
Before starting each training session, I shuffled the dataset to avoid overfitting on the first rows (due to the constant restarts).
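
That part is just the usual `datasets` call, e.g.:

```python
# Reshuffle before each restart so training doesn't keep starting on the same rows.
dataset = dataset.shuffle(seed=42)  # the seed here is arbitrary
```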

I believe that sums up how it was trained!

Locutusque changed discussion status to closed
