Training guide/requirements

#1
by Muhammadreza - opened

Well since your model is based on Mistral, I have a few questions:

  1. Have you taken the 7B model and trimmed the fat to get this? Or is this a model coded from scratch?
  2. What steps are necessary in order to train this model?
  3. What hardware do you recommend for training the model?

Thanks.

Hi,
Thank you for your interest.
I basically trimmed most of the layers from Mistral to get this model.
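In case it helps, here's a rough sketch of what that kind of layer trimming can look like with the transformers API (the checkpoint name and layer count below are just illustrative, not the exact recipe used for this model):

```python
# Rough sketch: keep only a subset of Mistral's 32 decoder layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(src)

keep = 8  # illustrative: number of decoder layers to keep
model.model.layers = torch.nn.ModuleList(model.model.layers[:keep])
model.config.num_hidden_layers = keep

model.save_pretrained("mistral-trimmed-8L")
tokenizer.save_pretrained("mistral-trimmed-8L")
```

The trimmed model will need further training before it produces coherent text.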
If you'd like to train a completely new model, I'd look into LLM Shearing. If you'd like to finetune this model, I'd recommend using Axolotl and perhaps The Pile or Falcon RefinedWeb.
As for the hardware, this can fit on many GPUs since it's really small. You could use a 3090 or 4090, or H100s for faster training.
However, if you're looking for a good pre-trained model that does not require any further finetuning, I'd recommend TinyLlama.
I hope this helps!

Thanks @mrfakename! For now, I'm more interested in trimming the model. Is there any document or guide on that?
We have our own 7B model, based on Mistral and trained for our own language. I guess it would be nice if we could make it smaller as well!

Hi,
Unfortunately there’s no good way to trim a pre-trained model (as far as I know). Personally, my best guess would be to use LLM Shearing; however, this method requires further pretraining.
However, if you just want a model that can run on weaker hardware, I’d recommend quantization using bitsandbytes or llama.cpp.
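For example, a minimal 4-bit load with bitsandbytes might look something like this (the model name is just a placeholder for your own checkpoint):

```python
# Rough sketch: load a model in 4-bit with bitsandbytes instead of trimming it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "mistralai/Mistral-7B-v0.1"  # placeholder: use your own 7B checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(name)
```

This keeps all the layers but stores the weights in 4-bit, which cuts memory use without any retraining.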

I have also created trimmed models, but they are not great models on their own. They can be made usable with training: e.g. a small dataset of 25 examples, trained for 100 epochs until it overfits, then a large text-generation corpus to get some form of understandable output.

When trimming models, it's best to take the last layers of the model (important), as by that point most of the weights are in a learned position, so feeding a good input at this stage should also make for easy training. It's still not advised, though, as we need to instantiate the model from the config, and a true model would need all 32 layers.
We need to make the size of the layers smaller instead of pruning the model. Then a pre-existing LoRA can be applied to the fresh model, giving the new model a warm start: not trained at all, but open for its FIRST train, which is the most important one. Hence the small dataset overfit to the data; the fine-tuning that follows will be by LoRA etc. Its base layers will then be fit to the shape of the original task prompt (i.e. instruct, or question & answer). Most datasets basically provide the Q&A with an extra prompt attached explaining the task, so it should be the same task only and not multiple different prompts; for Q&A the prompt should simply be something like "answer the question" (without extra flourishes). After it is satisfactorily tuned (producing nice language-model outputs), it's ready for the more varied tuning datasets: general tasks, chat, code, etc.
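As a rough illustration of the LoRA fine-tuning step mentioned above (the checkpoint name, target modules, and hyperparameters here are just typical placeholders, not a tested recipe):

```python
# Rough sketch: attach a fresh LoRA adapter to a trimmed model for its fine-tuning passes.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistral-trimmed-8L")  # hypothetical trimmed checkpoint

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```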

Since transformers are for predicting the next word, the model's first task should be text generation only, as it builds the probability matrices for prediction based on the context (sliding window). Once those are producing good sentences, it's ready for learning.

I used mergekit and chose only 16 layers for a 3B model.
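For anyone who wants to try the same thing, a mergekit "passthrough" slice that keeps the last 16 layers might look roughly like this (the model name and layer range are illustrative; run the resulting file with `mergekit-yaml trim-config.yml ./out-dir`):

```python
# Rough sketch: write a mergekit passthrough config that keeps only layers 16-32.
import yaml

config = {
    "merge_method": "passthrough",
    "dtype": "bfloat16",
    "slices": [
        {"sources": [{"model": "mistralai/Mistral-7B-v0.1", "layer_range": [16, 32]}]},
    ],
}
with open("trim-config.yml", "w") as f:
    yaml.safe_dump(config, f)
```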

Cool! There's also a novel method implemented here that will automatically delete the least important layers (not used here; this method was released very recently!)
