How was this made?

#1 opened by Jcuhfehl

Hi, I wanted to know how this model was created. Is it a sheared/pruned version of Mistral 7B, or is it a from-scratch model that shares the same architecture?

Hey there!
This model is trained from scratch with the same architecture as Mistral. I've done this as an attempt to demonstrate that trillion-scale datasets are not absolutely necessary to pretrain language models, and that, as a result, such models can be trained on a single GPU.

I'm interested: what GPU was this trained on, and how long did it take?

I'm also interested. Cool model!

I actually have not finished training it yet, and I'm estimating around 5 more training days/sessions.
I've been using a single Titan V (which is essentially a budget V100, if you're not familiar with it).

So far it has taken about 48-72 GPU hours to train, which, in comparison to other models, is very good.

Thanks, do you think it would be possible to do it on an RTX 4080? Or is it too weak?

Also interested in what framework you're using to train it.

The RTX 4080 should for sure be enough for it; you may even be able to train a model that is significantly better than this one with it.

I'm using PyTorch to train this model.

Any specific library, or training code you wrote yourself?

I only use wandb, transformers (to load the model and pick up where I left off), and maybe a few others for utils. I write my own training and evaluation loop.
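To make that concrete, here is a minimal sketch of what such a loop can look like; the dataset, learning rate, and wandb project name below are illustrative placeholders, not the actual setup.

```python
import torch
import wandb
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Loading the existing checkpoint with from_pretrained is how training resumes where it left off.
model = AutoModelForCausalLM.from_pretrained("Locutusque/TinyMistral-248M").to(device)
tokenizer = AutoTokenizer.from_pretrained("Locutusque/TinyMistral-248M")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # learning rate is a placeholder

wandb.init(project="tinymistral-pretraining", mode="offline")  # hypothetical project name

texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in for a real pretraining corpus
model.train()
for step, text in enumerate(texts):
    batch = tokenizer(text, return_tensors="pt").to(device)
    # For causal LM pretraining, the inputs double as the labels (shifted inside the model).
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    wandb.log({"loss": outputs.loss.item(), "step": step})

model.save_pretrained("tinymistral-checkpoint")  # reload this later to continue training
```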

Would you please consider open sourcing the code? I want to mess with it :D

I'll consider it; I don't see why I shouldn't, other than the fact that it's super messy and probably difficult to read.

I'll open source it, and I'll also make sure to clean it up a bit. You probably won't be able to see it until Wednesday or Thursday, because I'm not home at the moment (I don't have the code uploaded anywhere, only on an external drive).

Thank you ❀ I'll be patient, take your time

Yeah, of course! I'll reply to this letting you know once I've uploaded it to this repository, and I'll also create a GitHub repo.

It's your lucky day; it turns out I actually had the training script lying in the depths of GitHub.
https://github.com/Locutusque/TinyMistral-train-eval
You can find the notebooks in that GitHub repo. Please create an issue if you find any problems with it.

Hello, I am very curious about the results of your experiment. Could you share more details about the efficiency and performance of this approach?
I have also looked through your code on GitHub (https://github.com/Locutusque/TinyMistral-train-eval/blob/main/locutusque-s-train-eval.ipynb), which, as I understand it, is designed to create a base model for subsequent fine-tuning. I noticed that the Mistral-7B model uses attention mechanisms such as grouped-query attention (GQA) and sliding window attention (SWA). As I am just beginning my journey with language models and primarily have theoretical knowledge (not much of it so far), I find these techniques very interesting. Unfortunately, I was unable to locate where these mechanisms (GQA, SWA) used in Mistral-7B are defined in your code. Could you point out where they can be found? I would greatly appreciate your help.

Hello,

I am not the individual who originally coded the attention mechanisms. Instead, I am developing a new base model that mirrors the architecture of Mistral-7B. The corresponding code is accessible in the Hugging Face Transformers GitHub repository at https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py. Within this code, you should find two distinct attention mechanisms; however, I may have the names slightly incorrect. I believe they are something like "MistralAttention" and "MistralFlashAttention."

Regarding efficiency and performance, I am not yet able to draw a valid conclusion; the evaluation you see on the model card is outdated. That evaluation was based on the 2-million-example model. The current one has not been evaluated yet, and I am hopeful that the performance has improved. I do invite you to do some human-based evaluations to help reach a conclusion, though! (You can make a pull request if you decide to do this.)
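As a rough illustration, both mechanisms are controlled through fields of MistralConfig rather than through code you write yourself; the attention classes in modeling_mistral.py read these values. The numbers below are illustrative only:

```python
from transformers import MistralConfig

# GQA and SWA are switched on via config fields; illustrative values, not a real model's settings.
config = MistralConfig(
    num_attention_heads=32,
    num_key_value_heads=8,   # fewer KV heads than query heads => grouped-query attention
    sliding_window=4096,     # each token attends to at most this many previous tokens
)
print(config.num_key_value_heads, config.sliding_window)
```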
I hope this helps!

Thank you for the link. I didn't know where to find the code for the Mistral model's mechanisms; it clarified a lot for me. I have another question about the code (https://github.com/Locutusque/TinyMistral-train-eval/blob/main/locutusque-s-train-eval.ipynb). In the main function, the model is loaded using AutoModelForCausalLM.from_pretrained(args.model), where args.model is set to "Locutusque/TinyMistral-248M".
Does this from_pretrained method retrieve a pretrained model from the Hugging Face repository? After the model is loaded, the code adapts it to a specific dataset and task. I'm not sure if I understand correctly (if not, please correct me), but it seems that the code does not create a new model architecture from scratch (with a randomly initialized set of weights), but instead relies on the pretrained model from your Hugging Face repository, which I believe includes a file named model.safetensors. This file is loaded before the training process, and the file in your repository is not empty; it is about 900 MB. Is there a way to initialize this model with random weights so that it starts 'clean'?

Yes and no. The model architecture is instantiated from scratch, but the file you mentioned is a state dictionary that contains all of the weights and biases of the pretrained model; from_pretrained then applies model.safetensors to the weights and biases of the randomly initialized model.

To load a model with random weights, you can choose a model class, such as MistralForCausalLM, and the corresponding config class (in this case it would be MistralConfig). Here's an example:

```python
from transformers import MistralConfig, MistralForCausalLM

config = MistralConfig(...)  # define the model hyperparameters here
model = MistralForCausalLM(config)  # instantiate the model with randomly initialized weights
```
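For a fuller, purely illustrative version that also prints the parameter count (the hyperparameters below are made up, not TinyMistral-248M's actual configuration, which lives in this repository's config.json):

```python
from transformers import MistralConfig, MistralForCausalLM

# Illustrative hyperparameters only -- not TinyMistral-248M's actual settings.
config = MistralConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,
    max_position_embeddings=2048,
)

model = MistralForCausalLM(config)  # randomly initialized; no state dict is loaded
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```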

How long did the model training process take? Could you share that information?

The training process took about 5-6 days of effective training time.

What are your thoughts on fine-tuning this model on an RTX 3060 with 12 GB of VRAM? I don't plan to run it all day. Can I run it in batches? I mean, how much time does a single epoch take? Maybe I could do one epoch a day?

You can certainly do that. Although the RTX 3060 has less compute power than the Titan V, you can indeed train in the batches you described.
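As a sketch of how those daily sessions could be organized with the Trainer API (the dataset, batch size, and other settings below are placeholders, not a tested recipe for the RTX 3060):

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Locutusque/TinyMistral-248M"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# Stand-in corpus; replace with your actual fine-tuning data.
raw = Dataset.from_dict({"text": ["Example fine-tuning text."] * 64})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="tinymistral-finetune",
    per_device_train_batch_size=4,   # small per-device batches to stay within 12 GB of VRAM
    gradient_accumulation_steps=8,   # larger effective batch without extra memory
    num_train_epochs=1,              # one epoch per daily session
    save_steps=500,                  # frequent checkpoints so a session can stop any time
    fp16=True,                       # mixed precision to save GPU memory
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()  # on later days: trainer.train(resume_from_checkpoint=True)
```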

How many epochs was the Titan V able to get through in a day? I know it depends on the data you're training on, but I just want a rough figure. Also, how do you rate the model's performance? For a 248M-parameter model, how good is it?

It depends on the number of examples I was using. On average, it would do around 200k steps per day, which equates to around 1 million examples.
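(Assuming those are per-optimizer-step figures, that works out to an effective batch size of roughly five examples per step.)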

For the number of parameters it has, I consider it to have good performance, especially given the amount of training data it was trained on.

Cool, that's really good to know. I'll try it out. What would be a good starting point for getting into fine-tuning in general? Any suggestions?

You could consider fine-tuning it on a dataset like Alpaca or OpenPlatypus with LoRA. If you want to do more intensive fine-tuning, you could fine-tune all of the model's parameters on a more difficult dataset like Orca, or maybe even my InstructMix datasets (7-14 million examples, depending on which one you use).
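As a rough sketch of attaching LoRA adapters with the peft library (the rank, target modules, and other settings are illustrative guesses, not a recommended recipe):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Locutusque/TinyMistral-248M")

lora_config = LoraConfig(
    r=16,                        # adapter rank; a guess, tune for your setup
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Mistral attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices will be trained
```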

Sure, sure, I'll try it out with LoRA first. Thanks for the suggestions!
