Pretrain?

#125
by limha - opened

Hi, I'm just using Mistral and read the paper for the first time a few days ago, and I'm sad that I read this great paper so late.

I have a question about Mistral-7B. I know there is very little difference between the Mistral architecture and the Llama-2 architecture, and as far as I can tell the paper only describes three techniques, all aimed at inference and memory optimization (1. sliding window attention (SWA), 2. rolling buffer cache, 3. pre-fill and chunking). The paper says nothing about pre-training.
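
For reference, here is a minimal sketch of the first two of those techniques, since they are the ones that change how attention and the KV cache behave. This is my own toy illustration in NumPy, not Mistral's code; the window size `W` is a toy value (the paper uses a 4096-token window).

```python
import numpy as np

W = 4  # sliding window size; Mistral-7B uses 4096, 4 keeps the demo readable


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where position i may only attend to positions
    max(0, i - window + 1) .. i instead of to every earlier position."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


def rolling_cache_slot(position: int, window: int) -> int:
    """With a rolling buffer cache, the keys/values for the token at
    `position` are written to slot position % window, overwriting an entry
    that has fallen outside the attention window, so cache memory stays
    fixed at `window` entries."""
    return position % window


if __name__ == "__main__":
    print(sliding_window_mask(6, W).astype(int))
    # Position 5 reuses the slot that held position 1 (5 % 4 == 1),
    # which position 5 can no longer attend to anyway.
    print([rolling_cache_slot(p, W) for p in range(8)])
```

Pre-fill and chunking is then just feeding a known prompt through in window-sized chunks so this cache fills up without computing attention over the whole prompt at once.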

Therefore, I can only assume that just changing the model architecture for inference makes a big difference.
(Or did Mistral AI use the Llama-2 parameter weights?) (THIS IS ONLY MY SPECULATION)

I'm not asking about the pre-training dataset (I read in an earlier discussion that you can't share details about it); I only want to know whether there was a pre-training process and whether any other techniques were used for pre-training.

I apologize if I've been rude, and I hope someone will let me know if I've gotten anything wrong or misunderstood.
Thank you for reading my question!

I'm pretty sure they trained the whole thing from scratch, although I'll give you that it would be an interesting idea to explore only fine-tuning a souped-up Llama-2 instead of training from the beginning.

That's how Miqu (a close relative of Mistral Medium) was apparently created.

Why is Miqu proof of that...?
