What is the max. context length of Mistral-7B-Instruct-v0.2?

#43
opened by hanshupe

I couldn't find a clear answer to this: what is the maximum input / context length in tokens that Mistral-7B-Instruct-v0.2 can process?

same question here.

This version of Mistral-7B has 32k tokens of context.

In fact, according to the paper it is 8192:

| Parameter | Value |
| --- | --- |
| dim | 4096 |
| n_layers | 32 |
| head_dim | 128 |
| hidden_dim | 14336 |
| n_heads | 32 |
| n_kv_heads | 8 |
| window_size | 4096 |
| context_len | 8192 |
| vocab_size | 32000 |
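
To put the window_size and context_len figures in perspective: with sliding-window attention each layer only attends to the previous window_size tokens, but stacking layers lets information propagate further. A rough back-of-the-envelope sketch in Python, using only the numbers from the table above (the result is a theoretical upper bound on the attention span, not a usable context length):

```python
# Rough sketch of sliding-window attention reach, using the paper's 7B numbers.
window_size = 4096   # tokens each layer can look back
n_layers = 32        # stacked attention layers
context_len = 8192   # sequence length used during training

# After n_layers layers, information can in principle have propagated
# window_size * n_layers positions through the stack.
theoretical_span = window_size * n_layers
print(f"training context:           {context_len} tokens")
print(f"theoretical attention span: {theoretical_span} tokens (~131k)")
```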

That's what I found confusing as well; I thought it's a very basic question, but I didn't find a clear statement.

Among the earliest conversations in this repo, someone read the config file and saw "max_position_embeddings": 32768, misinterpreting that as meaning the context length.

However, I have seen it updated to 8k on some model leaderboards, and this really matters for self-extend, which is why I wanted to know.

Mistral 0.1 was also trained with the same 8192 context.

Okay, so 8k is the max. context length then? What is the meaning of the config file value in this case?

The terms "training context" and "max_position_embeddings" are both related to the handling of input sequences in transformer-based neural network models, particularly in the context of natural language processing (NLP). However, they refer to different aspects of sequence processing:

  1. Training Context:

    • The "training context" typically refers to the length of input sequences used during the training phase of the model. It represents the number of tokens or words considered as context by the model during the training process. The training context determines the size of the input sequences that the model is exposed to during training.
    • For example, if a model has a training context of 512 tokens, it means that during training, input sequences of up to 512 tokens in length are used to train the model's parameters (weights and biases).
  2. max_position_embeddings:

    • The "max_position_embeddings" parameter specifies the maximum length of input sequences that the model can process during both training and inference. It represents the maximum number of tokens or words that the model's architecture and implementation can handle.
    • This parameter determines the size of the positional embedding matrix, which encodes positional information for tokens in a sequence, and limits the length of sequences that the model can effectively process.
    • For example, if a model has a max_position_embeddings value of 1024, it means that it can handle input sequences of up to 1024 tokens in length during both training and inference.

In summary, while both the training context and max_position_embeddings parameters are related to handling input sequences in transformer-based models, the training context specifically refers to the length of input sequences used during training, while max_position_embeddings defines the maximum length of input sequences that the model can handle during both training and inference.
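
If it helps, here is a minimal sketch (assuming the Hugging Face transformers library) that reads the declared limit straight from this repo's config.json; the printed values are whatever the repo currently ships, so treat them as the architectural ceiling rather than the training context:

```python
from transformers import AutoConfig

# Read the architectural limits this repo declares in its config.json.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

print("max_position_embeddings:", cfg.max_position_embeddings)
print("sliding_window:", cfg.sliding_window)  # None would mean no sliding-window attention
```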

Thanks for the clarification. Then from the user perspective, the relevant number is the 32k, meaning that I can input 32k tokens into my model and expect those tokens to be considered for generating the output.

If that's your interpretation, you are welcome to go ahead with that. My experience is that performance is stretched at the 8k limit.

The way I read the above (GPT-3.5 output) is that its architecture supports fine-tuning up to 32k, but that it will work best out of the box using 8k max.

Ok got it!

Does this mean I do not need to apply interpolation to the position embeddings if I want to fine-tune Mistral with a longer context?

Okay so there is a LOT of confusion here.

You are essentially confusing v1 and v2: the one with an 8k context length, a sliding window, and a max of 32k is v1; v2 has a raw context size of 32k without a sliding window.

The paper @cognitivetech talked about is the OG paper from the base model v1.

The v2 is based on a different base model with a raw 32k context size.

So, to sum up: Instruct v1 has 32k with an 8k sliding window, and v2 has a pure, raw 32k context size.

The answer is 32k.

I hope this answered everyone's concerns :> love u
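
For anyone who wants to verify this instead of taking the thread's word for it, a small sketch (again assuming transformers; the values are whatever each repo's config.json currently contains) that compares the two Instruct releases:

```python
from transformers import AutoConfig

# Compare the declared context settings of the two Instruct releases.
for repo in ("mistralai/Mistral-7B-Instruct-v0.1",
             "mistralai/Mistral-7B-Instruct-v0.2"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo)
    print("  max_position_embeddings:", cfg.max_position_embeddings)
    print("  sliding_window:", cfg.sliding_window)

# Per this thread: v0.1 should report a 4096-token sliding window,
# while v0.2 should report sliding_window = None with 32768 positions.
```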

@pandora-s How does v2 manage to have a raw 32k context size without a sliding window? Full attention has quadratic space complexity; a 32k x 32k attention matrix would require more memory than any GPU supports. Even with FlashAttention 2, 8k seems to be the limit. How does it work?

Pretty sure it uses GQA (Grouped-Query Attention), but for you to ask this, I guess you missed a lot of things; I mean, Mixtral has been using 32k for a while, and Command R doesn't even use GQA (though Command R Plus does).
Sliding windows have slowly been forgotten and avoided, to be fair. 🤔
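
To put rough numbers on the memory question: with FlashAttention the full 32k x 32k attention matrix is never materialized (it is computed blockwise), so at long context the dominant cost is the KV cache, and that is exactly what GQA shrinks. A back-of-the-envelope sketch using the 7B dimensions from the parameter table earlier in the thread (fp16 cache assumed):

```python
# Back-of-the-envelope KV-cache size at 32k tokens, fp16 (2 bytes per value).
# Dimensions taken from the 7B parameter table earlier in the thread.
n_layers   = 32
head_dim   = 128
n_kv_heads = 8      # GQA: 8 key/value heads shared across 32 query heads
n_heads    = 32     # what a plain multi-head-attention cache would need
seq_len    = 32768
bytes_each = 2      # fp16

def kv_cache_bytes(kv_heads: int) -> int:
    # 2x for keys and values, per layer, per head, per token
    return 2 * n_layers * kv_heads * head_dim * seq_len * bytes_each

print(f"with GQA (8 KV heads):  {kv_cache_bytes(n_kv_heads) / 2**30:.0f} GiB")  # ~4 GiB
print(f"without GQA (32 heads): {kv_cache_bytes(n_heads) / 2**30:.0f} GiB")     # ~16 GiB
```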

@pandora-s Thanks! How does a 32k context size (or 128k like in Command R) work without GQA? Full attention for such long contexts would require too much memory.

It does! It takes a lot of memory; that's most likely the reason they used GQA on the Plus version. I'm sadly not aware whether they used something else, but I do know that Command R consumes a lot of memory with regard to context size.

Does anyone know the maximum text output length for this model?
