What is the max. context length of Mistral-7B-Instruct-v0.2?

#43
opened by hanshupe

I couldn't find a clear answer to this: what is the maximum input / context length in tokens that Mistral-7B-Instruct-v0.2 can process?

same question here.

This version of Mistral-7B has 32k tokens of context.

In fact, according to the paper it is 8192:

| Parameter | Value |
| --- | --- |
| dim | 4096 |
| n_layers | 32 |
| head_dim | 128 |
| hidden_dim | 14336 |
| n_heads | 32 |
| n_kv_heads | 8 |
| window_size | 4096 |
| context_len | 8192 |
| vocab_size | 32000 |
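
To put the window_size and context_len figures in perspective: with sliding-window attention each layer only attends to the previous window_size tokens, but stacking layers lets information propagate further. A rough back-of-the-envelope sketch in Python, using only the numbers from the table above (the result is a theoretical upper bound on the attention span, not a usable context length):

```python
# Rough sketch of sliding-window attention reach, using the paper's 7B numbers.
window_size = 4096   # tokens each layer can look back
n_layers = 32        # stacked attention layers
context_len = 8192   # sequence length used during training

# After n_layers layers, information can in principle have propagated
# window_size * n_layers positions through the stack.
theoretical_span = window_size * n_layers
print(f"training context:           {context_len} tokens")
print(f"theoretical attention span: {theoretical_span} tokens (~131k)")
```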

That's what I found confusing as well; I thought it's a very basic question, but I didn't find a clear statement.

Among the earliest conversations in this repo, someone read the config file and saw "max_position_embeddings": 32768, misinterpreting that as meaning the context length.

However, I have seen it updated to 8k on some model leaderboards, and this really matters for self-extend, which is why I wanted to know.

Mistral 0.1 was also trained with the same 8192 context.

Okay, so 8k is the max. context length then? What is the meaning of the config file value in this case?

The terms "training context" and "max_position_embeddings" are both related to the handling of input sequences in transformer-based neural network models, particularly in the context of natural language processing (NLP). However, they refer to different aspects of sequence processing:

  1. Training Context:

    • The "training context" typically refers to the length of input sequences used during the training phase of the model. It represents the number of tokens or words considered as context by the model during the training process. The training context determines the size of the input sequences that the model is exposed to during training.
    • For example, if a model has a training context of 512 tokens, it means that during training, input sequences of up to 512 tokens in length are used to train the model's parameters (weights and biases).
  2. max_position_embeddings:

    • The "max_position_embeddings" parameter specifies the maximum length of input sequences that the model can process during both training and inference. It represents the maximum number of tokens or words that the model's architecture and implementation can handle.
    • This parameter determines the size of the positional embedding matrix, which encodes positional information for tokens in a sequence, and limits the length of sequences that the model can effectively process.
    • For example, if a model has a max_position_embeddings value of 1024, it means that it can handle input sequences of up to 1024 tokens in length during both training and inference.

In summary, while both the training context and max_position_embeddings parameters are related to handling input sequences in transformer-based models, the training context specifically refers to the length of input sequences used during training, while max_position_embeddings defines the maximum length of input sequences that the model can handle during both training and inference.
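
If it helps, here is a minimal sketch (assuming the Hugging Face transformers library) that reads the declared limit straight from this repo's config.json; the printed values are whatever the repo currently ships, so treat them as the architectural ceiling rather than the training context:

```python
from transformers import AutoConfig

# Read the architectural limits this repo declares in its config.json.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

print("max_position_embeddings:", cfg.max_position_embeddings)
print("sliding_window:", cfg.sliding_window)  # None would mean no sliding-window attention
```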

Thanks for the clarification. Then from the user perspective, the relevant number is the 32k, meaning that I can input 32k tokens into my model and expect those tokens to be considered for generating the output.

If that's your interpretation, you are welcome to go ahead with that. My experience is that performance is stretched at the 8k limit.

The way I read the above (GPT-3.5 output) is that its architecture supports fine-tuning up to 32k, but that it will work best out of the box using 8k max.

Ok got it!

Does this mean I do not need to apply interpolation to the position embeddings if I want to fine-tune Mistral with a longer context?

Okay so there is a LOT of confusion here.

You are essentially confusing v1 and v2: the one with an 8k context length, a sliding window, and a max of 32k is v1; v2 has a raw context size of 32k without a sliding window.

The paper @cognitivetech talked about is the OG paper from the base model v1.

The v2 is based on a different base model with a raw 32k context size.

So, to sum up: Instruct v1 has 32k with an 8k sliding window, and v2 has a pure, raw 32k context size.

The answer is 32k.

I hope this answered everyone's concerns :> love u
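
For anyone who wants to verify this instead of taking the thread's word for it, a small sketch (again assuming transformers; the values are whatever each repo's config.json currently contains) that compares the two Instruct releases:

```python
from transformers import AutoConfig

# Compare the declared context settings of the two Instruct releases.
for repo in ("mistralai/Mistral-7B-Instruct-v0.1",
             "mistralai/Mistral-7B-Instruct-v0.2"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo)
    print("  max_position_embeddings:", cfg.max_position_embeddings)
    print("  sliding_window:", cfg.sliding_window)

# Per this thread: v0.1 should report a 4096-token sliding window,
# while v0.2 should report sliding_window = None with 32768 positions.
```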

@pandora-s How does v2 manage to have a raw 32k context size without a sliding window? Full attention has quadratic space complexity; a 32k x 32k attention matrix would require more memory than any GPU supports. Even with FlashAttention 2, 8k seems to be the limit. How does it work?

Pretty sure it uses GQA (Grouped-Query Attention), but for you to ask this, I guess you missed a lot of things; I mean, Mixtral has been using 32k for a while, and Command R doesn't even use GQA (though Command R Plus does).
Sliding windows have slowly been forgotten and avoided, to be fair. 🤔
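
To put rough numbers on the memory question: with FlashAttention the full 32k x 32k attention matrix is never materialized (it is computed blockwise), so at long context the dominant cost is the KV cache, and that is exactly what GQA shrinks. A back-of-the-envelope sketch using the 7B dimensions from the parameter table earlier in the thread (fp16 cache assumed):

```python
# Back-of-the-envelope KV-cache size at 32k tokens, fp16 (2 bytes per value).
# Dimensions taken from the 7B parameter table earlier in the thread.
n_layers   = 32
head_dim   = 128
n_kv_heads = 8      # GQA: 8 key/value heads shared across 32 query heads
n_heads    = 32     # what a plain multi-head-attention cache would need
seq_len    = 32768
bytes_each = 2      # fp16

def kv_cache_bytes(kv_heads: int) -> int:
    # 2x for keys and values, per layer, per head, per token
    return 2 * n_layers * kv_heads * head_dim * seq_len * bytes_each

print(f"with GQA (8 KV heads):  {kv_cache_bytes(n_kv_heads) / 2**30:.0f} GiB")  # ~4 GiB
print(f"without GQA (32 heads): {kv_cache_bytes(n_heads) / 2**30:.0f} GiB")     # ~16 GiB
```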

@pandora-s Thanks! How does a 32k context size (or 128k like in Command R) work without GQA? Full attention for such long contexts would require too much memory.

It does! It takes a lot of memory; that's most likely the reason they used GQA on the Plus version. I'm sadly not aware whether they used something else, but I do know that Command R consumes a lot of memory with regard to context size.

Does anyone know the maximum text output length for this model?
