Hidden layer size and sliding window attention length are both 4096. Is that for a reason?

#153 opened by keval-sha

Looking at the configuration of Mistral-7B-v0.1:

Model configuration: MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-v0.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  > "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  > "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.1",
  "use_cache": true,
  "vocab_size": 32000
}

The hidden_size attribute of the hidden layers and the sliding_window token length for local attention are exactly the same. Curious, why is that the case?
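For anyone who wants to see the two fields side by side, here is a minimal check using the transformers AutoConfig API (this assumes the library is installed and the model card on the Hub is reachable or cached locally):

from transformers import AutoConfig

# Load the published config for Mistral-7B-v0.1.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# hidden_size is the width of each token's representation vector inside the model.
# sliding_window is the number of past token positions each query attends to.
print(config.hidden_size)     # 4096 -> feature dimension
print(config.sliding_window)  # 4096 -> attention window length, in tokens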

Those two values aren't related; the sliding window refers to the attention context length and is related to max_position_embeddings, not to hidden_size.
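To make the distinction concrete, here is a rough sketch (not the actual Mistral implementation) of how a sliding-window causal mask is built. It depends only on the sequence length and the window size in tokens; hidden_size never appears, which is why the two 4096s are independent choices:

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the local window (i - j < window).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, seq_len)
    return (j <= i) & (i - j < window)

# Toy numbers: the mask shape is (seq_len, seq_len), regardless of hidden_size.
mask = sliding_window_causal_mask(seq_len=10, window=4)
print(mask.int())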

Yeah, that makes sense. I'm just wondering why they are both exactly 4096. Interesting architecture choices.
