
`max_position_embeddings` in config.json

#1
by IbuNai - opened

Why has the `max_position_embeddings` parameter in the configuration file been changed to 4096? Such a modification appears to render Mistral-7B's sliding window attention ineffective.

Is there a particular rationale behind this adjustment?

stabilityai/japanese-stablelm-base-gamma-7b

  "intermediate_size": 14336,
  "max_position_embeddings": 4096,
  "model_type": "mistral",

mistralai/Mistral-7B-v0.1

  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",

btw, if the sequences used during training were all 4096 tokens or fewer, then `max_position_embeddings` seems to have no effect on the model's weights after training at all: it only seeds the rotary embedding's cos/sin cache, which is extended lazily for longer sequences, as the excerpt below (and the sketch after it) shows.

modeling_mistral.py (`MistralRotaryEmbedding.forward`)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        # The cos/sin cache is initially built for max_position_embeddings
        # positions and is only rebuilt the first time a longer sequence
        # comes through, so the config value is just a starting cache size.
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        # Return the cached tables truncated to the current sequence length.
        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
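To make that concrete, here is a minimal sketch (a standalone re-implementation of the same cos/sin cache math, with `dim=128` assumed to match Mistral-7B's head size, not the library code itself) showing that the tables for the first 4096 positions are identical regardless of how large the cache is:

    import torch

    def rope_cos_sin(max_positions, dim=128, base=10000.0):
        # Same construction as MistralRotaryEmbedding._set_cos_sin_cache:
        # inverse frequencies, outer product with position indices, cos/sin.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        t = torch.arange(max_positions).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()

    cos_4k, sin_4k = rope_cos_sin(4096)
    cos_32k, sin_32k = rope_cos_sin(32768)

    # The first 4096 rows match exactly, so a training sequence of <= 4096
    # tokens sees the same positional signal under either config value.
    print(torch.equal(cos_4k, cos_32k[:4096]))  # True
    print(torch.equal(sin_4k, sin_32k[:4096]))  # True

Since the rotary embedding holds no trainable parameters, the only thing `max_position_embeddings` changes here is the initial cache size.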
