Any plans to use MQA (multi-query attention) or GQA (grouped-query attention) in the future?

#9
by NilsGraef - opened

This model uses MHA (multi-head attention, i.e. num_attention_heads == num_key_value_heads). This is unlike Llama, which uses GQA.
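For reference, this is easy to check from the model config (the repo id below is just a placeholder, assuming a transformers-style config):

```python
from transformers import AutoConfig

# Placeholder repo id -- substitute the actual model name
config = AutoConfig.from_pretrained("org/model-name")

# MHA: num_key_value_heads == num_attention_heads
# GQA: num_key_value_heads <  num_attention_heads
# MQA: num_key_value_heads == 1
print(config.num_attention_heads, config.num_key_value_heads)
```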

The drawback of MHA is that the KV cache becomes very large, because its size scales linearly with num_key_value_heads.
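As a rough back-of-the-envelope sketch (fp16 cache, ignoring quantization or paging; the layer/head numbers are just illustrative):

```python
def kv_cache_bytes(num_layers, num_key_value_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size: factor of 2 for K and V, fp16 by default."""
    return (2 * num_layers * num_key_value_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Example: 32 layers, 128-dim heads, 8k context
mha = kv_cache_bytes(32, 32, 128, 8192)  # 32 KV heads (MHA)
gqa = kv_cache_bytes(32, 8, 128, 8192)   # 8 KV heads (GQA) -> 4x smaller cache
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")
```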

Therefore, do you have any plans to use MQA or GQA for future model releases? Thanks!
