Can you give a short explanation about the benefits and the architecture?

#7
opened by SicariusSicariiStuff

I've read your blog post; can you elaborate on the advantages of this vs flash attention?
It's true that attention in transformers is inherently quadratic, but since we have flash attention, isn't that issue solved, while we can still have an arbitrarily large model?
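
For context, here is the rough scaling picture I have in mind (a back-of-the-envelope sketch in Python; all sizes are illustrative assumptions, not Jamba's actual configuration). Flash attention avoids materializing the L x L score matrix, but per-layer attention compute still grows quadratically with context and the KV cache grows linearly, whereas a Mamba layer's compute is linear and its inference-time state is a fixed size:

```python
# Back-of-the-envelope scaling per layer (illustrative sizes, not the real config).

def attention_layer(seq_len, d_model=4096, bytes_per_val=2):
    # QK^T and the attention-weighted sum over V are each ~2*L*L*d FLOPs,
    # so compute stays quadratic in sequence length even with flash attention.
    flops = 4 * seq_len**2 * d_model
    # The KV cache stores K and V for every past token: linear in sequence length.
    kv_cache_bytes = 2 * seq_len * d_model * bytes_per_val
    return flops, kv_cache_bytes

def mamba_layer(seq_len, d_model=4096, d_state=16, bytes_per_val=2):
    # A selective-scan layer touches each token once: compute is linear in length
    # (very rough estimate, ignoring the projections).
    flops = 2 * seq_len * d_model * d_state
    # The recurrent state at inference time has a fixed size, independent of length.
    state_bytes = d_model * d_state * bytes_per_val
    return flops, state_bytes

for L in (4_096, 32_768, 262_144):
    a_flops, kv = attention_layer(L)
    m_flops, st = mamba_layer(L)
    print(f"L={L:>7}  attention: {a_flops:.1e} FLOPs, KV cache {kv / 2**20:,.0f} MiB"
          f"  |  mamba: {m_flops:.1e} FLOPs, state {st / 2**20:.2f} MiB")
```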

Another question that comes to mind: is it fair to assume the hybrid design is what allowed pushing model size and coherency a "little bit more" by sort of diluting the Mamba elements, but that there is still a size limit to the model for various reasons, one of them being coherency?

Also, I failed to find any mention of what kind of data (code? language? which languages? how many tokens?) the model was trained on.
For the community to make better use of the model, we need to know what we are working with :)

+1
The biggest problem is the transformer layers in the architecture!
Even though the Mamba layers are linear in sequence length, the transformer layers with quadratic attention still require quadratic compute (and a KV cache that keeps growing) in both training and inference; rough arithmetic below.
Is Mamba alone not effective enough to capture long-context retrieval?
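
To put hypothetical numbers on the KV-cache side (all figures below are assumed for illustration; the grouped-query head count, layer counts, and the roughly one-attention-layer-per-eight ratio are my reading of the paper, not official numbers): keeping only a few attention layers shrinks the cache a lot at long context, even though those layers stay quadratic in compute.

```python
# Hypothetical comparison of total KV-cache size at long context
# (all numbers assumed for illustration, not the released model's exact config).

def kv_cache_gib(n_attention_layers, seq_len, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # Each attention layer caches K and V: 2 * seq_len * n_kv_heads * head_dim values.
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_val
    return n_attention_layers * per_layer / 2**30

seq_len = 256_000
pure_transformer = kv_cache_gib(n_attention_layers=32, seq_len=seq_len)
hybrid = kv_cache_gib(n_attention_layers=4, seq_len=seq_len)  # e.g. ~1 attention layer per 8
print(f"pure transformer (32 attn layers): ~{pure_transformer:.1f} GiB KV cache")
print(f"hybrid           ( 4 attn layers): ~{hybrid:.1f} GiB KV cache")
```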

AI21 org

Hey @SicariusSicariiStuff,

  1. We believe that, just as flash attention improved performance for attention, the same will happen with Mamba computations.
  2. There are more details about the architecture in the paper we released, see if that sheds more light: https://arxiv.org/abs/2403.19887 (there is also a rough layout sketch below).
  3. We trained on a combination of proprietary and public data, including code. Officially the model supports English, Spanish, Portuguese & French, but we saw that it picked up a few other languages pretty well too, and it can easily be extended to support more languages with little training.
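
For anyone who wants a quick picture before reading the paper, here is a minimal sketch of the kind of interleaving it describes (the block size, attention-to-Mamba ratio, and MoE period below reflect one reading of the paper and are simplified for illustration; the paper is the authoritative reference):

```python
# Minimal sketch of a hybrid layer schedule: mostly Mamba layers, with an
# attention layer mixed in periodically and MoE replacing the MLP on alternate
# layers. The exact placement in the released model may differ; see the paper.

def hybrid_schedule(n_blocks=4, layers_per_block=8, attn_every=8, moe_every=2):
    schedule = []
    for i in range(n_blocks * layers_per_block):
        mixer = "attention" if i % attn_every == 0 else "mamba"  # ~1 attention layer per 8
        mlp = "moe" if i % moe_every == 1 else "mlp"             # MoE on alternate layers
        schedule.append(f"{mixer}+{mlp}")
    return schedule

for idx, layer in enumerate(hybrid_schedule(n_blocks=1)):
    print(f"layer {idx:2d}: {layer}")
```

Printing one block makes the trade-off visible: most of the stack is linear in sequence length, while the occasional attention layer keeps full cross-token mixing without the KV cache of an all-attention model.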
