Smaller version to ease implementation experiments?

#12 · opened by compilade

Hi. I've worked on implementing Mamba support in llama.cpp before (see https://github.com/ggerganov/llama.cpp/pull/5328), and I'd like to eventually implement support for Jamba too.

However, for my hardware, this model is too big for quick experimentation, so I'd really appreciate it if you'd also release a smaller model with the same architecture. It doesn't need to be good (though some coherency is preferred). Ideally a Jamba model with less than 1B parameters would help a lot with this, if possible.

I second this. Loading the weights takes a really long time. A lighter version (maybe pruned?), even if the end result isn't effective at all, would be great for quick testing iterations.

I trained a Jamba architecture model with some code data. It's very small and has some basic code generation capabilities. Might be useful for this.
https://huggingface.co/TechxGenus/Mini-Jamba

Nice! Unfortunately, there seem to be no Mamba+MoE layers in your model; I only see Mamba+MLP layers alternating with Attention+MoE layers. The attn_layer_offset and attn_layer_period keys in config.json differ from those of the official Jamba-v0.1 model, which I guess might have caused this at training time?
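
For reference, here's a minimal sketch of how I read the layer layout from those config keys. It assumes the usual convention in the Hugging Face Jamba modeling code, where layer i uses attention when `i % attn_layer_period == attn_layer_offset` and MoE when `i % expert_layer_period == expert_layer_offset`; the Jamba-v0.1 values below are from memory of its config.json, so double-check them:

```python
# Sketch of how the layer pattern is (presumably) derived from the config keys.
def layer_types(num_hidden_layers, attn_layer_period, attn_layer_offset,
                expert_layer_period, expert_layer_offset):
    types = []
    for i in range(num_hidden_layers):
        mixer = "attention" if i % attn_layer_period == attn_layer_offset else "mamba"
        ffn = "MoE" if i % expert_layer_period == expert_layer_offset else "MLP"
        types.append(f"{mixer}+{ffn}")
    return types

# Jamba-v0.1 (as I remember its config.json): 32 layers,
# attention every 8th layer (offset 4), MoE every 2nd layer (offset 1).
# With these values, Mamba+MoE layers do appear.
print(layer_types(32, 8, 4, 2, 1))
```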

Ah, this is because I set expert_layer_offset and expert_layer_period to the same values as attn_layer_offset and attn_layer_period. When making this version, I wanted to first test the results of using MoE only in the attention layers.

I will make a new version with Mamba+MoE, Mamba+MLP, Attention+MoE, Attention+MLP at the same time later.
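
Using the layer_types sketch above, one hypothetical set of small-model config values that yields all four block types at once (just an illustration, not the actual Mini-Jamba config):

```python
# Hypothetical values: 8 layers, attention every 3rd layer (offset 1),
# MoE every 2nd layer (offset 1).
print(layer_types(8, 3, 1, 2, 1))
# ['mamba+MLP', 'attention+MoE', 'mamba+MLP', 'mamba+MoE',
#  'attention+MLP', 'mamba+MoE', 'mamba+MLP', 'attention+MoE']
```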

AI21 org

Hi, we uploaded this version for debugging and development purposes (random weights, no training whatsoever)
https://huggingface.co/ai21labs/Jamba-tiny-random
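
For anyone who wants to smoke-test against it, something like this should be enough (a minimal sketch; it assumes a transformers version with Jamba support, or trust_remote_code=True on older ones):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-tiny-random"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The weights are random, so the generated text is meaningless; the point is
# just to exercise the forward pass and generation loop quickly.
inputs = tokenizer("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0]))
```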
