Awesome work!
This is really great work, thanks for taking it on! I haven't fully tested it yet but it seems like a super promising avenue for fine-tuning and experimenting.
I was wondering if you'd be willing to share the modeling architecture script you created for this smaller Jamba. I am incredibly fascinated by the new SSM-Transformer but don't have the deep knowledge yet (still learning) to make one myself. It'd be awesome to see how you figured it out. No worries if you want to keep it private though, figured it was worth an ask : )
Hey @Severian, thank you, I do appreciate it!
I can definitely share the script I used to prune the model! Not a problem at all. Don't have immediate cluster access at the moment, so give me a bit and I'll let you know once it's been uploaded to this repo.
Do be aware that the Jamba-v0.1 model, when loaded in full precision, requires a significant amount of memory to load naively. I personally load the model onto CPU on a system with 512GB+ of RAM. If you don't have access to a system with these specifications, you'll likely need to load a quantized version of Jamba to replicate these results.
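For reference, here's a rough sketch of what those two loading paths could look like. This assumes a recent transformers release with native Jamba support, plus bitsandbytes for the quantized route; treat the exact kwargs as a starting point rather than a tested recipe:

```python
# Sketch only: two ways to load Jamba-v0.1 -- full precision on CPU,
# or 8-bit quantized for smaller machines. Kwargs are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-precision CPU load (needs several hundred GB of system RAM):
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="cpu",
    low_cpu_mem_usage=True,
)

# Alternatively, an 8-bit load if you can't fit the full-precision weights:
# quant_config = BitsAndBytesConfig(load_in_8bit=True)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=quant_config,
#     device_map="auto",
# )
```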
@Severian
I've added the pruning script to this repo, you can view it here: https://huggingface.co/OxxoCodes/jamba-small-v1/blob/main/prune.py
I've also created a v2 using different layer mapping, feel free to check that out here: https://huggingface.co/OxxoCodes/jamba-small-v2
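If it helps with reading prune.py, here's a minimal sketch of the general idea (depth-pruning by copying a chosen subset of layers into a smaller config). This is not the actual script, and the layer mapping and attribute names here are assumptions for illustration; the real mapping used for jamba-small lives in the linked file:

```python
# Illustrative sketch of depth-pruning by layer mapping -- NOT the prune.py
# linked above. Layer indices and module attribute names are assumptions;
# check the real script for the mapping actually used for jamba-small.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

source_id = "ai21labs/Jamba-v0.1"

# Example mapping: keep whole blocks of 8 consecutive layers so the kept
# layers still line up with Jamba's interleaved attention/Mamba/MoE pattern.
keep_layers = list(range(0, 8)) + list(range(24, 32))

# Smaller config with fewer hidden layers.
config = AutoConfig.from_pretrained(source_id)
config.num_hidden_layers = len(keep_layers)

# Load the full model on CPU (needs a lot of RAM, as noted above).
full_model = AutoModelForCausalLM.from_pretrained(
    source_id, torch_dtype=torch.bfloat16, device_map="cpu"
)

# Instantiate the smaller model and copy weights layer by layer.
small_model = AutoModelForCausalLM.from_config(config)
small_model.model.embed_tokens.load_state_dict(
    full_model.model.embed_tokens.state_dict()
)
for new_idx, old_idx in enumerate(keep_layers):
    small_model.model.layers[new_idx].load_state_dict(
        full_model.model.layers[old_idx].state_dict()
    )
small_model.model.final_layernorm.load_state_dict(
    full_model.model.final_layernorm.state_dict()
)
small_model.lm_head.load_state_dict(full_model.lm_head.state_dict())

small_model.save_pretrained("jamba-small-sketch")
```

The main thing to keep in mind is that Jamba interleaves attention, Mamba, and MoE layers on fixed periods, so whichever layers you keep need to match that pattern in the smaller config (or the config's layer-period settings need adjusting accordingly).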
Let me know if you'd like to know anything else about the model, cheers!
Thanks for sharing. This is really clever and such a great way to get more models from the core architecture. You are a mad genius
I'm going to try and train the V2 you dropped, I'll let you know how it goes!
Thank you! Definitely let me know how the training run goes, currently planning to get one going myself but university work comes first (for now)