LMoE version with 3 LoRAs on base Mistral model
Created three separate LoRAs, one each for the Actor, Critic and Regenerator models in HelixNet, then combined them in a modified script that dynamically applies the right adapter to the Mistral base according to the actor / critic / regenerator mode. The memory requirement drops from 3 x 14GB for the full models to 1 x 14GB + 3 x 320MB for the base plus LoRAs.
LoRAs and modified code example here: https://huggingface.co/rhysjones/HelixNet-LMoE-Actor
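Roughly, the adapter switching works like this (a minimal sketch using PEFT; the base model id and the Critic / Regenerator repo names are illustrative and assumed to follow the Actor naming, see the actual script in the repo above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# One copy of the ~14GB Mistral base in memory... (base id assumed)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# ...plus three ~320MB LoRA adapters attached to the same base.
# (Critic / Regenerator repo names below are placeholders following the Actor naming.)
model = PeftModel.from_pretrained(base, "rhysjones/HelixNet-LMoE-Actor", adapter_name="actor")
model.load_adapter("rhysjones/HelixNet-LMoE-Critic", adapter_name="critic")
model.load_adapter("rhysjones/HelixNet-LMoE-Regenerator", adapter_name="regenerator")

# Switching between actor / critic / regenerator mode is just selecting an adapter;
# no second or third copy of the base weights is needed.
model.set_adapter("critic")
```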
Really awesome work! What’s the added delay for loading LoRAs?
Loading the LoRAs is very quick (milliseconds). The tradeoff is in inference performance, since each forward pass now goes through both the base model weights and the LoRA deltas, adding an extra computation every time.
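To make that extra step concrete: with unmerged LoRAs, every adapted linear layer does the base projection plus a small low-rank update on each forward pass, roughly:

```python
import torch

def lora_linear(x, W, A, B, scaling):
    """Illustrative unmerged LoRA forward for one linear layer.
    W: (out, in) base weight, A: (r, in), B: (out, r) adapter matrices."""
    base = x @ W.T                      # normal dense projection
    delta = (x @ A.T) @ B.T * scaling   # two extra low-rank matmuls per token
    return base + delta
```

The extra matmuls are small (rank r), but they run unfused on every token, which is where the per-token overhead comes from.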
Initial testing on a 4090 using the demo script gives:
HelixNet Actor model: 44 tokens / second
Mistral + Actor LoRA: 27 tokens / second
That’s still very usable! Nice.
I’ve been running GPTQ 6-bit quantized versions with exllamav2 and getting around 120 tok/second on my 4090. I think the output quality is about the same, maybe a slight degradation.
Yes, ExLlamaV2 is excellent!
It turns out exllamav2 also supports loading multiple LoRAs. Adapting the LMoE to use the 6-bit exl2 quantized version of Mistral and loading the LoRAs within exllamav2 gives much better results on the 4090:
3 separate models: 120 tokens / second, using 20GB of GPU memory
LMoE combined model: 91 tokens / second, using 8GB of GPU memory
Update at: https://huggingface.co/rhysjones/HelixNet-LMoE-6.0bpw-h6-exl2
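For reference, the multi-LoRA loading in exllamav2 looks roughly like this (a sketch from memory of the exllamav2 Python API; paths, prompt and sampler settings are placeholders, the actual code is in the repo above):

```python
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, ExLlamaV2Lora
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the 6.0bpw exl2 quantized Mistral base once.
config = ExLlamaV2Config()
config.model_dir = "/models/mistral-7b-6.0bpw-h6-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Load all three role LoRAs against the same quantized base (placeholder paths).
actor = ExLlamaV2Lora.from_directory(model, "/loras/helixnet-actor")
critic = ExLlamaV2Lora.from_directory(model, "/loras/helixnet-critic")
regenerator = ExLlamaV2Lora.from_directory(model, "/loras/helixnet-regenerator")

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# Pick the adapter per call: here we generate in "actor" mode.
output = generator.generate_simple("Write a plan for...", settings, 256, loras=[actor])
print(output)
```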
Wow, 8GB is within reach of most people. Nice work!
@rhysjones This implementation works fantastically!
The network is not yet perfect; I wanted to get it out to you guys first and then iterate. For example, the regenerator sometimes says things that aren't ideal right now. I've started dataset creation for v2 and will keep refining this over time; I think the approach is sound, and it needs much less compute than a full MoE.
Thanks for your contributions, guys!