Wait, a 4x13B model?
#1 by mirek190 - opened
WTF ;D
Yeah, the transformers library supports defining your own MoE and fine-tuning it. A MoE is implemented with an nn.Linear gate followed by a softmax, then the top-k experts are chosen per token. But I am still curious how it is done here.
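The gating mechanism described above (linear gate, softmax, pick the top-k experts) can be sketched roughly like this. This is a minimal illustration in PyTorch, not the actual implementation of this model; the class name, expert shape, and expert count are all made up for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: a linear gate scores the experts,
    the top-k scores are softmaxed into mixing weights, and only the
    chosen experts run on each token."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Hypothetical experts: small feed-forward blocks, one per expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim)
        scores = self.gate(x)                              # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # renormalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The double loop is for clarity; real implementations batch tokens per expert for speed.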
Edit:
Here is the discussion: https://huggingface.co/Undi95/Llamix2-MLewd-4x13B/discussions/1?not-for-all-audiences=true
This is probably a dumb question, but I won't know the answer till I ask, and research hasn't quite made it clear to me. How do I determine the max context size I can use with this? I see it limited to 2048 in the default SillyTavern setup, but I've seen it mentioned that you can turn it up higher in some cases. If I'm asking in the wrong place I apologize, and then ask: where's the right place?
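The thread never answers this, but one way to check (my assumption, not something the thread confirms) is to read the model's config.json from its repo: Llama-2-style configs store the trained context window as `max_position_embeddings`, and Llama-2 13B derivatives were trained at 4096. A tiny helper, with a made-up config dict for illustration:

```python
def max_context(config: dict) -> int:
    """Return the trained context window from a parsed config.json.

    Llama-2-style configs expose it as max_position_embeddings;
    fall back to 2048 (the old Llama-1 default) if the key is absent.
    """
    return int(config.get("max_position_embeddings", 2048))

# Example config fragment, shaped like a Llama-2 13B derivative's config.json
llama2_cfg = {"model_type": "llama", "max_position_embeddings": 4096}
```

Frontends like SillyTavern default conservatively to 2048, so raising the slider up to the config value is usually safe; going beyond it needs a scaling trick like RoPE/NTK scaling.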