MarsupialAI/Yeet_51b_200k

MarsupialAI committed on Jan 21

Commit b36360a • 1 Parent(s): 5cd0961

Update README.md

Files changed (1)

README.md +1 -1

README.md CHANGED

@@ -6,7 +6,7 @@ license_name: yi-other
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/65a531bc7ec6af0f95c707b1/MlNwNVEmqbApTv8gRT8-g.jpeg)
-This model is a rotating-stack merge of three Yi 34b 200k models in a 51b (90 layer) configuration.  My reasoning behind this merge was twofold:  I'd never seen a stacked merge made from 34b models, and I thought that maybe this could give near-70b performance but with a much larger context window, but still fitting within 48GB of VRAM.  I think the results are quite good.  The model performs on par with many 70b models at RP, chat, and storywriting.  At Q4_K_S it will fit into a pair of 24GB GPUs with 32k context.  Coherency at 32k is excellent, and will probably remain very good well beyond that thanks to the 200k base training.
 The gotcha here is speed.  While it inferences as you'd expect for the model size, it's much slower than a similarly-sized 8x7b MoE.  And while I personally find the output of this model to outperform any mixtral finetune I've seen so far, those finetunes are getting better all the time, and this really is achingly slow with a lot of context.  I'm getting less than half a token per second on a pair of P40s with a full 32k prompt.

 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/65a531bc7ec6af0f95c707b1/MlNwNVEmqbApTv8gRT8-g.jpeg)
+This model is a rotating-stack merge of three Yi 34b 200k models in a 51b (90 layer) configuration.  My reasoning behind this merge was twofold:  I'd never seen a stacked merge made from 34b models, and I thought that maybe this could give near-70b performance, but with a much larger context window while still fitting within 48GB of VRAM.  I think the results are quite good.  The model performs on par with many 70b models at RP, chat, and storywriting.  At Q4_K_S it will fit into a pair of 24GB GPUs with 32k context.  Coherency at 32k is excellent, and will probably remain very good well beyond that thanks to the 200k base training.
 The gotcha here is speed.  While it inferences as you'd expect for the model size, it's much slower than a similarly-sized 8x7b MoE.  And while I personally find the output of this model to outperform any mixtral finetune I've seen so far, those finetunes are getting better all the time, and this really is achingly slow with a lot of context.  I'm getting less than half a token per second on a pair of P40s with a full 32k prompt.