Update README.md
README.md CHANGED
@@ -114,20 +114,21 @@ speedups for llama.cpp's simplest quants: Q8\_0 and Q4\_0.

This model is very large. Even at Q2 quantization, it's still well over twice as large as the highest-tier NVIDIA gaming GPUs. llamafile supports splitting models over multiple GPUs (currently NVIDIA only) if you have such a system. The easiest way to have one, if you don't, is to pay a few bucks an hour to rent a 4x RTX 4090 rig off vast.ai.
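
If you do rent or build a multi-GPU box, a run might look like the sketch below. Treat it as an illustration only: the llamafile name is hypothetical, and the `-ngl` and `--tensor-split` flags follow llama.cpp conventions that llamafile is assumed to pass through, so check `--help` on your version.

```sh
# Illustrative sketch of a 4x GPU run (hypothetical filename; flag names
# follow llama.cpp conventions). -ngl 9999 offloads every layer to GPU
# memory and --tensor-split spreads the weights evenly over four cards.
chmod +x mixtral-8x22b-instruct.Q4_0.llamafile
./mixtral-8x22b-instruct.Q4_0.llamafile \
  -ngl 9999 \
  --tensor-split 1,1,1,1 \
  -p '[INST]Why is the sky blue?[/INST]'
```
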
Mac Studio is a good option for running this model locally. An M2 Ultra desktop from Apple is affordable and has 128GB of unified RAM+VRAM. If you have one, then llamafile will use your Metal GPU. Try starting out with the `Q4_0` quantization level.
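
As a sketch, a first run on an M2 Ultra could look like the following; the filename is again hypothetical. Note that on Apple Silicon, llamafile needs the Xcode Command Line Tools installed the first time it runs so it can bootstrap its Metal support.

```sh
# Illustrative sketch of a Metal run on an M2 Ultra (hypothetical filename).
# llamafile detects the Apple GPU; -ngl 9999 asks it to offload all layers.
chmod +x mixtral-8x22b-instruct.Q4_0.llamafile
./mixtral-8x22b-instruct.Q4_0.llamafile -ngl 9999
```

The idea behind starting at `Q4_0` is that the quantized weights should fit within the 128GB of unified memory with room to spare for context.
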
Another good option for running large, large language models locally and fully under your control is to just use CPU inference. We developed new tensor multiplication kernels in the llamafile project specifically to speed up "mixture of experts" LLMs like Mixtral. On an AMD Threadripper Pro 7995WX with 256GB of 5200 MT/s RAM, llamafile v0.8 runs Mixtral 8x22B Q4\_0 on Linux at 98 tokens per second for prompt evaluation, and it predicts 9.44 tokens per second.
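
To try pure CPU inference yourself, you can simply offload no layers to a GPU. A minimal sketch, once more with a hypothetical filename:

```sh
# Illustrative sketch of a CPU-only run (hypothetical filename). -ngl 0
# keeps every layer on the CPU, so the matmul kernels do all the work.
./mixtral-8x22b-instruct.Q4_0.llamafile -ngl 0 \
  -p '[INST]Summarize the llamafile project in one sentence.[/INST]'
```

Recent llamafile releases also document a `--gpu` selector for forcing or disabling GPU use; check `--help` for the options your version supports.
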
---