Update README.md
README.md CHANGED
@@ -114,20 +114,21 @@ speedups for llama.cpp's simplest quants: Q8\_0 and Q4\_0.

This model is very large. Even at Q2 quantization, it's still well over twice as large as the highest-tier NVIDIA gaming GPUs. llamafile supports splitting models over multiple GPUs (currently NVIDIA only) if you have such a system. The easiest way to have one, if you don't, is to pay a few bucks an hour to rent a 4x RTX 4090 rig off vast.ai.
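
If you do rent or build a multi-GPU box, a run might look like the sketch below. Treat it as an illustration only: the llamafile name is hypothetical, and the `-ngl` and `--tensor-split` flags follow llama.cpp conventions that llamafile is assumed to pass through, so check `--help` on your version.

```sh
# Illustrative sketch of a 4x GPU run (hypothetical filename; flag names
# follow llama.cpp conventions). -ngl 9999 offloads every layer to GPU
# memory and --tensor-split spreads the weights evenly over four cards.
chmod +x mixtral-8x22b-instruct.Q4_0.llamafile
./mixtral-8x22b-instruct.Q4_0.llamafile \
  -ngl 9999 \
  --tensor-split 1,1,1,1 \
  -p '[INST]Why is the sky blue?[/INST]'
```
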
Mac Studio is a good option for running this model locally. An M2 Ultra desktop from Apple is affordable and has 128GB of unified RAM+VRAM. If you have one, then llamafile will use your Metal GPU. Try starting out with the `Q4_0` quantization level.
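
As a sketch, a first run on an M2 Ultra could look like the following; the filename is again hypothetical. Note that on Apple Silicon, llamafile needs the Xcode Command Line Tools installed the first time it runs so it can bootstrap its Metal support.

```sh
# Illustrative sketch of a Metal run on an M2 Ultra (hypothetical filename).
# llamafile detects the Apple GPU; -ngl 9999 asks it to offload all layers.
chmod +x mixtral-8x22b-instruct.Q4_0.llamafile
./mixtral-8x22b-instruct.Q4_0.llamafile -ngl 9999
```

The idea behind starting at `Q4_0` is that the quantized weights should fit within the 128GB of unified memory with room to spare for context.
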
Another good option for running large, large language models locally and fully under your control is to just use CPU inference. We developed new tensor multiplication kernels in the llamafile project specifically to speed up "mixture of experts" LLMs like Mixtral. On an AMD Threadripper Pro 7995WX with 256GB of 5200 MT/s RAM, llamafile v0.8 runs Mixtral 8x22B Q4\_0 on Linux at 98 tokens per second for prompt evaluation, and it predicts 9.44 tokens per second.
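
To try pure CPU inference yourself, you can simply offload no layers to a GPU. A minimal sketch, once more with a hypothetical filename:

```sh
# Illustrative sketch of a CPU-only run (hypothetical filename). -ngl 0
# keeps every layer on the CPU, so the matmul kernels do all the work.
./mixtral-8x22b-instruct.Q4_0.llamafile -ngl 0 \
  -p '[INST]Summarize the llamafile project in one sentence.[/INST]'
```

Recent llamafile releases also document a `--gpu` selector for forcing or disabling GPU use; check `--help` for the options your version supports.
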
---