Add Hardware section
README.md
CHANGED
@@ -124,11 +124,6 @@ output = pipe(messages, **generation_args)
 print(output[0]['generated_text'])
 ```
 
-Note that by default the model use flash attention which requires certain types of GPU to run. If you want to run the model on:
-
-+ V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
-+ Optimized inference: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
-
 ## Responsible AI Considerations
 
 Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
@@ -213,6 +208,17 @@ The number of k–shot examples is listed per-benchmark.
 * [Transformers](https://github.com/huggingface/transformers)
 * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
 
+## Hardware
+Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+* NVIDIA A100
+* NVIDIA A6000
+* NVIDIA H100
+
+If you want to run the model on:
+* NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
+* CPU: use the GGUF quantized models 4K
+* Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
+
 ## Cross Platform Support
 
 ONNX runtime ecosystem now supports Phi-3 Mini models across platforms and hardware. You can find the optimized ONNX models [here](https://aka.ms/Phi3-ONNX-HF).
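For context on the `attn_implementation="eager"` fallback added in the new Hardware section, a minimal loading sketch is shown below. It assumes the `microsoft/Phi-3-mini-128k-instruct` checkpoint and the standard `transformers` API used elsewhere in this model card; it is an illustration, not part of the diff itself.

```python
# Sketch (assumed, not from the diff): load Phi-3-mini with eager attention
# for GPUs without flash-attention support, e.g. NVIDIA V100 or earlier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"  # assumed checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,        # half precision keeps memory use modest on older GPUs
    attn_implementation="eager",      # disable the flash-attention default
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

The CPU and optimized-inference bullets point at separate runtimes rather than `transformers`: the GGUF files are meant for llama.cpp-compatible loaders, and the ONNX models linked above run under ONNX Runtime.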