wwwaj committed on
Commit 08a05a1
1 Parent(s): f054138

Add Hardware section

Files changed (1)
  1. README.md +11 -5
README.md CHANGED
@@ -124,11 +124,6 @@ output = pipe(messages, **generation_args)
  print(output[0]['generated_text'])
  ```

- Note that by default the model use flash attention which requires certain types of GPU to run. If you want to run the model on:
-
- + V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
- + Optimized inference: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
-
  ## Responsible AI Considerations

  Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
@@ -213,6 +208,17 @@ The number of k–shot examples is listed per-benchmark.
  * [Transformers](https://github.com/huggingface/transformers)
  * [Flash-Attention](https://github.com/HazyResearch/flash-attention)

+ ## Hardware
+ Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+ * NVIDIA A100
+ * NVIDIA A6000
+ * NVIDIA H100
+
+ If you want to run the model on:
+ * NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
+ * CPU: use the GGUF quantized models 4K
+ * Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
+
  ## Cross Platform Support

  ONNX runtime ecosystem now supports Phi-3 Mini models across platforms and hardware. You can find the optimized ONNX models [here](https://aka.ms/Phi3-ONNX-HF).
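For readers who want to try the V100 fallback described in the new Hardware section, here is a minimal sketch using the standard `transformers` API. The checkpoint name `microsoft/Phi-3-mini-128k-instruct`, the fp16 dtype, `device_map="cuda"`, and `trust_remote_code=True` are assumptions made for illustration, not part of this commit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumed checkpoint name; substitute the 4K variant if that is the one you use.
model_id = "microsoft/Phi-3-mini-128k-instruct"

# attn_implementation="eager" selects the plain PyTorch attention path,
# so no flash-attention kernels (and hence no Ampere/Hopper GPU) are required.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # fp16 runs on V100; bf16 needs Ampere or newer
    attn_implementation="eager",
    device_map="cuda",           # requires `accelerate`; drop this to stay on CPU
    trust_remote_code=True,      # assumption: the repo ships custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [{"role": "user", "content": "Give a one-sentence summary of flash attention."}]
output = pipe(messages, max_new_tokens=64)
print(output[0]["generated_text"])
```

Eager attention trades some speed for compatibility on older GPUs; for CPU or mobile inference, the committed README points to the GGUF and ONNX builds instead.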