Add Hardware section
README.md
CHANGED
@@ -124,11 +124,6 @@ output = pipe(messages, **generation_args)
 print(output[0]['generated_text'])
 ```
 
-Note that by default the model use flash attention which requires certain types of GPU to run. If you want to run the model on:
-
-+ V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
-+ Optimized inference: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
-
 ## Responsible AI Considerations
 
 Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
@@ -213,6 +208,17 @@ The number of k–shot examples is listed per-benchmark.
 * [Transformers](https://github.com/huggingface/transformers)
 * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
 
+## Hardware
+Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+* NVIDIA A100
+* NVIDIA A6000
+* NVIDIA H100
+
+If you want to run the model on:
+* NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
+* CPU: use the GGUF quantized models 4K
+* Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [128K](https://aka.ms/phi3-mini-128k-instruct-onnx)
+
 ## Cross Platform Support
 
 ONNX runtime ecosystem now supports Phi-3 Mini models across platforms and hardware. You can find the optimized ONNX models [here](https://aka.ms/Phi3-ONNX-HF).
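For context on the `attn_implementation="eager"` fallback added in the new Hardware section, a minimal loading sketch is shown below. It assumes the `microsoft/Phi-3-mini-128k-instruct` checkpoint and the standard `transformers` API used elsewhere in this model card; it is an illustration, not part of the diff itself.

```python
# Sketch (assumed, not from the diff): load Phi-3-mini with eager attention
# for GPUs without flash-attention support, e.g. NVIDIA V100 or earlier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"  # assumed checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,        # half precision keeps memory use modest on older GPUs
    attn_implementation="eager",      # disable the flash-attention default
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

The CPU and optimized-inference bullets point at separate runtimes rather than `transformers`: the GGUF files are meant for llama.cpp-compatible loaders, and the ONNX models linked above run under ONNX Runtime.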