Xenova committed c155a0e (1 parent: d50c5c3)

Upload README.md

Files changed (1): README.md (+62 −62)
---
license: mit
pipeline_tag: text-generation
tags:
- ONNX
- ONNXRuntime
- ONNXRuntimeWeb
- phi3
- transformers.js
- transformers
- nlp
- conversational
- custom_code
inference: false
---

# Phi-3 Mini-4K-Instruct ONNX model for in-browser inference

<!-- Provide a quick summary of what the model is/does. -->
Run Phi-3-mini-4K entirely in the browser! Check out this [demo](https://guschmue.github.io/ort-webgpu/chat/index.html).

This repository hosts the optimized web version of the ONNX Phi-3-mini-4k-instruct model to accelerate inference in the browser with ONNX Runtime Web.

[Phi-3-Mini-4K-Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) is a 3.8B-parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality, reasoning-dense properties. When assessed on benchmarks covering common sense, language understanding, math, code, long context, and logical reasoning, Phi-3 Mini-4K-Instruct showed robust, state-of-the-art performance among models with fewer than 13 billion parameters.
25
+
26
+ ## How to run
27
+
28
+ [ONNX Runtime Web](https://onnxruntime.ai/docs/tutorials/web/build-web-app.html) is a JavaScript library to enable web developers to deploy machine learning models directly in web browsers, offering multiple backends leveraging hardware acceleration. WebGPU backend is recommended to run Phi-3-mini efficiently.
29
+
30
+
31
+ Here is an [E2E example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/chat) for running this optimized Phi3-mini-4K for the web, with ONNX Runtime harnessing WebGPU.
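
As a minimal sketch (not this repository's code, and greatly simplified relative to the E2E example above), an ONNX Runtime Web session can be created with the WebGPU execution provider as shown below; the file name is a placeholder, and tokenization, external weight data, KV-cache management, and the sampling loop are left to the full example:

```js
// Sketch only: create a session on the WebGPU execution provider.
// "model.onnx" is a placeholder path; the real app also supplies the
// external weight data and implements the full generation loop.
import * as ort from 'onnxruntime-web/webgpu'; // WebGPU-enabled bundle (assumed import path)

const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
});

// Inspect the graph I/O the generation loop needs to feed and read.
console.log('inputs:', session.inputNames, 'outputs:', session.outputNames);
```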

**Supported devices and browsers with WebGPU**: Chrome 113+ and Edge 113+ on Mac, Windows, and ChromeOS, and Chrome 121+ on Android. Please see [the WebGPU implementation status page](https://github.com/gpuweb/gpuweb/wiki/Implementation-Status#safari-in-progress) to track WebGPU support across browsers.

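As an illustrative runtime check (an assumption, not part of this repository), a page can probe for WebGPU before choosing a backend and fall back to the `wasm` execution provider when no adapter is available:

```js
// Prefer WebGPU when the browser exposes navigator.gpu and an adapter is
// available; otherwise fall back to the CPU (wasm) execution provider.
async function pickExecutionProvider() {
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      return 'webgpu';
    }
  }
  return 'wasm';
}
```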

## Performance Metrics

Performance varies between GPUs; the more powerful the GPU, the higher the throughput. On an NVIDIA GeForce RTX 4090: ~42 tokens/second.
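
For context on how a tokens-per-second figure like the one above can be obtained, a rough client-side measurement might look like the following; `generate()` is a hypothetical stand-in for your own generation loop:

```js
// Rough throughput measurement around a generation loop.
// generate() is a hypothetical helper that returns the number of tokens it produced.
const userPrompt = 'Tell me about the Phi-3 model.';
const start = performance.now();
const tokensGenerated = await generate(userPrompt);
const elapsedSeconds = (performance.now() - start) / 1000;
console.log(`${(tokensGenerated / elapsedSeconds).toFixed(1)} tokens/second`);
```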

## Additional Details

To obtain other optimized Phi-3-mini-4k ONNX models for server platforms, Windows, Linux, and Mac desktops, and mobile, please visit [Phi-3-mini-4k-instruct onnx model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx). The web version differs from the other versions as follows:

1. The model is fp16 with int4 block quantization for the weights.
2. The 'logits' output is fp32.
3. The model uses multi-head attention (MHA) instead of grouped-query attention (GQA).
4. The .onnx file and the external data file need to stay below 2 GB to be cacheable in Chromium (see the caching sketch after this list).

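To illustrate the caching constraint in item 4, here is a hedged sketch (not this repository's code) of fetching and caching model bytes with the browser Cache API; the URL is a placeholder, and files above the ~2 GB limit cannot be stored this way in Chromium:

```js
// Fetch the model once and keep it in the browser cache so repeat visits
// skip the download. The URL passed in is a placeholder.
async function fetchAndCache(url) {
  const cache = await caches.open('phi3-mini-web');
  const cached = await cache.match(url);
  if (cached) {
    return new Uint8Array(await cached.arrayBuffer());
  }
  const response = await fetch(url);
  await cache.put(url, response.clone());
  return new Uint8Array(await response.arrayBuffer());
}

// The returned bytes can be passed directly to ort.InferenceSession.create().
```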

To optimize a fine-tuned Phi-3-mini-4k model to run with ONNX Runtime Web, please follow [this Olive example](https://github.com/microsoft/Olive/tree/main/examples/phi3). [Olive](https://github.com/microsoft/OLive) is an easy-to-use model optimization tool for generating an optimized ONNX model that runs efficiently with ONNX Runtime across platforms.

## Model Description

- **Developed by:** Microsoft
- **Model type:** ONNX
- **Inference Language(s) (NLP):** JavaScript
- **License:** MIT
- **Model Description:** This is the web version of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference.

## Model Card Contact

guschmue, qining