Zack Zhiyuan Li committed
Commit • 10bd713
Parent(s): 3e77051
add benchmark with openELM and Phi-3
Files changed:
- OpenELM-benchmark.jpeg (ADDED) +0 -0
- README.md (CHANGED) +6 -3
@@ -32,8 +32,11 @@ Check the Octopus V3 demo video for [Android and iOS](https://octopus3.nexa4ai.c
 </p>
 
 
-## Octopus V2
-
+## Octopus V2 Release
+After open-sourcing our model, we received many requests to compare it with [Apple's OpenELM](https://huggingface.co/apple/OpenELM-3B-Instruct) and [Microsoft's Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct); see the [Evaluation section](#evaluation). On our benchmark dataset, Microsoft's Phi-3 achieves an accuracy of 45.7% with an average inference latency of 10.2 s, while Apple's OpenELM fails to generate a function call at all, as shown in [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Our model, Octopus V2, achieves 99.5% accuracy with an average inference latency of 0.38 s.
+
+We are a very small team with a heavy workload. Please give us more time to prepare the code, and we will **open source** it. We hope the Octopus v2 model will be helpful for you. Let's democratize AI agents for everyone. We've received many requests from the automotive industry, health care, financial services, and more. The Octopus model can be applied to **any function**, so you can start thinking about applications now.
+
 <p align="center">
 - <a href="https://www.nexa4ai.com/" target="_blank">Nexa AI Product</a>
 - <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
@@ -97,7 +100,7 @@ print("latency:", time.time() - start_time," s")
 
 ## Evaluation
 
-The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which
+The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which has been manually verified. Microsoft's Phi-3 model achieved an accuracy of 45.7%, with an average inference latency of 10.2 seconds, while Apple's OpenELM was unable to generate a function call, as shown in [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Additionally, OpenELM's score on the MMLU benchmark is quite low at 26.7, compared to 42.3 for Google's Gemma 2B.
 
 <p align="center" width="100%">
 <a><img src="latency_plot.jpg" alt="ondevice" style="width: 80%; min-width: 300px; display: block; margin: auto; margin-bottom: 20px;"></a>
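The accuracy and latency figures in the added README text (45.7% / 10.2 s for Phi-3, 99.5% / 0.38 s for Octopus V2) combine two simple measurements: wall-clock timing per generation, as the hunk context `print("latency:", time.time() - start_time," s")` suggests, and a match rate against reference function calls. A minimal sketch of both follows; `fake_generate` and the exact-match scoring are placeholder assumptions, not the actual benchmark code:

```python
import time

def time_call(fn, *args, **kwargs):
    """Wall-clock timing around a single call, mirroring the
    time.time() pattern shown in the README's example code."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

def function_call_accuracy(predictions, references):
    """Fraction of generated function calls that exactly match the
    references (exact-match is an assumption; the real benchmark
    scoring may be more lenient)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical stand-in for a model's generate step.
def fake_generate(prompt):
    return "take_a_photo(camera='back')"

output, latency = time_call(fake_generate, "Take a photo with the rear camera")
print("latency:", latency, "s")

acc = function_call_accuracy(
    ["take_a_photo(camera='back')", "set_timer(minutes=5)"],
    ["take_a_photo(camera='back')", "set_timer(minutes=10)"],
)
print("accuracy:", acc)  # 0.5
```

Averaging the per-call latencies over the whole benchmark set would yield numbers comparable to those quoted above.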