Zack Zhiyuan Li committed
Commit 10bd713
1 Parent(s): 3e77051

add benchmark with openELM and Phi-3

Files changed (2):
  1. OpenELM-benchmark.jpeg +0 -0
  2. README.md +6 -3
OpenELM-benchmark.jpeg ADDED
README.md CHANGED
@@ -32,8 +32,11 @@ Check the Octopus V3 demo video for [Android and iOS](https://octopus3.nexa4ai.c
 </p>
 
 
-## Octopus V2
-We are a very small team with many work. Please give us more time to prepare the code, and we will **open source** it. We hope Octopus v2 model will be helpful for you. Let's democratize AI agents for everyone. We've received many requests from car industry, health care, financial system etc. Octopus model is able to be applied to **any function**, and you can start to think about it now.
+## Octopus V2 Release
+After open-sourcing our model, we received many requests to compare it with [Apple's OpenELM](https://huggingface.co/apple/OpenELM-3B-Instruct) and [Microsoft's Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct); see the [Evaluation section](#evaluation). On our benchmark dataset, Microsoft's Phi-3 achieves 45.7% accuracy with an average inference latency of 10.2 s, while Apple's OpenELM fails to generate function calls; see [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Our model, Octopus V2, achieves 99.5% accuracy with an average inference latency of 0.38 s.
+
+We are a very small team with a lot of work. Please give us more time to prepare the code, and we will **open source** it. We hope the Octopus v2 model will be helpful for you. Let's democratize AI agents for everyone. We have received many requests from the automotive, healthcare, and financial industries, among others. The Octopus model can be applied to **any function**, so you can start thinking about your use case now.
+
 <p align="center">
 - <a href="https://www.nexa4ai.com/" target="_blank">Nexa AI Product</a>
 - <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
@@ -97,7 +100,7 @@ print("latency:", time.time() - start_time," s")
 
 ## Evaluation
 
-The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which is manually verified. All the queries in the benchmark test are sampled by Gemini.
+The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which has been manually verified. Microsoft's Phi-3 model achieved an accuracy of 45.7% with an average inference latency of 10.2 seconds, while Apple's OpenELM was unable to generate a function call, as shown in [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Additionally, OpenELM scores quite low on the MMLU benchmark at 26.7, compared to 42.3 for Google's Gemma 2B.
 
 <p align="center" width="100%">
 <a><img src="latency_plot.jpg" alt="ondevice" style="width: 80%; min-width: 300px; display: block; margin: auto; margin-bottom: 20px;"></a>
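The latency figures in the Evaluation section come from wall-clock timing around a single inference call, as the `print("latency:", time.time() - start_time," s")` line in the README's example suggests. Below is a minimal sketch of that timing pattern; the `run_octopus_v2` stub and its return value are assumptions for illustration, not the real Octopus V2 API.

```python
import time

def run_octopus_v2(query: str) -> str:
    """Hypothetical stand-in for an on-device model call.

    The real benchmark invokes the Octopus V2 model; here a short
    sleep simulates inference so the timing pattern is runnable.
    """
    time.sleep(0.05)  # simulated inference time
    return "<func_call>"

# Wall-clock latency measurement, mirroring the README's pattern.
start_time = time.time()
output = run_octopus_v2("Take a selfie with the front camera")
latency = time.time() - start_time
print("latency:", latency, "s")
```

Averaging this measurement over all benchmark queries would yield the mean inference latency reported above.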