Zack Zhiyuan Li committed
Commit • 10bd713
Parent(s): 3e77051
add benchmark with openELM and Phi-3
Files changed:
- OpenELM-benchmark.jpeg (ADDED) +0 -0
- README.md (CHANGED) +6 -3
@@ -32,8 +32,11 @@ Check the Octopus V3 demo video for [Android and iOS](https://octopus3.nexa4ai.c
 </p>
 
 
-## Octopus V2
-
+## Octopus V2 Release
+After open-sourcing our model, we received many requests to compare it with [Apple's OpenELM](https://huggingface.co/apple/OpenELM-3B-Instruct) and [Microsoft's Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct); see the [Evaluation section](#evaluation). On our benchmark dataset, Microsoft's Phi-3 achieves an accuracy of 45.7% with an average inference latency of 10.2 s, while Apple's OpenELM fails to generate a function call at all, as shown in [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Our model, Octopus V2, achieves 99.5% accuracy with an average inference latency of 0.38 s.
+
+We are a very small team with a heavy workload. Please give us more time to prepare the code, and we will **open source** it. We hope the Octopus v2 model will be helpful for you. Let's democratize AI agents for everyone. We've received many requests from the automotive industry, health care, financial services, and more. The Octopus model can be applied to **any function**, so you can start thinking about applications now.
+
 <p align="center">
 - <a href="https://www.nexa4ai.com/" target="_blank">Nexa AI Product</a>
 - <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
@@ -97,7 +100,7 @@ print("latency:", time.time() - start_time," s")
 
 ## Evaluation
 
-The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which
+The benchmark result can be viewed in [this excel](android_benchmark.xlsx), which has been manually verified. Microsoft's Phi-3 model achieved an accuracy of 45.7%, with an average inference latency of 10.2 seconds, while Apple's OpenELM was unable to generate a function call, as shown in [this screenshot](https://huggingface.co/NexaAIDev/Octopus-v2/blob/main/OpenELM-benchmark.jpeg). Additionally, OpenELM's score on the MMLU benchmark is quite low at 26.7, compared to 42.3 for Google's Gemma 2B.
 
 <p align="center" width="100%">
 <a><img src="latency_plot.jpg" alt="ondevice" style="width: 80%; min-width: 300px; display: block; margin: auto; margin-bottom: 20px;"></a>
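The accuracy and latency figures in the added README text (45.7% / 10.2 s for Phi-3, 99.5% / 0.38 s for Octopus V2) combine two simple measurements: wall-clock timing per generation, as the hunk context `print("latency:", time.time() - start_time," s")` suggests, and a match rate against reference function calls. A minimal sketch of both follows; `fake_generate` and the exact-match scoring are placeholder assumptions, not the actual benchmark code:

```python
import time

def time_call(fn, *args, **kwargs):
    """Wall-clock timing around a single call, mirroring the
    time.time() pattern shown in the README's example code."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

def function_call_accuracy(predictions, references):
    """Fraction of generated function calls that exactly match the
    references (exact-match is an assumption; the real benchmark
    scoring may be more lenient)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical stand-in for a model's generate step.
def fake_generate(prompt):
    return "take_a_photo(camera='back')"

output, latency = time_call(fake_generate, "Take a photo with the rear camera")
print("latency:", latency, "s")

acc = function_call_accuracy(
    ["take_a_photo(camera='back')", "set_timer(minutes=5)"],
    ["take_a_photo(camera='back')", "set_timer(minutes=10)"],
)
print("accuracy:", acc)  # 0.5
```

Averaging the per-call latencies over the whole benchmark set would yield numbers comparable to those quoted above.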