Huanzhi Mao committed
Commit 383da93
Parent: c49d028

update description

Files changed (1):
  app.py  +17 -19
app.py CHANGED
@@ -1051,8 +1051,6 @@ with gr.Blocks() as demo:
             **FC = native support for function/tool calling.**

             **Cost is calculated as an estimate of the cost per 1000 function calls, in USD. Latency is measured in seconds.**
-
-            **AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation.**

             **Click on column header to sort. If you would like to add your model or contribute test-cases, please contact us via [discord](https://discord.gg/SwTyuTAxX3).**
             """
@@ -1062,28 +1060,28 @@ with gr.Blocks() as demo:
     with gr.TabItem("Evaluation Categories"):
         gr.Markdown(
             """
-            # Python Evaluation
-
-            **Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document, with one and only one function call will be invoked.
+            ### What are the different columns representing in the leaderboard?

-            **Multiple Function** contains a user question that only invokes one function call out of 2 to 4 JSON function documentations. The model needs to be capable of selecting the best function to invoke according to user provided context.
+            We provide a short summary here. For more details, please refer to our release [blog](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html):

-            **Parallel Function** is defined as invoking multiple function calls in parallel with one user query. The model needs to digest how many function calls need to be made and the question to model can be a single sentence or multiple sentence.
+            **AST** means evaluation through Abstract Syntax Tree, and **Exec** means evaluation through execution.
+
+            **Cost** is calculated as an estimate of the cost per 1000 function calls, in USD.
+
+            **Latency** is measured in seconds.
+
+            **Simple Function** evaluation contains the simplest but most commonly seen format, where the user supplies a single JSON function document, with one and only one function call will be invoked.

-            **Parallel Multiple Function** is the combination of parallel function and multiple function. In another word, the model is provided with multiple function documentations, each of the corresponding function calls will be invoked zero or more times.
-            """
+            **Multiple Function** contains a user question that only invokes one function call out of 2 to 4 JSON function documentations. The model needs to be capable of selecting the best function to invoke according to user provided context. For example, if the prompt is `what is 2 + 3?` and the options are `add()` and `mult()`, the model should select `add()`.

-        )
-        gr.Markdown(
-            """
-            # non-Python Evaluation
-
+            **Parallel Function** is defined as invoking multiple function calls in parallel with one user query. The model needs to digest how many function calls need to be made and the question to model can be a single sentence or multiple sentence. For example, if the prompt is `What's the weather in San Francisco and New York` and the function provided is `get_weather()`, the model should return both `get_weather('San Francisco')` and `get_weather('New York')`.
+
+            **Parallel Multiple Function** is the combination of parallel function and multiple function. In another word, the model is provided with multiple function documentations, each of the corresponding function calls will be invoked zero or more times.
+
             In **relevance detection**, we design scenarios where none of the provided functions are relevant and supposed to be invoked. We expect the model's output to be no function call. This scenario provides insight to whether a model will hallucinate on its function and parameter to generate function code despite lacking the function information or instructions from the users to do so.
-
-            In **REST**, we include real world GET requests to test the model's capabilities to generate executable REST API calls through complex function documentations, using requests.get() along with the API's hardcoded URL and description of the purpose of the function and its parameters. Our evaluation includes two variations. The first type requires passing the parameters inside the URL, called path parameters. The second type requires the model to put parameters as key/value pairs into the params and/or headers of requests.get(.).
-
-            In **Java** and **Javascript**, the goal is to understand how well the function calling model can be extended to not just Python type but all the language specific typings such as the HashMap in Java. We included 100 examples for Java AST evaluation and 70 examples for Javascript AST evaluation.
-            """)
+            """
+        )
+
     with gr.TabItem("Try It Out"):
         with gr.Row():
             with gr.Column(scale=1):
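The category descriptions above refer to JSON function documents and expected calls. As a rough illustration only (the leaderboard's actual test-case schema is not part of this commit, so the field names `question`, `function`, and `expected` below are assumptions), a simple case versus a parallel case might look like:

```python
# Illustrative sketch only: the field names and structure are assumptions,
# not the leaderboard's actual test-case schema.

# A "simple function" style case: one JSON function document,
# exactly one expected function call.
simple_case = {
    "question": "What's the weather in San Francisco?",
    "function": [
        {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    # For AST evaluation, the model's output is parsed and compared
    # against an expected call like this one.
    "expected": ["get_weather(city='San Francisco')"],
}

# A "parallel function" style case: the same single function document,
# but the query requires two calls in one response.
parallel_case = {
    "question": "What's the weather in San Francisco and New York?",
    "function": simple_case["function"],
    "expected": [
        "get_weather(city='San Francisco')",
        "get_weather(city='New York')",
    ],
}
```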
 
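The removed **REST** paragraph distinguishes two ways parameters reach `requests.get()`: inside the URL path, or as key/value pairs in `params` and/or `headers`. A brief sketch of both, against a hypothetical endpoint (`api.example.com` is a placeholder, not one of the leaderboard's hardcoded URLs):

```python
# Sketch of the two requests.get() variations described in the removed
# "REST" paragraph. The endpoint is hypothetical; real test cases hardcode
# the actual API URL in the function documentation.
import requests

# Variation 1: parameters embedded in the URL itself ("path parameters").
resp_path = requests.get("https://api.example.com/v1/weather/San%20Francisco")

# Variation 2: parameters passed as key/value pairs via params and/or headers.
resp_query = requests.get(
    "https://api.example.com/v1/weather",
    params={"city": "San Francisco", "units": "metric"},
    headers={"Accept": "application/json"},
)
```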
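For orientation in the surrounding code, here is a much-simplified sketch of the Blocks/TabItem layout that this hunk edits. The enclosing `gr.Tabs()` container and the trimmed Markdown are assumptions; the real app.py contains far more than shown here.

```python
# Minimal, simplified sketch of the layout this commit edits; not the real app.py.
import gradio as gr

with gr.Blocks() as demo:
    with gr.Tabs():
        with gr.TabItem("Evaluation Categories"):
            gr.Markdown(
                """
                ### What are the different columns representing in the leaderboard?

                **AST** means evaluation through Abstract Syntax Tree, and **Exec** means evaluation through execution.
                """
            )
        with gr.TabItem("Try It Out"):
            with gr.Row():
                with gr.Column(scale=1):
                    gr.Markdown("Model playground goes here.")

if __name__ == "__main__":
    demo.launch()
```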