gregmialz committed on
Commit ac816a0
1 Parent(s): 2937740

Update content.py

Files changed (1)
  1. content.py +4 -0
content.py CHANGED
@@ -10,6 +10,8 @@ GAIA is made of 3 evaluation levels, depending on the added level of tooling and
  We expect the level 1 to be breakable by very good LLMs, and the level 3 to indicate a strong jump in model capabilities.
  Each of these levels is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Results can be submitted for both validation and test.

+ The data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, that can be found in the same folder and whose id is given in the field `file_name`.
+
  We expect submissions to be json-line files with the following format. The first two fields are mandatory, `reasoning_trace` is optionnal:
  ```
  {"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
@@ -19,6 +21,8 @@ We expect submissions to be json-line files with the following format. The first

  Scores are expressed as the percentage of correct answers for a given split.

+ Submission made by our team are labelled "GAIA authors". While we report average scores over different runs when possible in our paper, we only report the best run in the leaderboard.
+
  Please do not repost the public dev set, nor use it in training data for your models.
  """
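As an illustration of the submission format and the scoring rule described in the text this commit touches, here is a minimal Python sketch. It writes a JSON-lines file with the two mandatory fields (`task_id`, `model_answer`) and the optional `reasoning_trace`, then computes a percentage of correct answers. The helper names (`write_submission`, `read_jsonl`, `percent_correct`) are hypothetical, and the exact-match comparison is an assumption made for the example; the leaderboard's actual scorer is not shown in this diff and may normalize answers differently.

```python
import json

# Mandatory fields per the submission description; reasoning_trace is optional.
REQUIRED_FIELDS = ("task_id", "model_answer")


def write_submission(rows, path):
    """Write one JSON object per line, rejecting rows that lack a mandatory field."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            missing = [k for k in REQUIRED_FIELDS if k not in row]
            if missing:
                raise ValueError(f"missing mandatory fields: {missing}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def percent_correct(submission, gold):
    """Percentage of tasks in `gold` (task_id -> answer) answered correctly.

    Assumption: plain exact match after stripping whitespace. This is a
    stand-in for illustration, not the leaderboard's scoring code.
    """
    if not gold:
        return 0.0
    answers = {row["task_id"]: row["model_answer"] for row in submission}
    correct = sum(
        1
        for task_id, expected in gold.items()
        if str(answers.get(task_id, "")).strip() == str(expected).strip()
    )
    return 100.0 * correct / len(gold)
```

A missing `task_id` in the submission counts as a wrong answer here, so the denominator is always the number of tasks in the split rather than the number of submitted rows.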