#### [EN] Upload guide (`jsonl`)
**Basic Requirements**
  * Upload one `jsonl` file per model (e.g., five files to compare five LLMs)
  * ⚠️ Important: All `jsonl` files must have the same number of rows
  * ⚠️ Important: Each file must use a single `model_id`, and no two files may share the same `model_id` (a quick self-check sketch follows this list)
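
For reference, here is a minimal self-check sketch (not part of the app) that verifies both constraints before you upload; the file names are placeholders for your own files.

```python
# Hedged pre-upload check: same row count in every file, and a single,
# non-overlapping model_id per file. File names below are placeholders.
import json
from pathlib import Path

files = ["model1.jsonl", "model2.jsonl"]  # one file per model

row_counts, model_ids = {}, {}
for path in files:
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    row_counts[path] = len(rows)
    ids = {row["model_id"] for row in rows}
    assert len(ids) == 1, f"{path}: expected one model_id, found {ids}"
    model_ids[path] = ids.pop()

assert len(set(row_counts.values())) == 1, f"Row counts differ: {row_counts}"
assert len(set(model_ids.values())) == len(files), f"model_id reused across files: {model_ids}"
print("OK:", row_counts, model_ids)
```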

**Required Fields**
* Per-Model Fields
  * `model_id`: Unique identifier for the model (recommendation: keep it short)
  * `generated`: The LLM's response to the test instruction

* Required only for Translation (the `translation_pair` prompt needs these fields; see `streamlit_app_local/user_submit/mt/llama5.jsonl`). A one-row sketch follows this list.
  * `source_lang`: Input language (e.g., Korean, KR, kor, ...)
  * `target_lang`: Output language (e.g., English, EN, ...)

* Common Fields (Must be identical across all files)
  * `instruction`: The input prompt or test instruction given to the model
  * `task`: Category label used to group results (useful when using different evaluation prompts per task)
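
The example block below omits the translation-only fields, so here is a hedged one-row sketch of a translation record; the values are invented and only the field names come from this guide.

```python
# Illustrative single row for a translation task (values are made up).
import json

row = {
    "model_id": "llama5",            # per-model field: short, unique identifier
    "task": "translation",           # common field: identical across all files
    "instruction": "Translate to English: 안녕하세요",  # common field: same in every file
    "generated": "Hello",            # per-model field: this model's output
    "source_lang": "Korean",         # translation-only field
    "target_lang": "English",        # translation-only field
}
print(json.dumps(row, ensure_ascii=False))  # one JSON object per line of the jsonl file
```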

**Example Format**
```python
# model1.jsonl
{"model_id": "model1", "task": "directions", "instruction": "Where should I go?", "generated": "Over there"}
{"model_id": "model1", "task": "arithmetic", "instruction": "1+1", "generated": "2"}

# model2.jsonl
{"model_id": "model2", "task": "directions", "instruction": "Where should I go?", "generated": "Head north"}
{"model_id": "model2", "task": "arithmetic", "instruction": "1+1", "generated": "3"}
...
```
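
Because the common fields must line up row by row, a small alignment check such as the hedged sketch below (file names taken from the example above) can catch mismatches before uploading.

```python
# Hedged sketch: confirm `instruction` and `task` match row-by-row across files.
import json

def load(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

reference = load("model1.jsonl")      # file names follow the example above
for path in ["model2.jsonl"]:
    rows = load(path)
    for i, (a, b) in enumerate(zip(reference, rows)):
        for field in ("instruction", "task"):
            assert a[field] == b[field], f"{path} row {i}: `{field}` differs from model1.jsonl"
print("Common fields are aligned across all files.")
```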
**Use Case Example**
If you want to compare different prompting strategies for the same model (a sketch follows this list):
* Use the same `instruction` across files (using unified test scenarios).
* The `generated` responses will differ across files, reflecting each prompting strategy.
* Use descriptive `model_id` values like "prompt1", "prompt2", etc.
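
As a hedged sketch of this workflow (the `run_with_prompt` helper and the strategy names are hypothetical), you could generate one file per strategy like this:

```python
# Hypothetical example: two prompting strategies, same instructions, one jsonl per strategy.
import json

instructions = [
    {"task": "directions", "instruction": "Where should I go?"},
    {"task": "arithmetic", "instruction": "1+1"},
]

def run_with_prompt(strategy: str, instruction: str) -> str:
    # Placeholder for however you call your model with a given prompt template.
    return f"[{strategy}] answer to: {instruction}"

for strategy in ["prompt1", "prompt2"]:
    with open(f"{strategy}.jsonl", "w", encoding="utf-8") as f:
        for item in instructions:
            row = {
                "model_id": strategy,                # descriptive id per strategy
                "task": item["task"],                # identical across files
                "instruction": item["instruction"],  # identical across files
                "generated": run_with_prompt(strategy, item["instruction"]),
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```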