Spaces: AIR-Bench
feat-improve-submission-page-0517
#10
Files changed (1)
  1. src/about.py +11 -123
src/about.py CHANGED
@@ -4,137 +4,25 @@ TITLE = """<h1 align="center" id="space-title">AIR-Bench: Automated Heterogeneou
 
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = """
- Check more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench)
  """
 
  # Which evaluations are you running? how can people reproduce what you have?
  BENCHMARKS_TEXT = f"""
- ## How it works
 
- Check more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench)
- """
-
- EVALUATION_QUEUE_TEXT = """
- ## Steps for submit to AIR-Bench
-
- 1. Install AIR-Bench
- ```bash
- pip install air-benchmark
- ```
- 2. Run the evaluation script
- ```bash
- cd AIR-Bench/scripts
- # Run all tasks
- python run_air_benchmark.py \\
- --output_dir ./search_results \\
- --encoder BAAI/bge-m3 \\
- --reranker BAAI/bge-reranker-v2-m3 \\
- --search_top_k 1000 \\
- --rerank_top_k 100 \\
- --max_query_length 512 \\
- --max_passage_length 512 \\
- --batch_size 512 \\
- --pooling_method cls \\
- --normalize_embeddings True \\
- --use_fp16 True \\
- --add_instruction False \\
- --overwrite False
-
- # Run the tasks in the specified task type
- python run_air_benchmark.py \\
- --task_types long-doc \\
- --output_dir ./search_results \\
- --encoder BAAI/bge-m3 \\
- --reranker BAAI/bge-reranker-v2-m3 \\
- --search_top_k 1000 \\
- --rerank_top_k 100 \\
- --max_query_length 512 \\
- --max_passage_length 512 \\
- --batch_size 512 \\
- --pooling_method cls \\
- --normalize_embeddings True \\
- --use_fp16 True \\
- --add_instruction False \\
- --overwrite False
 
- # Run the tasks in the specified task type and domains
- python run_air_benchmark.py \\
- --task_types long-doc \\
- --domains arxiv book \\
- --output_dir ./search_results \\
- --encoder BAAI/bge-m3 \\
- --reranker BAAI/bge-reranker-v2-m3 \\
- --search_top_k 1000 \\
- --rerank_top_k 100 \\
- --max_query_length 512 \\
- --max_passage_length 512 \\
- --batch_size 512 \\
- --pooling_method cls \\
- --normalize_embeddings True \\
- --use_fp16 True \\
- --add_instruction False \\
- --overwrite False
 
- # Run the tasks in the specified languages
- python run_air_benchmark.py \\
- --languages en \\
- --output_dir ./search_results \\
- --encoder BAAI/bge-m3 \\
- --reranker BAAI/bge-reranker-v2-m3 \\
- --search_top_k 1000 \\
- --rerank_top_k 100 \\
- --max_query_length 512 \\
- --max_passage_length 512 \\
- --batch_size 512 \\
- --pooling_method cls \\
- --normalize_embeddings True \\
- --use_fp16 True \\
- --add_instruction False \\
- --overwrite False
-
- # Run the tasks in the specified task type, domains, and languages
- python run_air_benchmark.py \\
- --task_types qa \\
- --domains wiki web \\
- --languages en \\
- --output_dir ./search_results \\
- --encoder BAAI/bge-m3 \\
- --reranker BAAI/bge-reranker-v2-m3 \\
- --search_top_k 1000 \\
- --rerank_top_k 100 \\
- --max_query_length 512 \\
- --max_passage_length 512 \\
- --batch_size 512 \\
- --pooling_method cls \\
- --normalize_embeddings True \\
- --use_fp16 True \\
- --add_instruction False \\
- --overwrite False
- ```
- 3. Package the search results.
- ```bash
- # Zip "Embedding Model + NoReranker" search results in "<search_results>/<model_name>/NoReranker" to "<save_dir>/<model_name>_NoReranker.zip".
- python zip_results.py \\
- --results_dir search_results \\
- --model_name bge-m3 \\
- --save_dir search_results/zipped_results
-
- # Zip "Embedding Model + Reranker" search results in "<search_results>/<model_name>/<reranker_name>" to "<save_dir>/<model_name>_<reranker_name>.zip".
- python zip_results.py \\
- --results_dir search_results \\
- --model_name bge-m3 \\
- --reranker_name bge-reranker-v2-m3 \\
- --save_dir search_results/zipped_results
- ```
- 4. Upload the `.zip` file on this page and fill in the model information:
- - Model Name: such as `bge-m3`.
- - Model URL: such as `https://huggingface.co/BAAI/bge-m3`.
- - Reranker Name: such as `bge-reranker-v2-m3`. Keep empty for `NoReranker`.
- - Reranker URL: such as `https://huggingface.co/BAAI/bge-reranker-v2-m3`. Keep empty for `NoReranker`.
-
- If you want to stay anonymous, you can only fill in the Model Name and Reranker Name (keep empty for `NoReranker`), and check the selection box below befor submission.
 
- 5. Congratulation! Your results will be shown on the leaderboard in up to one hour.
  """
 
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = """
+ ## Check out more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench)
  """
 
  # Which evaluations are you running? how can people reproduce what you have?
  BENCHMARKS_TEXT = f"""
+ ## How are the test data generated?
+ ### Find more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/data_generation.md)
 
+ ## FAQ
+ - Q: Will you release new versions of the datasets regularly? How often will AIR-Bench release a new version?
+ - A: Yes, we plan to release new datasets on a regular basis; however, the update frequency is still to be decided.
 
+ - Q: Since you use models for quality control when generating the data, are the results biased toward the models that are used?
+ - A: Yes, the results are biased toward the chosen models. However, we believe that datasets labeled by humans are also biased toward human preferences. The key point to verify is whether the models' bias is consistent with the human one. We used our approach to generate test data from the well-established MS MARCO dataset, then benchmarked different models on both the generated dataset and the human-labeled DEV set. Comparing the rankings of the models on these two datasets, we observe a Spearman correlation of 0.8211 (p-value = 5e-5), which indicates that the models' preferences are well aligned with human preferences. Please refer to [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_evaluation_results.md#consistency-with-ms-marco) for details.
 
+ """
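The consistency check described in the FAQ answer above can be illustrated with a minimal sketch. This is not AIR-Bench's evaluation code: the model names and nDCG@10 values are hypothetical placeholders, and the only assumption is the standard `scipy.stats.spearmanr` function for computing the rank correlation between the two sets of results.

```python
# Minimal illustration (not AIR-Bench code) of the consistency check described above:
# compare how the same models rank on a generated test set vs. the human-labeled
# MS MARCO DEV set. Model names and nDCG@10 values below are hypothetical.
from scipy.stats import spearmanr

generated_scores = {"model-a": 0.71, "model-b": 0.66, "model-c": 0.58, "model-d": 0.52}
human_dev_scores = {"model-a": 0.44, "model-b": 0.41, "model-c": 0.37, "model-d": 0.30}

models = sorted(generated_scores)  # fixed model order for both score lists
rho, p_value = spearmanr(
    [generated_scores[m] for m in models],
    [human_dev_scores[m] for m in models],
)
print(f"Spearman correlation: {rho:.4f} (p-value={p_value:.2e})")
```

A correlation close to 1 means the generated labels rank systems in nearly the same order as the human labels; the AIR-Bench docs linked above report 0.8211 for this comparison.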
 
+ EVALUATION_QUEUE_TEXT = """
+ ## Check out the submission steps at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md)
  """
 
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"