Fengzhe Zhou committed
Commit 76c0e18 · 0 parent(s)

initial commit
.gitignore ADDED
@@ -0,0 +1,2 @@
+ scripts/
+ __pycache__/
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ title: Physical AI Bench Leaderboard
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ app_file: app.py
+ pinned: true
+ license: mit
+ short_description: Benchmark for Physical AI generation and understanding
+ sdk_version: 5.43.1
+ tags:
+   - leaderboard
+   - physical-ai
+   - world-models
+   - autonomous-driving
+   - robotics
+   - embodied-ai
+ ---
+ 
+ # Physical AI Bench Leaderboard
+ 
+ **Physical AI Bench (PAI-Bench)** is a comprehensive benchmark suite for evaluating physical AI generation and understanding across diverse scenarios including autonomous vehicles, robotics, industrial spaces, and everyday ego-centric environments.
+ 
+ ## Resources
+ 
+ - 🌐 [GitHub Repository](https://github.com/SHI-Labs/physical-ai-bench)
+ - 📊 [Predict Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-predict)
+ - 📊 [Transfer Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-transfer)
+ - 📊 [Reason Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-reason)
+ 
+ ## Citation
+ 
+ ```bibtex
+ @misc{PAIBench2025,
+   title={Physical AI Bench: A Comprehensive Benchmark for Physical AI Generation and Understanding},
+   author={Fengzhe Zhou and Jiannan Huang and Jialuo Li and Humphrey Shi},
+   year={2025},
+   url={https://github.com/SHI-Labs/physical-ai-bench}
+ }
+ ```
+ 
+ ---
+ 
+ # Configuration
+ 
+ Most of the variables you need to change for a default leaderboard live in `src/env.py` (set the paths for your leaderboard) and `src/about.py` (task definitions).
+ 
+ Results files should have the following format and be stored as JSON files:
+ ```json
+ {
+     "config": {
+         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
+         "model_name": "path of the model on the hub: org/model",
+         "model_sha": "revision on the hub"
+     },
+     "results": {
+         "task_name": {
+             "metric_name": score
+         },
+         "task_name2": {
+             "metric_name": score
+         }
+     }
+ }
+ ```
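A minimal sketch of how such a results file could be read and sanity-checked. Note that the `#` annotations in the block above are explanatory only; actual files must be plain valid JSON. `load_results` is a hypothetical helper for illustration, not part of this repository:

```python
import json
import tempfile

def load_results(path):
    """Load a results file and check it against the documented format.
    Hypothetical helper -- it only mirrors the structure shown above."""
    with open(path) as f:
        data = json.load(f)
    assert "config" in data and "results" in data, "missing top-level keys"
    assert "model_name" in data["config"], "config must name the model"
    for task, metrics in data["results"].items():
        for metric, score in metrics.items():
            assert isinstance(score, (int, float)), f"{task}.{metric} must be numeric"
    return data

# Round-trip a small example of the documented format
example = {
    "config": {"model_dtype": "torch.float16",
               "model_name": "org/model",
               "model_sha": "main"},
    "results": {"task_name": {"metric_name": 0.5}},
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(example, f)
data = load_results(f.name)
```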
+ 
+ Request files are created automatically by this tool.
+ 
+ If you encounter a problem on the space, don't hesitate to restart it to remove the `eval-queue`, `eval-queue-bk`, `eval-results` and `eval-results-bk` folders it creates.
+ 
+ # Code logic for more complex edits
+ 
+ You'll find:
+ - the main table's column names and properties in `src/display/utils.py`
+ - the logic to read all results and request files, then convert them into dataframe rows, in `src/leaderboard/read_evals.py` and `src/populate.py`
+ - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
app.py ADDED
@@ -0,0 +1,617 @@
+ import gradio as gr
+ import pandas as pd
+ 
+ 
+ # Your leaderboard name
+ TITLE = """<h1 align="center" id="space-title">Physical AI Bench Leaderboard</h1>"""
+ 
+ # What does your leaderboard evaluate?
+ INTRODUCTION_TEXT = """
+ **Physical AI Bench (PAI-Bench)** is a comprehensive benchmark suite for evaluating physical AI generation and understanding across diverse scenarios including autonomous vehicles, robotics, industrial spaces, and everyday ego-centric environments.
+ """
+ 
+ # Which evaluations are you running? How can people reproduce what you have?
+ LLM_BENCHMARKS_TEXT = """
+ ## How it works
+ 
+ This leaderboard tracks model performance across three core dimensions:
+ 
+ - **🎨 Predict**: Evaluates world foundation models' ability to predict future states across 1,044 diverse physical scenarios
+ - **🔄 Transfer**: Focuses on world model generation with complex control signals, featuring 600 videos across robotic arm operations, autonomous driving, and ego-centric scenes
+ - **🧠 Reason**: Evaluates understanding and reasoning about physical scenes, with 1,214 embodied reasoning scenarios focused on autonomous vehicle actions
+ 
+ PAI-Bench covers multiple physical AI domains including autonomous driving, robotics, industrial spaces, physics simulations, human interactions, and common sense reasoning.
+ 
+ ### Resources
+ - 🌐 [GitHub Repository](https://github.com/SHI-Labs/physical-ai-bench)
+ - 📊 [Predict Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-predict)
+ - 📊 [Transfer Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-transfer)
+ - 📊 [Reason Dataset](https://huggingface.co/datasets/shi-labs/physical-ai-bench-reason)
+ 
+ ## Reproducibility
+ 
+ To evaluate your models on PAI-Bench, visit our [GitHub repository](https://github.com/SHI-Labs/physical-ai-bench) for evaluation scripts and detailed instructions.
+ 
+ ## Citation
+ 
+ If you use Physical AI Bench in your research, please cite:
+ 
+ ```bibtex
+ @misc{PAIBench2025,
+   title={Physical AI Bench: A Comprehensive Benchmark for Physical AI Generation and Understanding},
+   author={Fengzhe Zhou and Jiannan Huang and Jialuo Li and Humphrey Shi},
+   year={2025},
+   url={https://github.com/SHI-Labs/physical-ai-bench}
+ }
+ ```
+ """
+ 
+ 
+ # ============================================================================
+ # Model Links Utility
+ # ============================================================================
+ 
+ def create_model_link(model_name):
+     """
+     Convert a model name to a markdown link to Hugging Face.
+ 
+     Args:
+         model_name: Model name in format "org/model-name" or just a plain name
+ 
+     Returns:
+         Markdown formatted link, or the original name if the format doesn't match
+     """
+     if not isinstance(model_name, str):
+         return model_name
+ 
+     # Check if the model name follows the "org/model" format
+     if '/' in model_name:
+         # This is likely a HuggingFace model ID
+         hf_url = f"https://huggingface.co/{model_name}"
+         return f"[{model_name}]({hf_url})"
+ 
+     # If it doesn't have a slash, return as-is
+     return model_name
+ 
+ 
+ # ============================================================================
+ # Predict Tab Configuration and Utilities
+ # ============================================================================
+ 
+ # Column name mapping (from original name to display name)
+ PREDICT_COLUMN_NAME_MAPPING = {
+     'Common+Misc': 'Common Sense',
+     'BG Consistency': 'Background Consistency',
+     'Motion': 'Motion Smoothness',
+     'Aesthetic': 'Aesthetic Quality',
+     'I2V BG': 'I2V Background'
+ }
+ 
+ # Columns to remove from the dataframe
+ PREDICT_COLUMNS_TO_REMOVE = ['Avg Score/Video', 'Common', 'Misc']
+ 
+ # Desired column order (using renamed column names)
+ PREDICT_COLUMN_ORDER = [
+     'model',
+     'Overall',
+     'Domain Score',
+     'Quality Score',
+     'Common Sense',
+     'AV',
+     'Robot',
+     'Industry',
+     'Human',
+     'Physics',
+     'Subject Consistency',
+     'Background Consistency',
+     'Motion Smoothness',
+     'Aesthetic Quality',
+     'Image Quality',
+     'Overall Consistency',
+     'I2V Subject',
+     'I2V Background',
+     'params',
+     'activate_params'
+ ]
+ 
+ # Columns to hide by default (but still available for filtering/selection)
+ PREDICT_HIDDEN_COLUMNS = ['params', 'activate_params']
+ 
+ # Semantic/Domain dimensions (for selection button)
+ PREDICT_DOMAIN_SCORE_DIMENSIONS = [
+     'Domain Score',
+     'Common Sense',
+     'AV',
+     'Robot',
+     'Industry',
+     'Human',
+     'Physics',
+ ]
+ 
+ # Quality dimensions (for selection button)
+ PREDICT_QUALITY_SCORE_DIMENSIONS = [
+     'Quality Score',
+     'Subject Consistency',
+     'Background Consistency',
+     'Motion Smoothness',
+     'Aesthetic Quality',
+     'Image Quality',
+     'Overall Consistency',
+     'I2V Subject',
+     'I2V Background'
+ ]
+ 
+ PREDICT_DESELECTED_COLUMNS = ['Domain Score', 'Quality Score']
+ 
+ PREDICT_ALL_SELECTED_COLUMNS = [
+     'Domain Score',
+     'Quality Score',
+     'Common Sense',
+     'AV',
+     'Robot',
+     'Industry',
+     'Human',
+     'Physics',
+     'Subject Consistency',
+     'Background Consistency',
+     'Motion Smoothness',
+     'Aesthetic Quality',
+     'Image Quality',
+     'Overall Consistency',
+     'I2V Subject',
+     'I2V Background'
+ ]
+ 
+ # Columns that can never be deselected
+ PREDICT_NEVER_HIDDEN_COLUMNS = ['model', 'Overall']
+ 
+ # Columns displayed by default (using renamed column names)
+ PREDICT_DEFAULT_DISPLAYED_COLUMNS = PREDICT_NEVER_HIDDEN_COLUMNS + PREDICT_ALL_SELECTED_COLUMNS
+ 
+ def load_predict_csv(csv_path):
+     """Load CSV and apply column ordering"""
+     df = pd.read_csv(csv_path)
+ 
+     # Remove specified columns
+     df = df.drop(columns=PREDICT_COLUMNS_TO_REMOVE, errors='ignore')
+ 
+     # Rename columns according to mapping
+     df = df.rename(columns=PREDICT_COLUMN_NAME_MAPPING)
+ 
+     # Reorder columns (only keep columns that exist in the dataframe)
+     available_cols = [col for col in PREDICT_COLUMN_ORDER if col in df.columns]
+     df = df[available_cols]
+ 
+     # Convert model names to HuggingFace links
+     if 'model' in df.columns:
+         df['model'] = df['model'].apply(create_model_link)
+ 
+     # Format numbers to ensure decimal places (1 decimal for numeric columns)
+     for col in df.columns:
+         if col not in ['model', 'params', 'activate_params'] and pd.api.types.is_numeric_dtype(df[col]):
+             df[col] = df[col].apply(lambda x: f"{x:.1f}" if pd.notna(x) else x)
+ 
+     return df
+ 
+ 
+ def select_predict_domain_score():
+     """Return domain score dimensions for checkbox selection"""
+     return gr.update(value=PREDICT_DOMAIN_SCORE_DIMENSIONS)
+ 
+ def select_predict_quality_score():
+     """Return quality score dimensions for checkbox selection"""
+     return gr.update(value=PREDICT_QUALITY_SCORE_DIMENSIONS)
+ 
+ def deselect_predict_all():
+     """Deselect all dimensions"""
+     return gr.update(value=PREDICT_DESELECTED_COLUMNS)
+ 
+ def select_predict_all():
+     """Select all dimensions"""
+     return gr.update(value=PREDICT_ALL_SELECTED_COLUMNS)
+ 
+ def on_predict_dimension_selection_change(selected_columns, full_df):
+     """Handle dimension selection changes and update the dataframe"""
+     # Always include model and Overall columns
+     present_columns = ['model', 'Overall']
+ 
+     # Add selected columns
+     for col in selected_columns:
+         if col not in present_columns and col in full_df.columns:
+             present_columns.append(col)
+ 
+     # Filter dataframe to show only selected columns
+     updated_data = full_df[present_columns]
+ 
+     # Determine datatypes
+     datatypes = []
+     for col in present_columns:
+         if col == 'model':
+             datatypes.append('markdown')
+         elif col in ['params', 'activate_params']:
+             datatypes.append('number')
+         else:
+             datatypes.append('str')
+ 
+     return gr.update(value=updated_data, datatype=datatypes, headers=present_columns)
+ 
+ 
+ def init_predict_leaderboard(dataframe):
+     """Initialize the Predict leaderboard with the given dataframe"""
+     if dataframe is None or dataframe.empty:
+         raise ValueError("Leaderboard DataFrame is empty or None.")
+ 
+     # Get columns that exist in the dataframe
+     available_default_cols = [col for col in PREDICT_DEFAULT_DISPLAYED_COLUMNS if col in dataframe.columns]
+ 
+     # Filter dataframe to show only default columns initially
+     display_df = dataframe[available_default_cols]
+ 
+     # Determine datatypes dynamically
+     datatypes = []
+     for col in display_df.columns:
+         if col == 'model':
+             datatypes.append('markdown')
+         elif col in ['params', 'activate_params']:
+             datatypes.append('number')
+         else:
+             datatypes.append('str')  # All numeric columns are now formatted as strings
+ 
+     # Create the UI components
+     with gr.Row():
+         with gr.Column(scale=1):
+             domain_score_btn = gr.Button("Domain Score", size="md")
+             quality_score_btn = gr.Button("Quality Score", size="md")
+             select_all_btn = gr.Button("Select All", size="md")
+             deselect_btn = gr.Button("Deselect All", size="md")
+ 
+         with gr.Column(scale=4):
+             # Get all dimension columns (exclude model, Overall, and params)
+             dimension_choices = [col for col in dataframe.columns
+                                  if col not in PREDICT_NEVER_HIDDEN_COLUMNS + PREDICT_HIDDEN_COLUMNS]
+ 
+             checkbox_group = gr.CheckboxGroup(
+                 choices=dimension_choices,
+                 value=[col for col in PREDICT_DEFAULT_DISPLAYED_COLUMNS if col in dimension_choices],
+                 label="Evaluation Dimensions",
+                 interactive=True,
+             )
+ 
+     data_component = gr.Dataframe(
+         value=display_df,
+         headers=list(display_df.columns),
+         datatype=datatypes,
+         interactive=False,
+         visible=True,
+         wrap=False,
+         column_widths=["320px"] + ["200px"] * (len(display_df.columns) - 1),
+         pinned_columns=1,
+     )
+ 
+     # Setup event handlers
+     domain_score_btn.click(
+         select_predict_domain_score,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_predict_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     quality_score_btn.click(
+         select_predict_quality_score,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_predict_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     deselect_btn.click(
+         deselect_predict_all,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_predict_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     select_all_btn.click(
+         select_predict_all,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_predict_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     checkbox_group.change(
+         fn=on_predict_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     return data_component
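Stripped of the Gradio plumbing, the selection handler above is a column filter: `model` and `Overall` are always shown, and the user's checked dimensions follow in order. A gradio-free sketch (the frame below uses made-up values in the leaderboard's layout):

```python
import pandas as pd

# Toy frame mimicking the leaderboard layout (values are illustrative).
df = pd.DataFrame({
    "model": ["org/model-a", "org/model-b"],
    "Overall": ["81.0", "80.6"],
    "Domain Score": ["84.0", "84.1"],
    "Quality Score": ["77.9", "77.2"],
})

def visible_columns(selected, full_df):
    # 'model' and 'Overall' are always shown; selected columns follow
    # in the order they were picked, skipping unknown names.
    cols = ["model", "Overall"]
    cols += [c for c in selected if c not in cols and c in full_df.columns]
    return full_df[cols]

view = visible_columns(["Quality Score"], df)
# view.columns -> ['model', 'Overall', 'Quality Score']
```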
+ 
+ 
+ # ============================================================================
+ # Reason Tab Configuration and Utilities
+ # ============================================================================
+ 
+ # Column name mapping for display
+ REASON_COLUMN_MAPPING = {
+     'Physical world': 'Physics'
+ }
+ 
+ # Desired column order
+ REASON_COLUMN_ORDER = [
+     'model',
+     'Overall',
+     'Common Sense',
+     'Embodied Reasoning',
+     'Space',
+     'Time',
+     'Physics',
+     'BridgeData V2',
+     'RoboVQA',
+     'RoboFail',
+     'Agibot',
+     'HoloAssist',
+     'AV',
+     'params',
+     'activate_params'
+ ]
+ 
+ # Columns to hide by default (but still available for filtering/selection)
+ REASON_HIDDEN_COLUMNS = ['params', 'activate_params']
+ 
+ # Common-sense dimensions (for selection button)
+ REASON_COMMON_SENSE_DIMENSIONS = [
+     'Common Sense',
+     'Space',
+     'Time',
+     'Physics',
+ ]
+ 
+ # Embodied-reasoning dimensions (for selection button)
+ REASON_EMBODIED_REASONING_DIMENSIONS = [
+     'Embodied Reasoning',
+     'Space',
+     'Time',
+     'Physics',
+     'BridgeData V2',
+     'RoboVQA',
+     'RoboFail',
+     'Agibot',
+     'HoloAssist',
+     'AV',
+ ]
+ 
+ REASON_DESELECTED_COLUMNS = [
+     'Common Sense',
+     'Embodied Reasoning',
+ ]
+ 
+ REASON_ALL_SELECTED_COLUMNS = [
+     'Common Sense',
+     'Embodied Reasoning',
+     'Space',
+     'Time',
+     'Physics',
+     'BridgeData V2',
+     'RoboVQA',
+     'RoboFail',
+     'Agibot',
+     'HoloAssist',
+     'AV',
+ ]
+ 
+ # Columns that can never be deselected
+ REASON_NEVER_HIDDEN_COLUMNS = ['model', 'Overall']
+ 
+ # Columns displayed by default (using renamed column names)
+ REASON_DEFAULT_DISPLAYED_COLUMNS = REASON_NEVER_HIDDEN_COLUMNS + REASON_ALL_SELECTED_COLUMNS
+ 
+ 
+ def load_reason_csv(csv_path):
+     """Load CSV and apply column mapping and ordering"""
+     df = pd.read_csv(csv_path)
+ 
+     # Apply column mapping
+     df = df.rename(columns=REASON_COLUMN_MAPPING)
+ 
+     # Reorder columns (only keep columns that exist in the dataframe)
+     available_cols = [col for col in REASON_COLUMN_ORDER if col in df.columns]
+     df = df[available_cols]
+ 
+     # Convert model names to HuggingFace links
+     if 'model' in df.columns:
+         df['model'] = df['model'].apply(create_model_link)
+ 
+     # Format numbers to ensure decimal places (1 decimal for numeric columns)
+     for col in df.columns:
+         if col not in ['model', 'params', 'activate_params'] and pd.api.types.is_numeric_dtype(df[col]):
+             df[col] = df[col].apply(lambda x: f"{x:.1f}" if pd.notna(x) else x)
+ 
+     return df
+ 
+ 
+ def select_reason_common_sense_dimensions():
+     """Return common-sense dimensions for checkbox selection"""
+     return gr.update(value=REASON_COMMON_SENSE_DIMENSIONS)
+ 
+ 
+ def select_reason_embodied_reasoning_dimensions():
+     """Return embodied-reasoning dimensions for checkbox selection"""
+     return gr.update(value=REASON_EMBODIED_REASONING_DIMENSIONS)
+ 
+ 
+ def deselect_reason_all():
+     """Deselect all dimensions"""
+     return gr.update(value=REASON_DESELECTED_COLUMNS)
+ 
+ 
+ def select_reason_all():
+     """Select all dimensions"""
+     return gr.update(value=REASON_ALL_SELECTED_COLUMNS)
+ 
+ 
+ def on_reason_dimension_selection_change(selected_columns, full_df):
+     """Handle dimension selection changes and update the dataframe"""
+     # Always include model and Overall columns
+     present_columns = ['model', 'Overall']
+ 
+     # Add selected columns
+     for col in selected_columns:
+         if col not in present_columns and col in full_df.columns:
+             present_columns.append(col)
+ 
+     # Filter dataframe to show only selected columns
+     updated_data = full_df[present_columns]
+ 
+     # Determine datatypes
+     datatypes = []
+     for col in present_columns:
+         if col == 'model':
+             datatypes.append('markdown')
+         elif col in ['params', 'activate_params']:
+             datatypes.append('number')
+         else:
+             datatypes.append('str')
+ 
+     return gr.update(value=updated_data, datatype=datatypes, headers=present_columns)
+ 
+ 
+ def init_reason_leaderboard(dataframe):
+     """Initialize the Reason leaderboard with the given dataframe"""
+     if dataframe is None or dataframe.empty:
+         raise ValueError("Leaderboard DataFrame is empty or None.")
+ 
+     # Get columns that exist in the dataframe
+     available_default_cols = [col for col in REASON_DEFAULT_DISPLAYED_COLUMNS if col in dataframe.columns]
+ 
+     # Filter dataframe to show only default columns initially
+     display_df = dataframe[available_default_cols]
+ 
+     # Determine datatypes dynamically
+     datatypes = []
+     for col in display_df.columns:
+         if col == 'model':
+             datatypes.append('markdown')
+         elif col in ['params', 'activate_params']:
+             datatypes.append('number')
+         else:
+             datatypes.append('str')  # All numeric columns are now formatted as strings
+ 
+     # Create the UI components
+     with gr.Row():
+         with gr.Column(scale=1):
+             common_sense_btn = gr.Button("Common Sense", size="md")
+             embodied_reasoning_btn = gr.Button("Embodied Reasoning", size="md")
+             select_all_btn = gr.Button("Select All", size="md")
+             deselect_btn = gr.Button("Deselect All", size="md")
+ 
+         with gr.Column(scale=4):
+             # Get all dimension columns (exclude model, Overall, and params)
+             dimension_choices = [col for col in dataframe.columns
+                                  if col not in REASON_NEVER_HIDDEN_COLUMNS + REASON_HIDDEN_COLUMNS]
+ 
+             checkbox_group = gr.CheckboxGroup(
+                 choices=dimension_choices,
+                 value=[col for col in REASON_DEFAULT_DISPLAYED_COLUMNS if col in dimension_choices],
+                 label="Evaluation Dimensions",
+                 interactive=True,
+             )
+ 
+     data_component = gr.Dataframe(
+         value=display_df,
+         headers=list(display_df.columns),
+         datatype=datatypes,
+         interactive=False,
+         visible=True,
+         wrap=False,  # Allow horizontal scrolling, don't wrap content
+         column_widths=["320px"] + ["200px"] * (len(display_df.columns) - 1),
+         pinned_columns=1,
+     )
+ 
+     # Setup event handlers
+     common_sense_btn.click(
+         select_reason_common_sense_dimensions,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_reason_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     embodied_reasoning_btn.click(
+         select_reason_embodied_reasoning_dimensions,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_reason_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     deselect_btn.click(
+         deselect_reason_all,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_reason_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     select_all_btn.click(
+         select_reason_all,
+         inputs=None,
+         outputs=[checkbox_group]
+     ).then(
+         fn=on_reason_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     checkbox_group.change(
+         fn=on_reason_dimension_selection_change,
+         inputs=[checkbox_group, gr.State(dataframe)],
+         outputs=data_component
+     )
+ 
+     return data_component
+ 
+ 
+ # ============================================================================
+ # Main Application
+ # ============================================================================
+ 
+ demo = gr.Blocks()
+ with demo:
+     gr.HTML(TITLE)
+     gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
+ 
+     with gr.Tabs(elem_classes="tab-buttons") as tabs:
+         with gr.TabItem("🎨 Predict", elem_id="predict-tab", id=0):
+             # Load data for Predict tab
+             predict_df = load_predict_csv("data/predict-leaderboard.csv")
+             predict_leaderboard = init_predict_leaderboard(predict_df)
+ 
+         with gr.TabItem("🔄 Transfer", elem_id="transfer-tab", id=1):
+             gr.Markdown("## Coming Soon", elem_classes="markdown-text")
+ 
+         with gr.TabItem("🧠 Reason", elem_id="reason-tab", id=2):
+             # Load data for Reason tab
+             reason_df = load_reason_csv("data/reason-leaderboard.csv")
+             reason_leaderboard = init_reason_leaderboard(reason_df)
+ 
+         with gr.TabItem("ℹ️ About", elem_id="about-tab", id=3):
+             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
+ 
+ demo.launch()
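Taken in isolation, the rename-and-format steps of `load_predict_csv` can be reproduced without the app. The column names and sample values below come from this commit's `data/predict-leaderboard.csv` (only two of the mapped columns are shown):

```python
import pandas as pd

# Two raw CSV headers and their display names, as mapped by
# PREDICT_COLUMN_NAME_MAPPING in app.py.
mapping = {"Common+Misc": "Common Sense", "BG Consistency": "Background Consistency"}

raw = pd.DataFrame({
    "model": ["nvidia/Cosmos-Predict2.5-2B"],
    "Common+Misc": [94.1],
    "BG Consistency": [94.2],
})

df = raw.rename(columns=mapping)

# One-decimal string formatting, as in load_predict_csv
for col in df.columns:
    if col != "model" and pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].apply(lambda x: f"{x:.1f}" if pd.notna(x) else x)

# df.loc[0, "Common Sense"] -> "94.1"
```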
data/predict-leaderboard.csv ADDED
@@ -0,0 +1,5 @@
+ model,params,activate_params,Overall,AV,Common,Human,Industry,Misc,Physics,Robot,Avg Score/Video,Common+Misc,Domain Score,Aesthetic,BG Consistency,Image Quality,Motion,Overall Consistency,Subject Consistency,I2V BG,I2V Subject,Quality Score
+ nvidia/Cosmos-Predict2.5-2B,2.0,2.0,81.0,66.1,95.9,81.4,87.8,91.0,93.9,80.8,84.4,94.1,84.0,52.4,94.2,70.8,99.1,20.1,92.5,97.4,96.6,77.9
+ Wan-AI/Wan2.2-I2V-A14B,14.0,14.0,80.6,66.3,94.6,82.1,89.2,90.9,91.8,81.7,84.5,93.2,84.1,51.2,93.7,69.6,98.3,20.4,91.6,96.6,96.0,77.2
+ Wan-AI/Wan2.2-TI2V-5B,5.0,5.0,80.4,65.2,95.3,83.0,88.4,89.6,91.5,79.3,84.1,93.1,83.4,51.9,93.7,69.9,98.8,20.3,91.8,96.7,95.9,77.4
+ Wan-AI/Wan2.1-I2V-14B-720P,14.0,14.0,79.7,66.9,93.7,80.1,89.7,85.5,88.7,80.1,82.9,90.6,82.7,51.5,93.1,70.1,98.1,20.4,90.0,96.0,95.2,76.8
data/reason-leaderboard.csv ADDED
@@ -0,0 +1,10 @@
+ model,params,activate_params,Overall,AV,Agibot,BridgeData V2,Common Sense,Embodied Reasoning,HoloAssist,Physics,RoboFail,RoboVQA,Space,Time
+ Qwen/Qwen3-VL-30B-A3B-Instruct,30.0,3.0,60.6,49.0,43.0,36.0,59.9,61.3,81.0,59.7,67.0,89.1,52.5,62.1
+ Qwen/Qwen2.5-VL-72B-Instruct,72.0,72.0,56.8,39.0,35.0,35.0,57.9,55.7,58.0,52.2,73.0,90.9,56.2,62.8
+ nvidia/Cosmos-Reason1-7B,7.0,7.0,54.3,47.0,42.0,41.0,50.7,57.9,57.0,44.2,65.0,91.8,57.5,53.7
+ Qwen/Qwen2.5-VL-32B-Instruct,32.0,32.0,51.9,33.0,34.0,32.0,53.8,50.0,55.0,45.6,52.0,90.0,50.0,61.1
+ Qwen/Qwen2.5-VL-7B-Instruct,7.0,7.0,50.3,45.0,44.0,33.0,47.7,53.0,47.0,37.6,62.0,83.6,47.5,55.4
+ Qwen/Qwen2.5-VL-3B-Instruct,3.0,3.0,48.1,29.0,36.0,31.0,47.4,48.9,48.0,42.9,63.0,82.7,47.5,50.7
+ Qwen/Qwen2-VL-2B-Instruct,2.0,2.0,40.0,51.0,24.0,25.0,44.5,35.4,28.0,41.2,34.0,49.1,32.5,50.3
+ Qwen/Qwen2-VL-72B-Instruct,72.0,72.0,40.0,25.0,31.0,28.0,45.0,34.9,21.0,40.3,49.0,53.6,50.0,47.3
+ Qwen/Qwen2-VL-7B-Instruct,7.0,7.0,38.8,24.0,28.0,28.0,44.5,33.1,26.0,44.7,38.0,52.7,38.8,46.0
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio
+ pandas