{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# chatflash weakness analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"Accuracy (%) by video duration: short -> medium -> long.\n",
"\n",
"- Perception\n",
"  - Object Recognition: 76.8 -> 65.9 -> 50.0\n",
"  - Action Recognition: 80.2 -> 64.7 -> 55.6\n",
"  - Attribute Perception: 82.8 -> 74.0 -> 55.6 (few samples)\n",
"  - Spatial Perception: 83.3 -> 47.6 (few samples) -> 0.0 (few samples)\n",
"  - Temporal Perception: too few samples\n",
"\n",
"Attribute Perception does not seem to need much work; its score on medium videos is already decent.\n",
"\n",
"- Reasoning\n",
"  - Object Reasoning: 75.0 -> 66.4 -> 52.9\n",
"  - Action Reasoning: 70.2 -> 65.5 -> 54.4\n",
"  - Temporal Reasoning: 69.2 -> 49.3 -> 47.3 (few samples)\n",
"  - Spatial Reasoning: 85.2 -> 77.8 -> 81.8\n",
"\n",
"- Others\n",
"  - Counting Problem: 56.8 -> 38.9 -> 31.2\n"
]
},
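{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several buckets above are marked as having few samples, so their accuracies are noisy. As a rough illustration (not part of the original analysis), the sketch below computes a 95% Wilson score interval from a `Correct`/`Total` pair. The helper name `wilson_interval` is hypothetical, and the counts are the long-video Spatial Perception (0/3) and Attribute Perception (15/27) buckets from the per-duration results in this notebook.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def wilson_interval(correct, total, z=1.96):\n",
"    # Wilson score interval for a binomial proportion (z=1.96 gives ~95%).\n",
"    if total == 0:\n",
"        return 0.0, 1.0\n",
"    p = correct / total\n",
"    denom = 1 + z * z / total\n",
"    center = (p + z * z / (2 * total)) / denom\n",
"    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom\n",
"    return max(0.0, center - half), min(1.0, center + half)\n",
"\n",
"\n",
"# Counts taken from the 'long' bucket of the per-duration results.\n",
"for name, correct, total in [('Spatial Perception', 0, 3), ('Attribute Perception', 15, 27)]:\n",
"    low, high = wilson_interval(correct, total)\n",
"    print(f'{name}: {correct}/{total}, 95% CI = [{low * 100:.1f}, {high * 100:.1f}]')\n",
"```\n",
"\n",
"A wide interval suggests the bucket is too small to be ranked reliably against the others."
]
},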
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"path = '/share/minghao/VideoProjects/VideoChat-Flash/lmms-eval_videochat/videochat-flash-7B@448_eval_log_videomme.json'\n",
"\n",
"# Load the evaluation log; per-sample results are stored under datas['logs'].\n",
"with open(path, 'r') as f:\n",
"    datas = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def update_score(table, key, is_correct):\n",
"    # Increment the Total/Correct counters for `key`, creating them on first use.\n",
"    info = table.setdefault(key, {'Total': 0, 'Correct': 0})\n",
"    info['Total'] += 1\n",
"    info['Correct'] += int(is_correct)\n",
"\n",
"\n",
"def add_accuracy(table):\n",
"    # Attach the accuracy in percent, rounded to one decimal place.\n",
"    for info in table.values():\n",
"        info['Acc'] = round(info['Correct'] / info['Total'] * 100, 1)\n",
"\n",
"\n",
"# Per-task-type scores, overall and split by video duration.\n",
"score_task_type = {}\n",
"score_duration_task_type = {'short': {}, 'medium': {}, 'long': {}}\n",
"\n",
"for data in datas['logs']:\n",
"    duration = data['doc']['duration']\n",
"    task_type = data['doc']['task_type']\n",
"    # 'videomme_percetion_score' (sic) is the key used in the eval log.\n",
"    result = data['videomme_percetion_score']\n",
"    correct_flag = result['pred_answer'] == result['answer']\n",
"\n",
"    update_score(score_task_type, task_type, correct_flag)\n",
"    update_score(score_duration_task_type[duration], task_type, correct_flag)\n",
"\n",
"add_accuracy(score_task_type)\n",
"for task_type_info in score_duration_task_type.values():\n",
"    add_accuracy(task_type_info)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Counting Problem': {'Total': 268, 'Correct': 123, 'Acc': 45.9},\n",
" 'Information Synopsis': {'Total': 323, 'Correct': 259, 'Acc': 80.2},\n",
" 'Object Recognition': {'Total': 354, 'Correct': 243, 'Acc': 68.6},\n",
" 'Action Reasoning': {'Total': 285, 'Correct': 169, 'Acc': 59.3},\n",
" 'Object Reasoning': {'Total': 454, 'Correct': 276, 'Acc': 60.8},\n",
" 'Temporal Perception': {'Total': 55, 'Correct': 47, 'Acc': 85.5},\n",
" 'Attribute Perception': {'Total': 222, 'Correct': 170, 'Acc': 76.6},\n",
" 'Temporal Reasoning': {'Total': 177, 'Correct': 88, 'Acc': 49.7},\n",
" 'Action Recognition': {'Total': 313, 'Correct': 217, 'Acc': 69.3},\n",
" 'OCR Problems': {'Total': 139, 'Correct': 90, 'Acc': 64.7},\n",
" 'Spatial Perception': {'Total': 54, 'Correct': 35, 'Acc': 64.8},\n",
" 'Spatial Reasoning': {'Total': 56, 'Correct': 46, 'Acc': 82.1}}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_task_type"
]
},
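{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Temporal Reasoning (49.7): fine on short videos, but breaks down as videos get longer\n",
"# Counting Problem (45.9): the longer the video, the lower the score; even on short videos it is far below the average\n",
"# Action Reasoning (59.3): accuracy decreases as video length increases, which also points to weak localization ability\n",
"# Object Reasoning (60.8): same pattern as Action Reasoning"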
]
},
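{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comments above describe how accuracy changes with video length. A minimal sketch of how that drop could be ranked per task type follows. The helper name `short_to_long_drop` is hypothetical, and the inline numbers are a subset copied from the per-duration results in this notebook rather than loaded from the log.\n",
"\n",
"```python\n",
"def short_to_long_drop(scores):\n",
"    # Accuracy lost between short and long videos, per task type, largest first.\n",
"    drops = {\n",
"        task: round(scores['short'][task]['Acc'] - scores['long'][task]['Acc'], 1)\n",
"        for task in scores['short']\n",
"        if task in scores['long']\n",
"    }\n",
"    return sorted(drops.items(), key=lambda item: item[1], reverse=True)\n",
"\n",
"\n",
"# Subset of the per-duration accuracies (same layout as score_duration_task_type).\n",
"example_scores = {\n",
"    'short': {'Counting Problem': {'Acc': 56.8}, 'Temporal Reasoning': {'Acc': 69.2}, 'Spatial Reasoning': {'Acc': 85.2}},\n",
"    'long': {'Counting Problem': {'Acc': 31.2}, 'Temporal Reasoning': {'Acc': 47.3}, 'Spatial Reasoning': {'Acc': 81.8}},\n",
"}\n",
"\n",
"for task, drop in short_to_long_drop(example_scores):\n",
"    print(f'{task}: -{drop} points')\n",
"```\n",
"\n",
"Passing the notebook's `score_duration_task_type` instead of `example_scores` would rank every task type the same way."
]
},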
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'short': {'Counting Problem': {'Total': 125, 'Correct': 71, 'Acc': 56.8},\n",
" 'Information Synopsis': {'Total': 82, 'Correct': 72, 'Acc': 87.8},\n",
" 'Object Recognition': {'Total': 168, 'Correct': 129, 'Acc': 76.8},\n",
" 'Action Reasoning': {'Total': 47, 'Correct': 33, 'Acc': 70.2},\n",
" 'Object Reasoning': {'Total': 80, 'Correct': 60, 'Acc': 75.0},\n",
" 'Temporal Perception': {'Total': 18, 'Correct': 16, 'Acc': 88.9},\n",
" 'Attribute Perception': {'Total': 122, 'Correct': 101, 'Acc': 82.8},\n",
" 'Temporal Reasoning': {'Total': 13, 'Correct': 9, 'Acc': 69.2},\n",
" 'Action Recognition': {'Total': 131, 'Correct': 105, 'Acc': 80.2},\n",
" 'OCR Problems': {'Total': 57, 'Correct': 45, 'Acc': 78.9},\n",
" 'Spatial Perception': {'Total': 30, 'Correct': 25, 'Acc': 83.3},\n",
" 'Spatial Reasoning': {'Total': 27, 'Correct': 23, 'Acc': 85.2}},\n",
" 'medium': {'Counting Problem': {'Total': 95, 'Correct': 37, 'Acc': 38.9},\n",
" 'Object Reasoning': {'Total': 134, 'Correct': 89, 'Acc': 66.4},\n",
" 'Object Recognition': {'Total': 132, 'Correct': 87, 'Acc': 65.9},\n",
" 'OCR Problems': {'Total': 68, 'Correct': 40, 'Acc': 58.8},\n",
" 'Temporal Reasoning': {'Total': 73, 'Correct': 36, 'Acc': 49.3},\n",
" 'Information Synopsis': {'Total': 78, 'Correct': 64, 'Acc': 82.1},\n",
" 'Action Reasoning': {'Total': 58, 'Correct': 38, 'Acc': 65.5},\n",
" 'Spatial Perception': {'Total': 21, 'Correct': 10, 'Acc': 47.6},\n",
" 'Attribute Perception': {'Total': 73, 'Correct': 54, 'Acc': 74.0},\n",
" 'Action Recognition': {'Total': 119, 'Correct': 77, 'Acc': 64.7},\n",
" 'Spatial Reasoning': {'Total': 18, 'Correct': 14, 'Acc': 77.8},\n",
" 'Temporal Perception': {'Total': 31, 'Correct': 28, 'Acc': 90.3}},\n",
" 'long': {'Information Synopsis': {'Total': 163, 'Correct': 123, 'Acc': 75.5},\n",
" 'Object Reasoning': {'Total': 240, 'Correct': 127, 'Acc': 52.9},\n",
" 'Attribute Perception': {'Total': 27, 'Correct': 15, 'Acc': 55.6},\n",
" 'Action Reasoning': {'Total': 180, 'Correct': 98, 'Acc': 54.4},\n",
" 'Temporal Reasoning': {'Total': 91, 'Correct': 43, 'Acc': 47.3},\n",
" 'Object Recognition': {'Total': 54, 'Correct': 27, 'Acc': 50.0},\n",
" 'Temporal Perception': {'Total': 6, 'Correct': 3, 'Acc': 50.0},\n",
" 'Counting Problem': {'Total': 48, 'Correct': 15, 'Acc': 31.2},\n",
" 'Action Recognition': {'Total': 63, 'Correct': 35, 'Acc': 55.6},\n",
" 'Spatial Reasoning': {'Total': 11, 'Correct': 9, 'Acc': 81.8},\n",
" 'OCR Problems': {'Total': 14, 'Correct': 5, 'Acc': 35.7},\n",
" 'Spatial Perception': {'Total': 3, 'Correct': 0, 'Acc': 0.0}}}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_duration_task_type"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Save the per-duration, per-task-type statistics for later use.\n",
"statics_path = '/share/minghao/VideoProjects/Sythesis2/Videomme/videomme.json'\n",
"with open(statics_path, 'w') as file:\n",
"    json.dump(score_duration_task_type, file, indent=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task definition"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Start with Video-MME\n",
"import pandas as pd\n",
"\n",
"datas_path = '/share_2/mm_data_dir/data_1/Video-MME/videomme/test-00000-of-00001.parquet'\n",
"df = pd.read_parquet(datas_path)\n",
"datas_dict_list = df.to_dict(orient=\"records\")\n",
"record_datas_by_task_type = {}\n",
"\n",
"# Keep only medium and long videos.\n",
"target_duration = ['medium', 'long']\n",
"\n",
"for data in datas_dict_list:\n",
"    task_type = data['task_type']\n",
"    duration = data['duration']\n",
"    if duration not in target_duration:\n",
"        continue\n",
"    # Group the records by task type.\n",
"    record_datas_by_task_type.setdefault(task_type, []).append(data)\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"########## Temporal Reasoning ##########\n",
"Q: What is the correct chronological order in which the following parts of the video appear?\n",
"(a) Human lungs.\n",
"(b) Protein folding distortion.\n",
"(c) Mice, plants and cells.\n",
"Options: A. (a)(b)(c). B. (c)(a)(b). C. (c)(b)(a). D. (b)(c)(a).\n",
"Answer: C\n",
"\n",
"\n",
"Q: What is the order in which the following long jump processes appear in the video?\n",
"① 8.53m Rome 2018\n",
"② 8.53m London 2018\n",
"③ 8.56m Shanghai 2018\n",
"④ 8.53m Birmingham 2018\n",
"Options: A. ②①④③. B. ④①②③. C. ②③①④. D. ①②③④.\n",
"Answer: A\n",
"\n",
"\n",
"Q: In what order are the following planets introduced in the video?\n",
"Options: A. Venus, Jupiter, Neptune. B. Mercury, Jupiter, Mars. C. Venus, Neptune, Jupiter. D. Jupiter, Mercury, Neptune.\n",
"Answer: A\n",
"\n",
"\n",
"Q: What is the second to last news item in this segment?\n",
"Options: A. CBC agrees to meeting on safer-supply drugs. B. Kate Middleton photo manipulation concern. C. Entrepreneur explains why local radio isn't dead. D. Canadians stranded in Haiti as violence escalates.\n",
"Answer: C\n",
"\n",
"\n",
"Q: What is the order in which the following is introduced in the video?\n",
"① Sleepwalking and the Brain.\n",
"② How Much Control Do We Have of Our Brain?\n",
"③ Emotions and the Brain.\n",
"④ Creativity and the Brain.\n",
"Options: A. ①②③④. B. ①③②④. C. ②①④③. D. ①③②④.\n",
"Answer: B\n",
"\n",
"\n",
"Q: Which of the following describes the heroine's weekly fitness programme correctly?\n",
"Options: A. Glutes and hamstrings on Monday, pulls on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, push on Friday and Cardio and Core on Saturday. B. Glutes and hamstrings on Monday, push on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, pulls on Friday and Cardio and Core on Saturday. C. Glutes and hamstrings on Monday, pulls on Tuesday, Wednesday off while the lady goes to the hairdresser's and gets her hair done, Quads and calves on Thursday, push on Friday and Cardio and Core on Saturday. D. None of the above.\n",
"Answer: B\n",
"\n",
"\n",
"Q: In what sequence are the following topics introduced in this video?\n",
"(a) Different types of grills.\n",
"(b) Venting.\n",
"(c) Lighting considerations.\n",
"(d) Cooking options.\n",
"(e) Outdoor kitchen configurations.\n",
"Options: A. (b)(c)(a)(d)(d). B. (d)(a)(c)(b)(e). C. (a)(c)(b)(d)(e). D. (c)(b)(a)(d)(a).\n",
"Answer: D\n",
"\n",
"\n",
"Q: What activities are recorded sequentially in the video?\n",
"Options: A. Explore ice sheet, take sled dog rides, visit residents. B. Explore ice sheet, visit residents, take sled dog rides. C. Take sled dog rides, visit residents, explore ice sheet. D. Visit residents, explore ice sheet, take sled dog rides.\n",
"Answer: A\n",
"\n",
"\n",
"Q: In which order do the following events happen in this video?\n",
"(a) Mr. Bean falls in love with the beautiful singer Roxy and has a wish to get an autograph from her. His attempts are mostly foiled by the bodyguard until he manages to get a kiss mark from Roxy using her handkerchief and Bean is happy.\n",
"(b) Whilst digging for treasure, Mr. Bean builds his own metal detector and goes to hunt for treasure but fails. When he manages to get the treasure and bring it to his flat.\n",
"(c) Mr. Bean and Irma are off for a day at the seaside, where his trunk gets accidentally swapped with that of a stage magician.\n",
"(d) Bean attends a hypnotism show. He unwittingly volunteers to be hypnotised; when the hypnotist makes him think he's a dog, he runs away and then back at home chases Scrapper around the garden and the house.\n",
"Options: A. (b)(a)(c)(d). B. (a)(c)(d)(b). C. (c)(d)(a)(b). D. (d)(c)(b)(a).\n",
"Answer: C\n",
"\n",
"\n",
"Q: Which animals are introduced in sequence in the video?\n",
"Options: A. Argentinosaurus, Giant Rhinoceros, Titanoboa, Leedsichthys. B. Leedsichthys, Argentinosaurus, Giant Rhinoceros, Titanoboa. C. Titanoboa, Leedsichthys, Argentinosaurus, Giant Rhinoceros. D. Giant Rhinoceros, Titanoboa, Leedsichthys, Argentinosaurus.\n",
"Answer: D\n",
"\n",
"\n",
"Q: When the people in the video saw the turtles for the first time, how many days into their journey was it?\n",
"Options: A. Day 2. B. Day 3. C. Day 1. D. Day 4.\n",
"Answer: B\n",
"\n",
"\n",
"Q: In what order do the 2012 London men's 100m final, the 2018 Beijing men's 100m final, and the 2012 London women's 100m final appear in the video?\n",
"Options: A. The 2008 men's 100m final in Beijing appeared first, then the 2012 London women's 100m final, and the 2012 London men's 100m final appeared last in the video. B. The 2012 London men's 100m final appears first, then the 2012 London women's 100m final, and the 2008 Beijing men's 100m final appears last in the video. C. The 2012 London women's 100m final appears first, then the 2012 London men's 100m final, and the 2008 Beijing men's 100m final appears last in the video. D. Neither.\n",
"Answer: C\n",
"\n",
"\n",
"Q: Which of the following options correctly sorts the sequence of events in the video?\n",
"Options: A. Walking, eating, hanging laundry, watching friends unbox watches. B. Eating, hanging laundry, walking, watching friends unboxing watches. C. Hanging laundry, eating, walking, watching friends unboxing watches. D. Hanging laundry, eating, watching friends unboxing watches, taking a walk.\n",
"Answer: D\n",
"\n",
"\n",
"Q: What is the order in which the following is introduced in the video?\n",
"① Insulin Resistance Explained.\n",
"② Case studies.\n",
"③ How to Reverse Type 2 Diabetes.\n",
"Options: A. ①②③. B. ①③②. C. ③①②. D. ①③②.\n",
"Answer: D\n",
"\n",
"\n",
"Q: In what order do the following events appear in the video?\n",
"①Receiving a parcel\n",
"②Making bibimbap\n",
"③Cleaning the floor\n",
"Options: A. ①②③. B. ②①③. C. ②③①. D. ③②①.\n",
"Answer: B\n",
"\n",
"\n",
"Q: Which of the following is the correct order of events in the video?\n",
"Options: A. The court receives case materials, defense attorneys meet with clients, key witnesses testify, and prosecutors compile evidence lists. B. The police arrested two suspects, the prosecutor discussed the case together, the defense lawyer discussed the composition of the jury, and the prosecutor produced a gun as evidence. C. Detectives collect evidence from the scene, prosecutors and police meet to discuss the case, and during court arguments, the defense raises objections to the evidence. D. Psychology experts testify in court, the prosecutor takes out a gun as evidence, the defense lawyer discusses the composition of the jury, and the prosecutor discusses the case together.\n",
"Answer: B\n",
"\n",
"\n",
"Q: According to this video, in which order do the following events happened when the performer made the first two pigeons?\n",
"(a) He duplicated the pigeon and produced another.\n",
"(b) He produced a pigeon from a burning leather.\n",
"(c) He put the pigeons into a cage.\n",
"Options: A. (a)(b)(c). B. (b)(c)(a). C. (c)(b)(a). D. (b)(a)(c).\n",
"Answer: D\n",
"\n",
"\n",
"Q: According to the order in the video, what is the fourth GPT introduced (except GPT Finder)?\n",
"Options: A. Diagrams. B. Logo Creator. C. Prompt Perfect. D. GPT Finder.\n",
"Answer: B\n",
"\n",
"\n",
"Q: How often do the people in the video take water breaks per workout?\n",
"Options: A. 10 minutes. B. 5 minutes. C. 15 minutes. D. No regular breaks.\n",
"Answer: B\n",
"\n",
"\n",
"Q: What is the order of the swimming strokes used by the athletes in the video?\n",
"Options: A. Backstroke, Breaststroke, Butterfly, Freestyle. B. Breaststroke, Butterfly, Backstroke, Freestyle. C. Butterfly, Backstroke, Breaststroke, Freestyle. D. Butterfly, freestyle, backstroke, breaststroke.\n",
"Answer: C\n",
"\n",
"\n"
]
}
],
"source": [
"import random\n",
"\n",
"check_task_type = ['Temporal Reasoning']  # 'Attribute Perception', 'Action Recognition'\n",
"\n",
"for task_type, datas in record_datas_by_task_type.items():\n",
"    if task_type not in check_task_type:\n",
"        continue\n",
"\n",
"    print('#'*10, task_type, '#'*10)\n",
"    # Print up to 20 random questions; min() avoids a ValueError for small task types.\n",
"    samples = random.sample(datas, min(20, len(datas)))\n",
"    for sample in samples:\n",
"        print(f'Q: {sample[\"question\"]}')\n",
"        print(f'Options: {\" \".join(sample[\"options\"])}')\n",
"        print(f'Answer: {sample[\"answer\"]}')\n",
"        print('\\n')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}