---
license: apache-2.0
datasets:
- wchai/AuroraCap-trainset
base_model:
- lmsys/vicuna-7b-v1.5-16k
tags:
- caption
model-index:
- name: AuroraCap-7B
results:
- task:
type: video detailed caption
dataset:
type: VDC
name: VDC
metrics:
- type: Acc
value: 38.21
name: VDCScore
- type: Acc
value: 48.33
name: VDD
- type: cider
value: 9.51
- type: bleu
value: 30.9
name: bleu@1
- type: bleu
value: 4.06
name: bleu@4
- type: meteor
value: 19.09
- type: rouge
value: 21.58
name: rouge-l
- task:
type: video caption
dataset:
type: MSR-VTT
        name: MSR-VTT
metrics:
- type: cider
value: 33.1
- type: bleu
value: 58.6
name: bleu@1
- type: bleu
value: 21
name: bleu@4
- type: meteor
value: 23.9
- type: rouge
value: 49.5
name: rouge-l
- task:
type: video caption
dataset:
type: VATEX
name: VATEX
metrics:
- type: cider
value: 33.8
- type: bleu
value: 57.1
name: bleu@1
- type: bleu
value: 18.4
name: bleu@4
- type: meteor
value: 19
- type: rouge
value: 40.8
name: rouge-l
- task:
      type: video question answering
dataset:
type: ActivityNet
name: ActivityNet
metrics:
- type: Acc
value: 61.8
- task:
      type: video question answering
dataset:
type: MSVD
name: MSVD
metrics:
- type: Acc
value: 62.6
- task:
      type: video question answering
dataset:
type: MSR-VTT
name: MSR-VTT
metrics:
- type: Acc
value: 43.5
- task:
      type: video question answering
dataset:
type: iVQA
name: iVQA
metrics:
- type: Acc
value: 55.2
pipeline_tag: video-text-to-text
---
<img src="assets/teaser.png" align="center">
## Resources
- [Website](https://rese1f.github.io/aurora-web/)
- [arXiv: Paper](https://arxiv.org/abs/2410.03051)
- [GitHub: Code](https://github.com/rese1f/aurora)
- [Huggingface: AuroraCap Model](https://huggingface.co/collections/Reself/auroracap-66d117ffe13bedda96702013)
- [Huggingface: VDC Benchmark](https://huggingface.co/datasets/Reself/Video-Detailed-Caption)
- [Huggingface: Trainset](https://huggingface.co/datasets/Reself/AuroraCap-trainset)
## Features
<img src="assets/assets_vdc_baseline.png" align="center">
AuroraCap is a multimodal large language model for image and video captioning.
## Quick Start
See [Docs](https://github.com/rese1f/aurora/blob/main/docs/auroracap/README.md).
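If you only want to pull the checkpoint locally before following the docs, a minimal sketch using `huggingface_hub` is shown below. The repo id is a placeholder for illustration; use the exact AuroraCap-7B checkpoint name from the collection linked above.
```python
from huggingface_hub import snapshot_download

# Download an AuroraCap-7B checkpoint from the Hub before running the
# inference/training scripts described in the docs. The repo id below is a
# placeholder; pick the exact checkpoint from the AuroraCap collection.
local_dir = snapshot_download(repo_id="Reself/AuroraCap-7B-VID-xtuner")
print(local_dir)  # local path to pass to the aurora scripts
```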
## FAQ
Q: Can I only use token merging during inference?
A: No. Our experiments show that token merging also accelerates training while maintaining similar performance. Besides AuroraCap, token merging can be applied to other LLaVA-like models as well (see the sketch below).
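As a rough illustration of what token merging does to the visual token stream, here is a minimal PyTorch sketch of ToMe-style bipartite soft matching. It is not the AuroraCap implementation (which, among other details, tracks merged token sizes and preserves token ordering inside the vision encoder); the function name and the simple mean-merge are assumptions made for brevity.
```python
import torch
import torch.nn.functional as F

def bipartite_soft_matching_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs (ToMe-style bipartite soft matching).

    x: (batch, num_tokens, dim) visual tokens; returns (batch, num_tokens - r, dim).
    Illustrative sketch only, not the exact AuroraCap code.
    """
    b, n, d = x.shape
    r = min(r, n // 2)  # at most half the tokens can be merged in one pass
    if r <= 0:
        return x

    # Split tokens into two alternating sets A (even indices) and B (odd indices).
    a, b_set = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every A token and every B token.
    scores = F.normalize(a, dim=-1) @ F.normalize(b_set, dim=-1).transpose(-1, -2)

    # Each A token picks its best match in B; keep only the r strongest edges.
    node_max, node_idx = scores.max(dim=-1)                  # (b, |A|)
    edge_order = node_max.argsort(dim=-1, descending=True)
    merged_idx = edge_order[:, :r]                           # A tokens merged away
    kept_idx = edge_order[:, r:]                             # A tokens kept as-is

    # Gather the surviving A tokens.
    kept_a = a.gather(1, kept_idx.unsqueeze(-1).expand(-1, -1, d))

    # Average each merged A token into its matched B token.
    dst_idx = node_idx.gather(1, merged_idx)                 # merge targets in B
    src = a.gather(1, merged_idx.unsqueeze(-1).expand(-1, -1, d))
    merged_b = b_set.scatter_reduce(
        1, dst_idx.unsqueeze(-1).expand(-1, -1, d), src,
        reduce="mean", include_self=True,
    )

    return torch.cat([kept_a, merged_b], dim=1)


tokens = torch.randn(1, 576, 1024)                 # e.g. patch tokens from a ViT
reduced = bipartite_soft_matching_merge(tokens, r=288)
print(reduced.shape)                               # torch.Size([1, 288, 1024])
```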
Q: Why do we provide both official LLaVA-format and Xtuner format weights for AuroraCap?
A: While Xtuner supports saving checkpoints in multiple formats, it currently only allows continued training from the Xtuner format. Therefore, we currently provide the Xtuner-format model for both continued training and inference. In the future, we will also provide the model in the official LLaVA format for both training and inference, enabling quicker SGLang deployment and integration with the Transformers library.
## Citation
```
@article{chai2024auroracap,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},
  journal={arXiv preprint arXiv:2410.03051},
  year={2024}
}
``` |