---
license: apache-2.0
datasets:
- wchai/AuroraCap-trainset
base_model:
- lmsys/vicuna-7b-v1.5-16k
tags:
- caption
model-index:
- name: AuroraCap-7B
  results:
  - task:
      type: image caption
    dataset:
      type: Flickr
      name: Flickr
    metrics:
    - type: cider
      value: 88.9
    - type: bleu
      value: 75.6
      name: bleu@1
    - type: bleu
      value: 32.8
      name: bleu@4
    - type: meteor
      value: 26.7
    - type: rouge
      value: 55.4
      name: rouge-l
  - task:
      type: image caption
    dataset:
      type: NoCaps
      name: NoCaps
    metrics:
    - type: cider
      value: 111.4
    - type: bleu
      value: 85.6
      name: bleu@1
    - type: bleu
      value: 44.4
      name: bleu@4
    - type: meteor
      value: 29.9
    - type: rouge
      value: 60.6
      name: rouge-l
  - task:
      type: image caption
    dataset:
      type: COCO-Cap
      name: COCO-Cap
    metrics:
    - type: cider
      value: 120.8
    - type: bleu
      value: 78
      name: bleu@1
    - type: bleu
      value: 35.3
      name: bleu@4
    - type: meteor
      value: 28.6
    - type: rouge
      value: 57.2
      name: rouge-l
pipeline_tag: image-text-to-text
---

<img src="assets/teaser.png" align="center">

## Resources

- [Website](https://rese1f.github.io/aurora-web/)
- [arXiv: Paper](https://arxiv.org/abs/2410.03051)
- [GitHub: Code](https://github.com/rese1f/aurora)
- [Huggingface: AuroraCap Model](https://huggingface.co/collections/Reself/auroracap-66d117ffe13bedda96702013)
- [Huggingface: VDC Benchmark](https://huggingface.co/datasets/Reself/Video-Detailed-Caption)
- [Huggingface: Trainset](https://huggingface.co/datasets/Reself/AuroraCap-trainset)
  
## Features

<img src="assets/vdc_baseline.png" align="center">

AuroraCap is a multimodal large language model for image and video captioning. 

## Quick Start

See the [Docs](https://github.com/rese1f/aurora/blob/main/docs/auroracap/README.md).

## FAQ

Q: Can token merging only be used during inference?

A: No. Our experiments show that token merging also accelerates training while maintaining similar performance. Besides AuroraCap, you can also apply token merging to other LLaVA-like models.
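
To illustrate the general idea behind token merging, here is a minimal pure-Python sketch of a simplified bipartite matching step that averages the most similar token pairs. It is not AuroraCap's implementation: the function names `cosine` and `merge_tokens`, the parameter `r` (the number of tokens to remove), and the plain averaging rule are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by up to r (hypothetical helper, illustrative only)."""
    # Split tokens alternately into two sets: A (even positions), B (odd positions).
    set_a, set_b = tokens[0::2], tokens[1::2]
    if not set_b or r <= 0:
        return [list(t) for t in tokens]

    # For each token in A, find its most similar partner in B.
    edges = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        edges.append((sims[j], i, j))

    # Keep only the r strongest edges; those A tokens are merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged = {i: j for _, i, j in edges[:r]}

    # Average each B token with the A tokens assigned to it.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    new_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]

    # Output: unmerged A tokens followed by the (possibly merged) B tokens.
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged]
    return kept_a + new_b


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
    print(len(merge_tokens(visual_tokens, r=1)))
```

Because the merge only shortens the token sequence, a step like this can in principle be applied in both training and inference, which is why the speed-up is not limited to inference.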

Q: Why do we provide both official LLaVA-format and Xtuner-format weights for AuroraCap?

A: Xtuner can save checkpoints in multiple formats, but it currently supports continued training only from the Xtuner format. We therefore currently provide the model in the Xtuner format for both continued training and inference. In the future, we will also provide the model in the official LLaVA format for both training and inference, enabling quicker SGLang deployment and integration with the `transformers` library.

## Citation

```
@article{chai2024auroracap,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jenq-Neng Hwang and Saining Xie and Christopher D. Manning},
  journal={arXiv preprint arXiv:2410.03051},
  year={2024}
}
```