Commit c89d507
1 Parent(s): b3b4609
Update README.md
README.md CHANGED
@@ -1,6 +1,18 @@
 ---
 license: llama2
 ---
+## 😮 Highlights
+
+### 💡 Unified visual representation for image and video
+We employ **a set of dynamic visual tokens** to uniformly represent images and videos.
+This representation framework empowers the model to efficiently utilize **a limited number of visual tokens** to simultaneously capture **the spatial details necessary for images** and **the comprehensive temporal relationship required for videos**.
+
+### 🔥 Joint training strategy, making LLMs understand both image and video
+Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications.
+
+### 🤗 High performance, complementary learning with image and video
+Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.
+
 
 ### Inference for Video Understanding
 ```python