Update README.md
README.md
CHANGED
@@ -7,25 +7,12 @@ license: apache-2.0
---
# Model Card for Video-LLaVA - CinePile fine tune
-

-
-
-
-
-## Model Sources
-
-<!-- Provide the basic links for the model. -->
-
-- **Repository:** [GitHub](https://github.com/mfarre/Video-LLaVA-7B-hf-CinePile) with fine-tuning and inference notebook.
-## Uses
-
-
-Although the model can answer questions based on the content, it is specifically optimized for addressing CinePile-related queries.
-When the questions do not follow a CinePile-specific prompt, the inference section of the notebook is designed to refine and clean up the text produced by the model.

## Results
-Extending CinePile's Model Evaluations [arxiv](https://arxiv.org/abs/2405.08813)

| Model | Average | Character and relationship dynamics | Narrative and Plot Analysis | Setting and Technical Analysis | Temporal | Theme Exploration |
|--------------------------------|---------|-------------------------------------|-----------------------------|--------------------------------|----------|-------------------|
@@ -37,8 +24,27 @@ Extending CinePile's Model Evaluations [arxiv](https://arxiv.org/abs/2405.08813)
| Gemini 1.5 Flash | 57.52 | 61.91 | 69.15 | 54.86 | 41.34 | 61.22 |
| Gemini Pro Vision | 50.64 | 54.16 | 65.5 | 46.97 | 35.8 | 58.82 |
| Claude 3 (Opus) | 45.6 | 48.89 | 57.88 | 40.73 | 37.65 | 47.89 |
-| **Video LlaVa -
| Video LLaVa | 22.51 | 23.11 | 25.92 | 20.69 | 22.38 | 22.63 |
| mPLUG-Owl | 10.57 | 10.65 | 11.04 | 9.18 | 11.89 | 15.05 |
| Video-ChatGPT | 14.55 | 16.02 | 14.83 | 15.54 | 6.88 | 18.86 |
-| MovieChat | 4.61 | 4.95 | 4.29 | 5.23 | 2.48 | 4.21 |
---
# Model Card for Video-LLaVA - CinePile fine tune
+Video multimodal research often emphasizes activity recognition and object-centered tasks, such as determining "what is the person wearing a red hat doing?" However, this focus overlooks areas like theme exploration, narrative and plot analysis, and character and relationship dynamics.
+CinePile addresses these areas in its benchmark, and its authors find that large language models lag significantly behind human performance on these tasks. Additionally, there is a notable performance gap between open and closed models.

+In our initial fine-tuning, our goal was to assess how closely open models can approach the performance of closed models. By fine-tuning Video-LLaVA, we achieved performance comparable to that of Claude 3 (Opus).
## Results
| Model | Average | Character and relationship dynamics | Narrative and Plot Analysis | Setting and Technical Analysis | Temporal | Theme Exploration |
|--------------------------------|---------|-------------------------------------|-----------------------------|--------------------------------|----------|-------------------|
| Gemini 1.5 Flash | 57.52 | 61.91 | 69.15 | 54.86 | 41.34 | 61.22 |
| Gemini Pro Vision | 50.64 | 54.16 | 65.5 | 46.97 | 35.8 | 58.82 |
| Claude 3 (Opus) | 45.6 | 48.89 | 57.88 | 40.73 | 37.65 | 47.89 |
+| **Video LlaVa - this fine-tune** | **44.16** | **45.26** | **45.14** | **46.93** | **32.55** | **49.47** |
| Video LLaVa | 22.51 | 23.11 | 25.92 | 20.69 | 22.38 | 22.63 |
| mPLUG-Owl | 10.57 | 10.65 | 11.04 | 9.18 | 11.89 | 15.05 |
| Video-ChatGPT | 14.55 | 16.02 | 14.83 | 15.54 | 6.88 | 18.86 |
+| MovieChat | 4.61 | 4.95 | 4.29 | 5.23 | 2.48 | 4.21 |
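
To make the "comparable to Claude 3 (Opus)" claim concrete, the following minimal sketch computes per-category differences from three rows of the Results table. The scores are copied from the table; the `score_deltas` helper and the variable names are illustrative and are not part of the repository.

```python
# Per-category CinePile accuracy (%) copied from the Results table.
CATEGORIES = [
    "Average",
    "Character and relationship dynamics",
    "Narrative and Plot Analysis",
    "Setting and Technical Analysis",
    "Temporal",
    "Theme Exploration",
]

SCORES = {
    "Claude 3 (Opus)": [45.6, 48.89, 57.88, 40.73, 37.65, 47.89],
    "Video LlaVa - this fine-tune": [44.16, 45.26, 45.14, 46.93, 32.55, 49.47],
    "Video LLaVa": [22.51, 23.11, 25.92, 20.69, 22.38, 22.63],
}


def score_deltas(model: str, reference: str) -> dict[str, float]:
    """Return model-minus-reference accuracy per category, in points."""
    return {
        category: round(a - b, 2)
        for category, a, b in zip(CATEGORIES, SCORES[model], SCORES[reference])
    }


# Hypothetical names used only for this illustration.
vs_base = score_deltas("Video LlaVa - this fine-tune", "Video LLaVa")
vs_claude = score_deltas("Video LlaVa - this fine-tune", "Claude 3 (Opus)")

for category in CATEGORIES:
    print(f"{category}: {vs_base[category]:+.2f} vs base, "
          f"{vs_claude[category]:+.2f} vs Claude 3 (Opus)")
```

On these numbers the fine-tune gains 21.65 points on average over the base model and trails Claude 3 (Opus) by 1.44 points on average, while scoring higher than it on Setting and Technical Analysis and on Theme Exploration.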
+
+
+
+
+Model fine-tuned from [Video-LLaVA](https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf) to evaluate its performance on CinePile.
+
+
+
+## Model Sources
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [GitHub](https://github.com/mfarre/Video-LLaVA-7B-hf-CinePile) with fine-tuning and inference notebook.
+## Uses
+
+
+Although the model can answer questions based on the content, it is specifically optimized for addressing CinePile-related queries.
+When the questions do not follow a CinePile-specific prompt, the inference section of the notebook is designed to refine and clean up the text produced by the model.
+
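
The Uses section mentions that the notebook cleans up the text produced by the model. The snippet below is only a rough sketch of what such post-processing could look like, not the notebook's actual code. It assumes the decoded output echoes the prompt and that the answer follows an `ASSISTANT:` marker; both the marker and the `extract_answer` helper are assumptions made for illustration.

```python
import re


def extract_answer(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Keep the text after the last marker and collapse runs of whitespace.

    The marker is an assumption about the prompt template; if it is absent,
    the whole string is treated as the answer.
    """
    _, _, tail = generated_text.rpartition(marker)
    return re.sub(r"\s+", " ", tail).strip()


# Hypothetical decoded output, used only to show the call.
raw_output = "USER: <video>What is the scene about? ASSISTANT:  A family\n dinner. "
print(extract_answer(raw_output))
```

Taking the text after the last marker, rather than the first, keeps the helper usable if the prompt itself happens to contain the marker string.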