mfarre (HF staff) committed 9f23e71 (parent: 8666404): Update README.md

Files changed (1): README.md (+24, −18)
---
license: apache-2.0
---

# Model Card for Video-LLaVA - CinePile fine-tune

Video multimodal research often emphasizes activity recognition and object-centered tasks, such as determining "what is the person wearing a red hat doing?". This focus, however, overlooks areas such as theme exploration, narrative and plot analysis, and character and relationship dynamics.
CinePile's benchmark addresses these areas, and its authors find that large language models significantly lag behind human performance on such tasks. There is also a notable performance gap between open and closed models.

In this initial fine-tune, our goal was to assess how closely open models can approach the performance of closed models. By fine-tuning Video-LLaVA, we achieved performance comparable to Claude 3 (Opus).

## Results

Extending CinePile's model evaluations ([arXiv:2405.08813](https://arxiv.org/abs/2405.08813)):

| Model | Average | Character and relationship dynamics | Narrative and Plot Analysis | Setting and Technical Analysis | Temporal | Theme Exploration |
|--------------------------------|---------|-------------------------------------|-----------------------------|--------------------------------|----------|-------------------|
 
| Gemini 1.5 Flash | 57.52 | 61.91 | 69.15 | 54.86 | 41.34 | 61.22 |
| Gemini Pro Vision | 50.64 | 54.16 | 65.5 | 46.97 | 35.8 | 58.82 |
| Claude 3 (Opus) | 45.6 | 48.89 | 57.88 | 40.73 | 37.65 | 47.89 |
| **Video-LLaVA - this fine-tune** | **44.16** | **45.26** | **45.14** | **46.93** | **32.55** | **49.47** |
| Video LLaVA | 22.51 | 23.11 | 25.92 | 20.69 | 22.38 | 22.63 |
| mPLUG-Owl | 10.57 | 10.65 | 11.04 | 9.18 | 11.89 | 15.05 |
| Video-ChatGPT | 14.55 | 16.02 | 14.83 | 15.54 | 6.88 | 18.86 |
| MovieChat | 4.61 | 4.95 | 4.29 | 5.23 | 2.48 | 4.21 |
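The headline comparison can be sanity-checked directly from the Average column (scores copied from the table; a minimal sketch, not part of the original evaluation code):

```python
# Average CinePile scores, copied from the "Average" column of the table above.
averages = {
    "Claude 3 (Opus)": 45.6,
    "Video-LLaVA - CinePile fine-tune": 44.16,
    "Video LLaVA (base)": 22.51,
}

# The fine-tune roughly doubles the base model's average score,
gain_over_base = averages["Video-LLaVA - CinePile fine-tune"] - averages["Video LLaVA (base)"]
# while landing within ~1.5 points of Claude 3 (Opus).
gap_to_claude = averages["Claude 3 (Opus)"] - averages["Video-LLaVA - CinePile fine-tune"]
print(f"gain over base: {gain_over_base:.2f}, gap to Claude 3 (Opus): {gap_to_claude:.2f}")
```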

Fine-tuned model based on [Video-LLaVA](https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf), built to evaluate its performance on CinePile.

## Model Sources

- **Repository:** [GitHub](https://github.com/mfarre/Video-LLaVA-7B-hf-CinePile), with fine-tuning and inference notebooks.
## Uses

Although the model can answer general questions about video content, it is specifically optimized for CinePile-style queries.
When a question does not follow a CinePile-specific prompt, the inference section of the notebook refines and cleans up the text produced by the model.