Update README.md
README.md
CHANGED
@@ -6,7 +6,8 @@ license: apache-2.0
 **OmniFusion** is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D and video content.
 
 ### ChangeLog
-
+
+[10/04/2024] OmniFusion-1.1 [weights](https://huggingface.co/AIRI-Institute/OmniFusion/tree/main/OmniMistral-v1_1) uploaded. The new model can speak Russian
 
 [01/04/2024] Model training [source code](https://github.com/AIRI-Institute/OmniFusion/tree/main/OmniFusion/train_src) for OmniFusion-1.1 released
 
@@ -37,24 +38,25 @@ To further enhance the model's multimodal capabilities, we employ trainable spec
 
 ### Results
 
-OmniFusion was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like
+OmniFusion-1.1 was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like Text-VQA.
 <p align="left">
-<img src="https://
+<img src="https://github.com/AIRI-Institute/OmniFusion/blob/main/content/radar_plot_gigachat.png" width="70%">
 </p>
 
-Model Performance on Visual Dialog Benchmark
-
-| Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
-| ------------ | ---- | ---- | -------- | -------- | --------- |
-| OmniFusion | 25.91| 10.78| 4.74 | 13.80 | 20.53 |
-| LLaVA-13B | 24.74| 8.91 | 2.98 | 10.80 | 18.02 |
 
-
+**OmniFusion-1.1** (Mistral version) results (April, 2024 update):
 | Model | textvqa| scienceqa | pope | gqa | ok_vqa |
 | -------------------------------------- | ------ | ---------- | --------- | -------- | ------- |
 | OmniFusion-1.1 (one encoder, Mistral) | **0.4893** | **0.6802** | 0.7818 | 0.4600 | 0.5187 |
 | OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | **0.8153** | **0.4761** | **0.5317** |
 
+OmniFusion-1 (previous Mistral version) Performance on Visual Dialog Benchmark
+
+| Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
+| ------------ | ---- | ---- | -------- | -------- | --------- |
+| OmniFusion | 25.91| 10.78| 4.74 | 13.80 | 20.53 |
+| LLaVA-13B | 24.74| 8.91 | 2.98 | 10.80 | 18.02 |
+
 ### Examples
 
 <p align="left">
@@ -156,7 +158,7 @@ print(answer)
 
 ### Future Plans
 
-Work is underway on a version that
+Work is underway on a version that uses ImageBind encoders and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!
 
 ### Authors
 
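The Visual Dialog table in the diff above reports MRR and Recall@k. For readers unfamiliar with these ranking metrics, here is a minimal sketch of how they are commonly computed. It assumes each dialog turn yields the 1-based rank of the ground-truth answer among the candidate answers. The function names and the sample ranks are hypothetical, and this is not the evaluation code used by the OmniFusion authors. NDCG is omitted because it additionally needs per-candidate relevance scores.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all turns (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of turns whose ground-truth answer is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct answer for five dialog turns.
example_ranks = [1, 3, 12, 5, 2]

mrr = mean_reciprocal_rank(example_ranks)
recall_1 = recall_at_k(example_ranks, 1)
recall_5 = recall_at_k(example_ranks, 5)
recall_10 = recall_at_k(example_ranks, 10)

# Benchmark tables often report these fractions scaled by 100.
print(f"MRR={100 * mrr:.2f} R@1={100 * recall_1:.2f} "
      f"R@5={100 * recall_5:.2f} R@10={100 * recall_10:.2f}")
```

Higher is better for both metrics. Recall@k can only stay the same or grow as k increases, which matches the pattern in the table's Recall@1, Recall@5, and Recall@10 columns.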