Commit 8f23f2d, committed by matveymih
1 parent: ff8cb47

Update README.md

Files changed (1)
  1. README.md +13 -11
README.md CHANGED
@@ -6,7 +6,8 @@ license: apache-2.0
  **OmniFusion** is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D and video content.

  ### ChangeLog
- [01/04/2024] OmniFusion-1.1 weights are uploaded to [Huggingface](https://huggingface.co/AIRI-Institute/OmniFusion/tree/main/OmniMistral-v1_1). Now the model can speak Russian :)

  [01/04/2024] Model training [source code](https://github.com/AIRI-Institute/OmniFusion/tree/main/OmniFusion/train_src) for OmniFusion-1.1 released
 
@@ -37,24 +38,25 @@ To further enhance the model's multimodal capabilities, we employ trainable spec

  ### Results

- OmniFusion was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like VisualDialog.
  <p align="left">
- <img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/radar.png" width="70%">
  </p>

- Model Performance on Visual Dialog Benchmark
-
- | Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
- | ------------ | ---- | ---- | -------- | -------- | --------- |
- | OmniFusion | 25.91| 10.78| 4.74 | 13.80 | 20.53 |
- | LLaVA-13B | 24.74| 8.91 | 2.98 | 10.80 | 18.02 |

- Update (April, 2024): OmniFusion-1.1 results:
  | Model | textvqa| scienceqa | pope | gqa | ok_vqa |
  | -------------------------------------- | ------ | ---------- | --------- | -------- | ------- |
  | OmniFusion-1.1 (one encoder, Mistral) | **0.4893** | **0.6802** | 0.7818 | 0.4600 | 0.5187 |
  | OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | **0.8153** | **0.4761** | **0.5317** |

  ### Examples

  <p align="left">
@@ -156,7 +158,7 @@ print(answer)

  ### Future Plans

- Work is underway on a version that understands Russian, uses ImageBind encoders, and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!

  ### Authors
 
 
  **OmniFusion** is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D and video content.

  ### ChangeLog
+
+ [10/04/2024] OmniFusion-1.1 [weights](https://huggingface.co/AIRI-Institute/OmniFusion/tree/main/OmniMistral-v1_1) uploaded. The new model can speak Russian

  [01/04/2024] Model training [source code](https://github.com/AIRI-Institute/OmniFusion/tree/main/OmniFusion/train_src) for OmniFusion-1.1 released
 
 

  ### Results

+ OmniFusion-1.1 was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like Text-VQA.
  <p align="left">
+ <img src="https://github.com/AIRI-Institute/OmniFusion/blob/main/content/radar_plot_gigachat.png" width="70%">
  </p>


+ **OmniFusion-1.1** (Mistral version) results (April, 2024 update):
  | Model | textvqa| scienceqa | pope | gqa | ok_vqa |
  | -------------------------------------- | ------ | ---------- | --------- | -------- | ------- |
  | OmniFusion-1.1 (one encoder, Mistral) | **0.4893** | **0.6802** | 0.7818 | 0.4600 | 0.5187 |
  | OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | **0.8153** | **0.4761** | **0.5317** |

+ OmniFusion-1 (previous Mistral version) Performance on Visual Dialog Benchmark
+
+ | Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
+ | ------------ | ---- | ---- | -------- | -------- | --------- |
+ | OmniFusion | 25.91| 10.78| 4.74 | 13.80 | 20.53 |
+ | LLaVA-13B | 24.74| 8.91 | 2.98 | 10.80 | 18.02 |
+
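
The MRR and Recall@k columns of the Visual Dialog table are rank-based retrieval metrics. As a rough illustration only, the sketch below computes them from the 1-based rank of the ground-truth answer for each question. The function names and the sample ranks are hypothetical and are not taken from the OmniFusion evaluation code. NDCG is left out because it needs per-candidate relevance scores. The table values appear to be percentages, so the fractions here would be multiplied by 100 to compare.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all questions (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of questions whose ground-truth answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the ground-truth answer for four questions.
example_ranks = [1, 3, 12, 50]

mrr = mean_reciprocal_rank(example_ranks)
r1 = recall_at_k(example_ranks, 1)
r5 = recall_at_k(example_ranks, 5)
r10 = recall_at_k(example_ranks, 10)
print(f"MRR={mrr:.4f} R@1={r1:.2f} R@5={r5:.2f} R@10={r10:.2f}")
```

Higher is better for both metrics; a model that always ranks the correct answer first would score 1.0 on each.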
  ### Examples

  <p align="left">
 

  ### Future Plans

+ Work is underway on a version that uses ImageBind encoders and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!

  ### Authors