AIRI-Institute/OmniFusion

matveymih committed on Apr 10, 2024

Commit 8f23f2d (verified) · 1 parent: ff8cb47

Update README.md

Files changed (1)

README.md +13 -11

@@ -6,7 +6,8 @@ license: apache-2.0
 **OmniFusion** is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D and video content.
 ### ChangeLog
-[01/04/2024] OmniFusion-1.1 weights are uploaded to [Huggingface](https://huggingface.co/AIRI-Institute/OmniFusion/tree/main/OmniMistral-v1_1). Now the model can speak Russian :)
 [01/04/2024] Model training [source code](https://github.com/AIRI-Institute/OmniFusion/tree/main/OmniFusion/train_src) for OmniFusion-1.1 released
@@ -37,24 +38,25 @@ To further enhance the model's multimodal capabilities, we employ trainable spec
 ### Results
-OmniFusion was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like VisualDialog.
 <p align="left">
-<img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/radar.png" width="70%">
 </p>
-Model Performance on Visual Dialog Benchmark
-| Model        | NDCG | MRR  | Recall@1 | Recall@5 | Recall@10 |
-| ------------ | ---- | ---- | -------- | -------- | --------- |
-| OmniFusion   | 25.91| 10.78| 4.74     | 13.80    | 20.53     |
-| LLaVA-13B    | 24.74| 8.91 | 2.98     | 10.80    | 18.02     |
-Update (April, 2024): OmniFusion-1.1 results:
 | Model                                  | textvqa| scienceqa  | pope      | gqa      | ok_vqa  |
 | -------------------------------------- | ------ | ---------- | --------- | -------- | ------- |
 | OmniFusion-1.1 (one encoder, Mistral)  | **0.4893** | **0.6802**     | 0.7818    | 0.4600   | 0.5187  |
 | OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732     | **0.8153**    | **0.4761**   | **0.5317**  |
 ### Examples
 <p align="left">
@@ -156,7 +158,7 @@ print(answer)
 ### Future Plans
-Work is underway on a version that understands Russian, uses ImageBind encoders, and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!
 ### Authors

 **OmniFusion** is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D and video content.
 ### ChangeLog
+[10/04/2024] OmniFusion-1.1 [weights](https://huggingface.co/AIRI-Institute/OmniFusion/tree/main/OmniMistral-v1_1) uploaded. The new model can speak Russian
 [01/04/2024] Model training [source code](https://github.com/AIRI-Institute/OmniFusion/tree/main/OmniFusion/train_src) for OmniFusion-1.1 released
 ### Results
+OmniFusion-1.1 was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and classification benchmarks like Text-VQA.
 <p align="left">
+<img src="https://github.com/AIRI-Institute/OmniFusion/blob/main/content/radar_plot_gigachat.png" width="70%">
 </p>
+**OmniFusion-1.1** (Mistral version) results (April, 2024 update):
 | Model                                  | textvqa| scienceqa  | pope      | gqa      | ok_vqa  |
 | -------------------------------------- | ------ | ---------- | --------- | -------- | ------- |
 | OmniFusion-1.1 (one encoder, Mistral)  | **0.4893** | **0.6802**     | 0.7818    | 0.4600   | 0.5187  |
 | OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732     | **0.8153**    | **0.4761**   | **0.5317**  |
+OmniFusion-1 (previous Mistral version) Performance on Visual Dialog Benchmark
+| Model        | NDCG | MRR  | Recall@1 | Recall@5 | Recall@10 |
+| ------------ | ---- | ---- | -------- | -------- | --------- |
+| OmniFusion   | 25.91| 10.78| 4.74     | 13.80    | 20.53     |
+| LLaVA-13B    | 24.74| 8.91 | 2.98     | 10.80    | 18.02     |
 ### Examples
 <p align="left">
 ### Future Plans
+Work is underway on a version that uses ImageBind encoders and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!
 ### Authors