sberbank-ai committed on
Commit
eaace90
β€’
1 Parent(s): fade180

Update README.md

Files changed (1)
  1. README.md +25 -19
README.md CHANGED
@@ -1,41 +1,47 @@
  # RUDOLPH-2.7B (XL)

- RUDOLPH: One Hyper-Tasking Transformer can be creative as DALL-E and smart as CLIP

- <img src="https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/RUDOLPH.png" height="60" border="2"/>

- Model was trained by [Sber AI](https://github.com/ai-forever) and [AIRI](https://airi.net) teams.
- * Task: `text2image generation`; `self reranking`; `text ranking`; `image ranking`; `image2text generation`; `zero-shot image classification`; `text2text generation`; `text-qa`; `math-qa`; `image captioning`; `image generation`; `text-in-the-wild`; `vqa`
- * Language: `Russian`
- * Type: `decoder`
- * Num Parameters: `2.7B`
- * Training Data Volume: `119 million text-image pairs; 60 million text paragraphs`
- * Fine-tuning Data Volume: `43 334 text question-answer pairs; 100 000 math tasks; 85 000 text-image pairs (for captioning, generation); 85 759 visual question-answer pairs; 140 000 image-text pairs for text recognition`

  # Model Description

- **RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-Tasking (RUDOLPH) 2.7B is the largest text-image-text transformer designed for an easy fine-tuning setup for the solution of various tasks: from generating images by text description and image classification to visual question answering and more. This model demonstrates the power of Hyper-modality Transformers.

- *(!!!) Hyper-Tasking means generalized Multi-Tasking, e.g., the model that can solve almost all tasks within supported modalities (two modalities in case of RUDOLPH: images and Russian texts).

- This is a fine-tuned version of the pre-trained RuDOLPH 2.7B model.

- The model was prepared as a baseline for AI Journey 2022 (AIJ2) fine-tuned using 6 tasks:

  * Text QA – SberQuAD dataset.
  * Math QA – DeepMind Mathematics Dataset.
- * Captioning – COCO dataset.
  * VQA – COCO dataset with prepared question set.
- * Generation – COCO dataset.
- * Text-in-the-wild – synthesized data.

- # Sparse Attention Mask

- The primary proposed method is to modify the sparse transformer's attention mask to better control multi-modalities and up to the next level with "hyper-modality". It allows us to calculate the transitions of modalities in both directions, unlike another similar work DALL-E Transformer, which used only one direction, "text to image". The proposed "image to right text" direction is achieved by extension sparse attention mask to the right for auto-repressively text generation with both image and left text condition.

  ![rudolph27b_masks.png](https://s3.amazonaws.com/moonup/production/uploads/1663662426135-5f91b1208a61a359f44e1851.png)

  # Authors

- + Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)

  # RUDOLPH-2.7B (XL)

+ RUDOLPH: One Hyper-Tasking Transformer Can Be as Creative as DALL-E and GPT-3 and as Smart as CLIP

+ <img src="https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/RUDOLPH.png" height="50" border="2"/>

+ This is a fine-tuned version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B). The model was trained by the [Sber AI](https://github.com/ai-forever) and [AIRI](https://airi.net) teams.

  # Model Description

+ **RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-tasking (RUDOLPH) 2.7B is the largest text-image-text transformer, designed for easy fine-tuning on a number of tasks: from generating images from text descriptions and image classification to visual question answering and more. This model demonstrates the power of Hyper-tasking Transformers.

+ *Hyper-tasking means generalized multi-tasking, i.e., a model that can solve almost all tasks within its supported modalities (two in the case of RUDOLPH: images and Russian texts).*

+ * Tasks: `text2image generation`, `self reranking`, `text ranking`, `image ranking`, `image2text generation`, `zero-shot image classification`, `text2text generation`, `text-qa`, `math-qa`, `image captioning`, `image generation`, `text-in-the-wild`, `vqa`, and so on
+ * Language: `Russian`
+ * Type: `decoder`
+ * Num Parameters: `2.7B`
+ * Training Data Volume: `119 million text-image pairs`, `60 million text paragraphs`
+ * Fine-tuning Data Volume: `43 334 text question-answer pairs`, `100 000 math tasks`, `85 000 text-image pairs (for captioning, generation)`, `85 759 visual question-answer pairs`, `140 000 image-text pairs for text recognition`
 
+ The model was prepared as a baseline for FusionBrain Challenge 2.0 (part of the AI Journey Contest 2022) and is a fine-tuned version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B) on 6 tasks:

  * Text QA – SberQuAD dataset.
  * Math QA – DeepMind Mathematics Dataset.
+ * Image Captioning – COCO dataset (with automated translation).
+ * Image Generation – COCO dataset (with automated translation).
  * VQA – COCO dataset with prepared question set.
+ * Text Recognition in the Wild – synthesized data.
+
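Hyper-tasking rests on the fact that every task above reduces to left-to-right sequence generation in a single decoder: the task is encoded in the left text (plus image tokens for visual tasks) and the answer is produced as right text. The prompt templates below are purely hypothetical illustrations, not the templates actually used in training; they only sketch that unified framing:

```python
# Purely illustrative, hypothetical prompt templates (NOT the ones used in
# training): they sketch how one decoder can host many tasks by framing each
# task as conditional left-to-right sequence generation.
TASK_TEMPLATES = {
    "text_qa": "Вопрос: {question} Ответ:",    # answer generated as right text
    "math_qa": "Задача: {question} Ответ:",    # same format, math domain
    "captioning": "Описание изображения:",     # conditioned on image tokens
    "vqa": "{question} Ответ:",                # image tokens + question
}

def build_prompt(task: str, **fields: str) -> str:
    """Render the left-text part of the sequence for a given task."""
    return TASK_TEMPLATES[task].format(**fields)
```

In the real model the left text is tokenized and concatenated with image codebook tokens; this sketch shows only the task-as-prompt idea.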
+ # Details of Architecture
+
+ ## Parameters
+

+ ## Sparse Attention Mask

+ The primary proposed method is to modify the sparse transformer's attention mask to better control modalities. It lets the model handle transitions between modalities in both directions, unlike the similar DALL-E Transformer, which supports only one direction, "text to image". The proposed "image to right text" direction is achieved by extending the sparse attention mask to the right, enabling auto-regressive text generation conditioned on both the image and the left text.

  ![rudolph27b_masks.png](https://s3.amazonaws.com/moonup/production/uploads/1663662426135-5f91b1208a61a359f44e1851.png)
 
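As a toy illustration of that layout (a dense causal simplification that ignores the sparse row/column patterns inside the image block; the span sizes are arbitrary), ordering the sequence as [left text | image | right text] under causal attention already yields both directions: image tokens condition on the left text, and right-text tokens condition on both:

```python
def causal_mask_text_image_text(n_left: int, n_img: int, n_right: int) -> list[list[int]]:
    """Toy attention mask over the sequence [left text | image | right text].

    mask[i][j] == 1 means position i may attend to position j.
    With strictly causal (lower-triangular) attention, image tokens see the
    left text ("text to image") and right-text tokens see both the image and
    the left text ("image to right text"), while the left text sees no image.
    """
    n = n_left + n_img + n_right
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = causal_mask_text_image_text(n_left=3, n_img=4, n_right=2)
assert mask[7][3] == 1  # right-text position 7 attends to image position 3
assert mask[2][3] == 0  # left-text position 2 cannot peek at the image
```

RUDOLPH's actual mask additionally applies DALL-E-style sparse patterns within the image block, as shown in the figure above.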
  # Authors

+ + Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)
+ + Liza Goncharova:
+ + Nastia Maltseva: