sberbank-ai committed on
Commit 48e03de
•
1 Parent(s): 79530f1

Update README.md

Files changed (1)
  1. README.md +19 -11
README.md CHANGED
@@ -6,10 +6,9 @@ RUDOLPH: One Hyper-Tasking Transformer Can be Creative as DALL-E and GPT-3 and S
 
 This is a fine-tuned version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B). The model was trained by the [Sber AI](https://github.com/ai-forever) and [AIRI](https://airi.net) teams.
 
-
 # Model Description
 
- **RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-tasking (**RUDOLPH**) **2.7B** is the largest text-image-text transformer designed for easy fine-tuning on a number of tasks, from generating images from text descriptions and classifying images to visual question answering and more. This model demonstrates the power of Hyper-tasking Transformers.
 
 *Hyper-tasking means generalized multi-tasking, i.e., a model that can solve almost all tasks within its supported modalities (two in the case of RUDOLPH: images and Russian texts).*
 
@@ -22,26 +21,35 @@ This is a fine-tuned version of the pre-trained [RuDOLPH 2.7B model](https://hug
 
 The model was prepared as a baseline for FusionBrain Challenge 2.0 (as part of AI Journey Contest 2022) and is a version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B) fine-tuned on 6 tasks:
 
- * Text QA – SberQUaD dataset.
- * Math QA – DeepMind Mathematics Dataset.
- * Image Captioning – COCO dataset (with automated translation).
- * Image Generation – COCO dataset (with automated translation).
- * VQA – COCO dataset with prepared question set.
- * Text Recognition in the Wild
 
 # Details of architecture
 
 ### Parameters
 
 ### Sparse Attention Mask
 
 The primary proposed method is to modify the sparse transformer's attention mask to better control modalities. It allows the model to handle transitions between modalities in both directions, unlike the similar DALL-E Transformer, which used only one direction, "text to image". The proposed "image to right text" direction is achieved by extending the sparse attention mask to the right for auto-regressive text generation conditioned on both the image and the left text.
 
- ![rudolph27b_masks.png](https://s3.amazonaws.com/moonup/production/uploads/1663662426135-5f91b1208a61a359f44e1851.png)
 
 # Authors
 
 + Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)
- + Liza Goncharova:
- + Nastia Maltseva:
 
 
 This is a fine-tuned version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B). The model was trained by the [Sber AI](https://github.com/ai-forever) and [AIRI](https://airi.net) teams.
 
 # Model Description
 
+ **RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-tasking (**RUDOLPH**) **2.7B** is the largest text-image-text transformer designed for easy fine-tuning on a range of tasks, from generating images from text descriptions and classifying images to visual question answering and more. This model demonstrates the power of Hyper-tasking Transformers.
 
 *Hyper-tasking means generalized multi-tasking, i.e., a model that can solve almost all tasks within its supported modalities (two in the case of RUDOLPH: images and Russian texts).*
 
 
 The model was prepared as a baseline for FusionBrain Challenge 2.0 (as part of AI Journey Contest 2022) and is a version of the pre-trained [RuDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RuDOLPH-2.7B) fine-tuned on 6 tasks:
 
+ * Text QA – [SberQuAD dataset](https://huggingface.co/datasets/sberquad).
+ * Math QA – [DeepMind Mathematics Dataset](https://github.com/deepmind/mathematics_dataset).
+ * Image Captioning – [COCO dataset](https://cocodataset.org/#home) (with automated translation).
+ * Image Generation – [COCO dataset](https://cocodataset.org/#home) (with automated translation).
+ * VQA – [COCO dataset](https://cocodataset.org/#home) with a prepared question set.
+ * Text Recognition in the Wild – a dataset of synthetic and real-world human-annotated data for the text recognition task.
 
 # Details of architecture
 
 ### Parameters
 
+ ![rudolph27b_masks.png](https://s3.amazonaws.com/moonup/production/uploads/1663662426135-5f91b1208a61a359f44e1851.png)
+
+ The maximum sequence length depends on the modality: 384 tokens for the left text, 576 for the image, and 128 for the right text.
+
+ RUDOLPH 2.7B is a Transformer-based decoder model with the following parameters:
+
+ * num-layers (32) – Number of hidden layers in the Transformer decoder.
+ * hidden-size (2560) – Dimensionality of the hidden layers.
+ * num_attention_heads (32) – Number of attention heads for each attention layer.
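
As a rough sanity check on the model size, the parameters above can be combined into a back-of-the-envelope estimate. This is a sketch, not the authors' exact accounting: it assumes a standard Transformer layer and ignores embeddings, biases, and layer norms.

```python
# Rough parameter estimate for the decoder stack described above.
# Assumes a standard Transformer layer: ~4*d^2 attention weights
# (Q, K, V, and output projections) plus ~8*d^2 feed-forward weights
# (two matrices with a 4*d intermediate size).

num_layers = 32            # hidden layers in the Transformer decoder
hidden_size = 2560         # dimensionality of the hidden layers
num_attention_heads = 32   # attention heads per layer

head_dim = hidden_size // num_attention_heads  # 80 dims per head

per_layer = 4 * hidden_size**2 + 8 * hidden_size**2  # attention + MLP
decoder_params = num_layers * per_layer

print(head_dim)                        # 80
print(round(decoder_params / 1e9, 2))  # 2.52 (billion)
```

Embeddings and the remaining small matrices bring the total close to the advertised 2.7B.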
 
 ### Sparse Attention Mask
 
 The primary proposed method is to modify the sparse transformer's attention mask to better control modalities. It allows the model to handle transitions between modalities in both directions, unlike the similar DALL-E Transformer, which used only one direction, "text to image". The proposed "image to right text" direction is achieved by extending the sparse attention mask to the right for auto-regressive text generation conditioned on both the image and the left text.
 
+ <img src="https://github.com/lizagonch/ru-dolph-fbc2/blob/develop_v1/pics/scheme-rudolph_27B.jpg">
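
To illustrate the mask geometry (a minimal sketch, not the authors' exact sparse pattern, which is additionally sparsified within the image block), even a plain causal mask over the concatenated left-text, image, and right-text segments exhibits both transition directions described above:

```python
# Hypothetical illustration: a causal attention mask over the
# concatenated [left text | image | right text] sequence.
# Segment lengths follow the model card: 384 + 576 + 128 positions.

LEFT_TEXT, IMAGE, RIGHT_TEXT = 384, 576, 128
SEQ_LEN = LEFT_TEXT + IMAGE + RIGHT_TEXT  # 1088 positions

# mask[i][j] == 1 means position i may attend to position j.
mask = [[1 if j <= i else 0 for j in range(SEQ_LEN)] for i in range(SEQ_LEN)]

# "text to image": the first image token conditions on the whole left text.
first_image = mask[LEFT_TEXT]
assert all(first_image[j] == 1 for j in range(LEFT_TEXT))

# "image to right text": the first right-text token conditions on the
# left text AND the image, enabling auto-regressive caption/answer generation.
first_right = mask[LEFT_TEXT + IMAGE]
assert all(first_right[j] == 1 for j in range(LEFT_TEXT + IMAGE))
```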
 
 # Authors
 
 + Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)
+ + Liza Goncharova: [Github](https://github.com/lizagonch)
+ + Nastya Maltseva: [Github](https://github.com/NastyaMittseva)