---
tags:
- RUDOLPH
- text-image
- image-text
- decoder
datasets:
- sberquad
---
# RUDOLPH-2.7B-FBC2 (XL)
RUDOLPH: One Hyper-Tasking Transformer Can be Creative as DALL-E and GPT-3 and Smart as CLIP
<img src="https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/RUDOLPH.png" width=60% border="2"/>
This is a fine-tuned version of the pre-trained [RUDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RUDOLPH-2.7B). The model was trained by the [Sber AI](https://github.com/ai-forever) and [AIRI](https://airi.net) teams.
# Model Description
**RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-tasking (**RUDOLPH**) **2.7B** is the largest text-image-text transformer designed for easy fine-tuning on a range of tasks, from text-to-image generation and image classification to visual question answering and more. This model demonstrates the power of Hyper-tasking Transformers.
*A hyper-tasking model is a generalized multi-tasking model, i.e., a model that can solve almost all tasks within its supported modalities, necessarily including mutual pairwise translation between modalities (two modalities in the case of RUDOLPH: images and Russian texts).*
* Tasks: ` text2image generation, self reranking, text ranking, image ranking, image2text generation, zero-shot image classification, text2text generation, text qa, math qa, image captioning, image generation, text recognition in the wild, visual qa, and so on`
* Language: ` Russian`
* Type: ` decoder`
* Num Parameters: ` 2.7B`
* Training Data Volume: ` 119 million text-image pairs, 60 million text paragraphs`
* Fine-tuning Data Volume: ` 43 334 text question-answer pairs, 100 000 math tasks, 85 000 text-image pairs (for captioning, generation), 85 759 visual question-answer pairs, 140 000 image-text pairs for text recognition`
The model was prepared as a baseline for FusionBrain Challenge 2.0 (part of AI Journey Contest 2022). It is the pre-trained [RUDOLPH 2.7B model](https://huggingface.co/sberbank-ai/RUDOLPH-2.7B) fine-tuned on 6 tasks:
* Text QA: on [SberQUaD dataset](https://huggingface.co/datasets/sberquad).
* Math QA: on [DeepMind Mathematics Dataset](https://github.com/deepmind/mathematics_dataset).
* Image Captioning: on [COCO dataset](https://cocodataset.org/#home) translated into Russian (MT).
* Image Generation: on [COCO dataset](https://cocodataset.org/#home) translated into Russian (MT).
* VQA: on [COCO dataset](https://cocodataset.org/#home) with a prepared question set.
* Text Recognition in the Wild: on the [START](https://n-ws-f21jf.s3pd02.sbercloud.ru/b-ws-f21jf-ny6/FBC2/titw_dataset.zip) dataset (**S**yn**T**hesized and **A**nnotated dataset for **T**ext **R**ecognition), which consists of synthetic and real-world human-annotated data for the text recognition task.
# Details of architecture
<img src="https://raw.githubusercontent.com/ai-forever/ru-dolph/master/pics/scheme-rudolph_27B.jpg" height="20" border="2"/>
The maximum sequence length depends on the modality: 384 tokens for the left text, 576 tokens for the image, and 128 tokens for the right text.
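As a rough illustration of these limits, the sketch below computes where each segment would sit in a single token sequence. It assumes the three segments are concatenated in the order left text, image, right text; the constant and function names are hypothetical and are not taken from the RUDOLPH codebase.

```python
# Segment lengths from the model card. The names below are hypothetical.
L_TEXT_LEN = 384   # left text tokens
IMAGE_LEN = 576    # image tokens
R_TEXT_LEN = 128   # right text tokens


def segment_bounds(l_text=L_TEXT_LEN, image=IMAGE_LEN, r_text=R_TEXT_LEN):
    """Return half-open (start, end) index ranges for each segment.

    Assumption: segments are laid out as [left text | image | right text].
    """
    l_end = l_text
    img_end = l_end + image
    r_end = img_end + r_text
    return {
        "left_text": (0, l_end),
        "image": (l_end, img_end),
        "right_text": (img_end, r_end),
    }


bounds = segment_bounds()
total_len = bounds["right_text"][1]  # 384 + 576 + 128
print(bounds, total_len)
```

Under this layout the full sequence is 1088 tokens long, and an input that exceeds any one segment's limit would need to be truncated before being placed into it.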
RUDOLPH 2.7B is a Transformer-based decoder model with the following parameters:
* num\_layers (32) — Number of hidden layers in the Transformer decoder.
* hidden\_size (2560) — Dimensionality of the hidden layers.
* num\_attention\_heads (32) — Number of attention heads for each attention layer.
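To relate these hyperparameters to the reported model size, here is a back-of-the-envelope estimate. It uses the common approximation of 12 × hidden_size² weights per Transformer block (4 × hidden_size² for attention, 8 × hidden_size² for a feed-forward layer with 4× expansion). The 4× expansion factor is an assumption, not something this card states, and embeddings, biases, and layer norms are ignored.

```python
# Hyperparameters from the model card.
NUM_LAYERS = 32
HIDDEN_SIZE = 2560
NUM_ATTENTION_HEADS = 32

# Per-head dimensionality follows directly from the two values above.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def approx_block_params(num_layers, hidden_size):
    """Rough weight count for the Transformer blocks only.

    Assumption: each block has ~4*h^2 attention weights and ~8*h^2
    feed-forward weights (4x expansion). Embeddings, biases, and
    layer norms are not counted.
    """
    return 12 * num_layers * hidden_size ** 2


block_params = approx_block_params(NUM_LAYERS, HIDDEN_SIZE)
print(f"head_dim = {head_dim}")
print(f"approx. block parameters = {block_params / 1e9:.2f}B")
```

This gives roughly 2.5B weights in the blocks alone. The remainder up to the reported 2.7B would presumably come from parameters left out of this estimate.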
# Sparse Attention Masks
The primary proposed method is to modify the sparse transformer's attention mask to better control modalities. This allows the model to handle transitions between modalities in both directions, unlike the similar DALL-E Transformer, which used only one direction, "text to image". The proposed "image to right text" direction is achieved by extending the sparse attention mask to the right, so that text is generated auto-regressively, conditioned on both the image and the left text.
<img src="https://raw.githubusercontent.com/lizagonch/ru-dolph/develop_v1/pics/attention_masks_2700m.png" height="20" border="2"/>
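The "image to right text" idea can be sketched with a deliberately simplified mask. The code below builds a dense causal mask over a toy sequence laid out as left text, image, right text, then reports which segments each segment can attend to. It is not the model's actual mask: the real masks are sparse inside the image block, and the helper names and toy sizes here are hypothetical.

```python
def build_causal_mask(n):
    """mask[q][k] is True when query position q may attend to key position k.

    Simplification: a dense causal mask. The sparse patterns used inside
    the image block of the real model are omitted.
    """
    return [[k <= q for k in range(n)] for q in range(n)]


def visible_segments(mask, bounds):
    """For each query segment, collect the key segments it can attend to."""
    result = {}
    for q_name, (q_start, q_end) in bounds.items():
        seen = set()
        for k_name, (k_start, k_end) in bounds.items():
            if any(mask[q][k]
                   for q in range(q_start, q_end)
                   for k in range(k_start, k_end)):
                seen.add(k_name)
        result[q_name] = seen
    return result


# Toy sizes (hypothetical): 2 left text, 3 image, 2 right text tokens.
toy_bounds = {"left_text": (0, 2), "image": (2, 5), "right_text": (5, 7)}
mask = build_causal_mask(7)
visibility = visible_segments(mask, toy_bounds)
print(visibility)
```

In this toy layout the image tokens are conditioned on the left text ("text to image"), while the right text tokens are conditioned on both the image and the left text, which corresponds to the added "image to right text" direction.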
# Authors
+ Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)
+ Nastya Maltseva: [Github](https://github.com/NastyaMittseva)
+ Liza Goncharova: [Github](https://github.com/lizagonch)
+ Andrey Kuznetsov: [Github](https://github.com/kuznetsoffandrey)
+ Denis Dimitrov: [Github](https://github.com/denndimitrov)