Remember Gemini and GPT-4o both being true multimodal models?
Now we have a paper describing an architecture that might achieve that!
Uni-MoE: a native multimodal, Unified Mixture of Experts (MoE) architecture.
Uni-MoE integrates various modalities (text, image, audio, video, speech) using modality-specific encoders and connectors for a cohesive multimodal understanding.
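To make the encoder-plus-connector idea concrete, here is a minimal PyTorch sketch (not the authors' code; the dimensions and module names are made up for illustration): each modality encoder outputs features, a small connector MLP projects them into the LLM's token embedding space, and the projected tokens are concatenated with the text embeddings before the LLM blocks.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Projects encoder features into the LLM's token embedding space.
    Hypothetical stand-in for the paper's modality connectors."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches_or_frames, enc_dim) -> (batch, n, llm_dim)
        return self.proj(feats)

# Assumed dimensions, for illustration only.
image_connector = Connector(enc_dim=1024, llm_dim=4096)  # e.g. CLIP features
audio_connector = Connector(enc_dim=768, llm_dim=4096)   # e.g. BEATs features

image_feats = torch.randn(1, 256, 1024)  # placeholder CLIP output
audio_feats = torch.randn(1, 100, 768)   # placeholder BEATs output
text_embeds = torch.randn(1, 32, 4096)   # placeholder text token embeddings

# Concatenate projected modality tokens with text tokens before the LLM layers.
llm_input = torch.cat(
    [image_connector(image_feats), audio_connector(audio_feats), text_embeds],
    dim=1,
)
print(llm_input.shape)  # torch.Size([1, 388, 4096])
```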
Training Strategy:
1. Training cross-modality alignment with diverse connectors.
2. Training modality-specific experts using cross-modality instruction data.
3. Tuning the whole Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal data (see the LoRA sketch after this list).
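As a rough idea of what step 3 means in practice, here is a hand-rolled LoRA layer in PyTorch (an illustrative sketch, not the paper's implementation): the pre-trained weight is frozen and only a low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (W + B @ A).
    Hand-rolled illustration of LoRA, not the authors' code."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pre-trained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Wrap a (pretend) pre-trained expert projection and train only the adapters.
expert_proj = nn.Linear(4096, 11008)
tuned = LoRALinear(expert_proj, r=8)
trainable = [p for p in tuned.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # only the low-rank A/B weights train
```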
Technical Details:
Modality-Specific Encoders: CLIP for images, Whisper for speech, BEATs for audio.
MoE-Based Blocks: shared self-attention layers, feed-forward network (FFN) experts, and sparse routers for token-level expert allocation (a router sketch follows below).
Efficient Training: uses LoRA to fine-tune the pre-trained experts and self-attention layers.
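And here is a simplified token-level top-k router over FFN experts, in the spirit of the MoE blocks above (a sketch under standard top-k routing assumptions; it omits load balancing, capacity limits, and the shared attention layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-level top-k routing over FFN experts (simplified sketch)."""
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> flatten so each token is routed independently.
        b, s, d = x.shape
        tokens = x.reshape(-1, d)
        logits = self.router(tokens)                    # (tokens, experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # tokens that selected expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(tokens[token_ids])
        return out.reshape(b, s, d)

moe = SparseMoE(dim=512)
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```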
Uni-MoE outperforms traditional dense models on benchmarks such as A-OKVQA, OK-VQA, VQAv2, MMBench, RACE-Audio, and the English High School Listening Test.
The code is open-sourced as well: https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/Uni_MoE_v2
Paper: Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts (2405.11273)