
Kuldeep Singh Sidhu

singhsidhukuldeep

Posts 17

Remember Gemini and GPT-4o being presented as true multimodal models?

Now we have a paper describing an architecture that might actually achieve that!

Uni-MoE: a native multimodal, Unified Mixture of Experts (MoE) architecture.

Uni-MoE integrates multiple modalities (text, image, audio, video, speech) using modality-specific encoders and connectors for cohesive multimodal understanding.
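
To make the encoder-plus-connector idea concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code) of projecting per-modality features into a shared token space; the dimensions and connector design are assumptions.

```python
# Illustrative sketch: each modality has its own encoder; a small "connector"
# MLP projects its features into the language model's token embedding space.
import torch
import torch.nn as nn

D_MODEL = 512  # toy hidden size for the sketch; the real LLM dimension differs

class Connector(nn.Module):
    """Projects modality-specific encoder features into the shared LLM space."""
    def __init__(self, in_dim: int, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, in_dim)
        return self.proj(feats)                               # (B, T, d_model)

# One connector per modality; the input dims are made-up placeholders.
connectors = nn.ModuleDict({
    "image":  Connector(in_dim=1024),  # e.g. CLIP ViT-L/14 features
    "speech": Connector(in_dim=1280),  # e.g. Whisper encoder features
    "audio":  Connector(in_dim=768),   # e.g. BEATs features
})

def build_multimodal_sequence(text_embeds, modality_feats):
    """Concatenate projected modality tokens with the text token embeddings."""
    parts = [connectors[name](feats) for name, feats in modality_feats.items()]
    return torch.cat(parts + [text_embeds], dim=1)  # (B, T_total, d_model)

# Toy usage with random tensors standing in for real encoder outputs.
seq = build_multimodal_sequence(
    text_embeds=torch.randn(1, 16, D_MODEL),
    modality_feats={"image": torch.randn(1, 256, 1024),
                    "speech": torch.randn(1, 100, 1280)},
)
print(seq.shape)  # torch.Size([1, 372, 512])
```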

Training Strategy (a staged-training sketch in PyTorch follows this list):
1. Cross-modality alignment: train the diverse connectors that align each modality with the language model.
2. Modality-specific experts: train them on cross-modality instruction data.
3. Unified tuning: fine-tune the whole Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal data.

Technical Details:

Modality-Specific Encoders: CLIP for images, Whisper for speech, BEATs for audio.

MoE-Based Blocks: shared self-attention layers, feed-forward network (FFN) experts, and sparse routers for token-level expert allocation (a minimal sketch of such a block follows this list).

Efficient Training: LoRA is used to fine-tune the pre-trained experts and the self-attention layers.
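
As an illustration of the MoE-based block described above (my own sketch, not the authors' code), the snippet below implements token-level top-k routing over FFN experts behind a shared self-attention layer:

```python
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Token-level top-k routing over FFN experts (illustrative sketch)."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # sparse router
        self.top_k = top_k

    def forward(self, x):                                   # x: (B, T, d_model)
        B, T, D = x.shape
        tokens = x.reshape(B * T, D)
        logits = self.router(tokens)                        # (N, num_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalise top-k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out.reshape(B, T, D)

class MoETransformerBlock(nn.Module):
    """Shared self-attention followed by the sparse-MoE FFN."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.moe = SparseMoEFFN(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.moe(self.norm2(x))

# Toy forward pass on random "multimodal" tokens.
block = MoETransformerBlock()
print(block(torch.randn(2, 372, 512)).shape)  # torch.Size([2, 372, 512])
```

With top_k=2, each token is processed by only two of the experts, which is what keeps a sparse MoE cheaper at inference than a dense model of the same total parameter count.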

Uni-MoE outperforms traditional dense models on benchmarks such as A-OKVQA, OK-VQA, VQAv2, MMBench, RACE-Audio, and the English High School Listening Test.

The code is open-sourced as well: https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/Uni_MoE_v2

Paper: Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts (arXiv:2405.11273)
Falcon has landed... again!
And now it not only reads but sees as well.

Here is a summary of the Falcon-11B-VLM model:

Model Type: Causal decoder-only model.

Parameters: 11 billion.

Vision Integration: Couples the pretrained CLIP ViT-L/14 vision encoder with the recently released, chat-finetuned Falcon2-11B model, trained on image-text data.

Training: Pretrained on over 5,000 billion tokens from RefinedWeb together with curated corpora.

Dynamic Encoding: Enhances perception of fine-grained details in images (see the tiling sketch after this list).

Training Hardware: 16 A100 80GB GPUs with ZeRO and Flash-Attention 2.

Tokenizer: Falcon-7B/11B tokenizer.

Languages Supported: Primarily English, with capabilities in German, Spanish, French, Italian, Dutch, Romanian, Czech, Swedish, and more.

License: Open source under the TII Falcon License 2.0, based on Apache 2.0.
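
"Dynamic encoding" of high-resolution images is commonly implemented by splitting the image into fixed-size tiles plus a downscaled global view; the sketch below shows only that generic idea, under assumed tile sizes, and is not the model's documented preprocessing pipeline.

```python
from PIL import Image

TILE = 336  # hypothetical tile size, chosen to match a CLIP ViT-L/14 input resolution

def tile_image(img: Image.Image, tile: int = TILE):
    """Split a high-resolution image into fixed-size tiles plus a global thumbnail.

    Generic illustration of tile-based "dynamic" encoding; the actual
    Falcon-11B-VLM preprocessing may differ.
    """
    w, h = img.size
    grid_w, grid_h = -(-w // tile), -(-h // tile)           # ceil division
    canvas = img.resize((grid_w * tile, grid_h * tile))     # pad up to the tile grid
    tiles = [canvas.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
             for y in range(grid_h) for x in range(grid_w)]
    thumbnail = img.resize((tile, tile))                    # coarse global view
    return [thumbnail] + tiles

# Example: a 1000x700 image yields one thumbnail plus a 3x3 grid of 336px tiles.
views = tile_image(Image.new("RGB", (1000, 700)))
print(len(views))  # 10
```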

Model: tiiuae/falcon-11B-vlm
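
For a quick test drive, here is a minimal inference sketch using the generic transformers Auto classes; the exact model/processor classes and the expected prompt template should be taken from the model card, so the prompt string and image URL below are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "tiiuae/falcon-11B-vlm"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any RGB image works; this URL is only a placeholder example.
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    stream=True).raw)

# The prompt format here is an assumption; check the model card for the exact template.
prompt = "User:<image>\nDescribe this image in one sentence. Falcon:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```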
