77.2% on MMLU with 3.7B parameters
... 3.7B active parameters, 40B total parameters
7.4 GFLOPs of forward computation per token, 1/19 that of Llama 3 70B
Exciting enough?
That's Yuan2-M32 for you, released by IEIT-Yuan.
A new 40B Mixture-of-Experts model using a novel Attention Router mechanism (a rough sketch of the idea follows below)
32 experts, with 2 active per token during generation
8,192-token context length
Trained on 2T tokens, using only 9.25% of the compute required by a dense model of equivalent scale
Yuan 2.0-M32 is fine-tuned to handle longer sequence lengths, using a modified base value in the Rotary Position Embedding to maintain performance over extended contexts (see the RoPE sketch below)
Open-source under Apache 2.0
Vocabulary size of 135,040
Outperforms Mixtral 8x7B (47B total parameters, 12.9B active parameters) on all benchmarks and almost gives Llama 3 70B a run for its money
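For intuition, here is a minimal, hypothetical sketch of an attention-style MoE router: expert scores are produced by attending over a set of learnable expert embeddings instead of using a single linear gate, and the top-2 experts are kept per token. The projection names, dimensions, and initialization below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class AttentionRouter(nn.Module):
    """Toy attention-style router: scores experts by attending over learnable
    expert embeddings (so correlations between experts influence the gate),
    then keeps the top-k experts per token. Illustrative only."""

    def __init__(self, d_model: int, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.expert_emb = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.q_proj = nn.Linear(d_model, d_model, bias=False)  # query from the token
        self.k_proj = nn.Linear(d_model, d_model, bias=False)  # keys from expert embeddings
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # values from expert embeddings
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq, d_model)
        q = self.q_proj(hidden)                                    # (B, S, D)
        k = self.k_proj(self.expert_emb)                           # (N, D)
        v = self.v_proj(self.expert_emb)                           # (N, D)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, -1)   # (B, S, N)
        logits = (attn @ v) @ self.expert_emb.t()                  # (B, S, N) expert scores
        probs = torch.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)              # 2 of 32 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize the selected pair
        return weights, idx


router = AttentionRouter(d_model=2048, num_experts=32, top_k=2)
weights, idx = router(torch.randn(1, 4, 2048))  # which 2 experts each of 4 tokens uses
```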
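And a quick sketch of the RoPE-base idea: a larger base lowers the per-channel rotation frequencies, so distant positions remain distinguishable at longer contexts. The base values below are placeholders; the actual adjusted value used by Yuan 2.0-M32 is given in the paper.

```python
import torch


def rope_angles(seq_len: int, dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Rotation angles used by Rotary Position Embedding, shape (seq_len, dim // 2).

    Raising `base` lowers the rotation frequencies, the usual knob for stretching
    a model to longer contexts."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)


default_base = rope_angles(seq_len=8192, dim=128, base=10_000.0)
larger_base = rope_angles(seq_len=8192, dim=128, base=1_000_000.0)  # placeholder larger base
```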
Models: https://huggingface.co/IEITYuan
Paper: Yuan 2.0-M32: Mixture of Experts with Attention Router (2405.17976)
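If you want to try it, a rough loading snippet with transformers might look like the one below. The repo id, trust_remote_code flag, and generation settings are assumptions on my part; check the model card under the link above for the exact instructions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IEITYuan/Yuan2-M32-hf"  # assumed repo id; see https://huggingface.co/IEITYuan

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # load in the checkpoint's native precision
    device_map="auto",       # spread across available GPUs (needs `accelerate`)
    trust_remote_code=True,  # custom MoE architecture ships with the repo
)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```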