
Kuldeep Singh Sidhu

singhsidhukuldeep

Posts 17

Remember Gemini and GPT-4o being presented as true multimodal models?

Now we have a paper describing an architecture that might actually achieve that!

Uni-MoE: a native multimodal, Unified Mixture of Experts (MoE) architecture.

Uni-MoE integrates multiple modalities (text, image, audio, video, speech) using modality-specific encoders and connectors for cohesive multimodal understanding.
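
To make the encoder-plus-connector idea concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code) of projecting per-modality features into a shared token space; the dimensions and connector design are assumptions.

```python
# Illustrative sketch: each modality has its own encoder; a small "connector"
# MLP projects its features into the language model's token embedding space.
import torch
import torch.nn as nn

D_MODEL = 512  # toy hidden size for the sketch; the real LLM dimension differs

class Connector(nn.Module):
    """Projects modality-specific encoder features into the shared LLM space."""
    def __init__(self, in_dim: int, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, in_dim)
        return self.proj(feats)                               # (B, T, d_model)

# One connector per modality; the input dims are made-up placeholders.
connectors = nn.ModuleDict({
    "image":  Connector(in_dim=1024),  # e.g. CLIP ViT-L/14 features
    "speech": Connector(in_dim=1280),  # e.g. Whisper encoder features
    "audio":  Connector(in_dim=768),   # e.g. BEATs features
})

def build_multimodal_sequence(text_embeds, modality_feats):
    """Concatenate projected modality tokens with the text token embeddings."""
    parts = [connectors[name](feats) for name, feats in modality_feats.items()]
    return torch.cat(parts + [text_embeds], dim=1)  # (B, T_total, d_model)

# Toy usage with random tensors standing in for real encoder outputs.
seq = build_multimodal_sequence(
    text_embeds=torch.randn(1, 16, D_MODEL),
    modality_feats={"image": torch.randn(1, 256, 1024),
                    "speech": torch.randn(1, 100, 1280)},
)
print(seq.shape)  # torch.Size([1, 372, 512])
```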

Training Strategy (a staged-training sketch in PyTorch follows this list):
1. Cross-modality alignment: train the diverse connectors that align each modality with the language model.
2. Modality-specific experts: train them on cross-modality instruction data.
3. Unified tuning: fine-tune the whole Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal data.

Technical Details:

Modality-Specific Encoders: CLIP for images, Whisper for speech, BEATs for audio.

MoE-Based Blocks: shared self-attention layers, feed-forward network (FFN) experts, and sparse routers for token-level expert allocation (a minimal sketch of such a block follows this list).

Efficient Training: LoRA is used to fine-tune the pre-trained experts and the self-attention layers.
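
As an illustration of the MoE-based block described above (my own sketch, not the authors' code), the snippet below implements token-level top-k routing over FFN experts behind a shared self-attention layer:

```python
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Token-level top-k routing over FFN experts (illustrative sketch)."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # sparse router
        self.top_k = top_k

    def forward(self, x):                                   # x: (B, T, d_model)
        B, T, D = x.shape
        tokens = x.reshape(B * T, D)
        logits = self.router(tokens)                        # (N, num_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalise top-k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out.reshape(B, T, D)

class MoETransformerBlock(nn.Module):
    """Shared self-attention followed by the sparse-MoE FFN."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.moe = SparseMoEFFN(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.moe(self.norm2(x))

# Toy forward pass on random "multimodal" tokens.
block = MoETransformerBlock()
print(block(torch.randn(2, 372, 512)).shape)  # torch.Size([2, 372, 512])
```

With top_k=2, each token is processed by only two of the experts, which is what keeps a sparse MoE cheaper at inference than a dense model of the same total parameter count.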

Uni-MoE outperforms traditional dense models on benchmarks such as A-OKVQA, OK-VQA, VQAv2, MMBench, RACE-Audio, and the English High School Listening Test.

The code is open-sourced as well: https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/Uni_MoE_v2

Paper: Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts (arXiv:2405.11273)
Falcon has landed... again!
And now it not only reads but sees as well.

Here is a summary of the Falcon-11B-VLM model:

Model Type: Causal decoder-only model.

Parameters: 11 billion.

Vision Integration: Couples the pretrained CLIP ViT-L/14 vision encoder with the recently released, chat-finetuned Falcon2-11B model, trained on image-text data.

Training: Pretrained on over 5,000 billion tokens from RefinedWeb together with curated corpora.

Dynamic Encoding: Enhances perception of fine-grained details in images (see the tiling sketch after this list).

Training Hardware: 16 A100 80GB GPUs with ZeRO and Flash-Attention 2.

Tokenizer: Falcon-7B/11B tokenizer.

Languages Supported: Primarily English, with capabilities in German, Spanish, French, Italian, Dutch, Romanian, Czech, Swedish, and more.

License: Open source under the TII Falcon License 2.0, based on Apache 2.0.
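
"Dynamic encoding" of high-resolution images is commonly implemented by splitting the image into fixed-size tiles plus a downscaled global view; the sketch below shows only that generic idea, under assumed tile sizes, and is not the model's documented preprocessing pipeline.

```python
from PIL import Image

TILE = 336  # hypothetical tile size, chosen to match a CLIP ViT-L/14 input resolution

def tile_image(img: Image.Image, tile: int = TILE):
    """Split a high-resolution image into fixed-size tiles plus a global thumbnail.

    Generic illustration of tile-based "dynamic" encoding; the actual
    Falcon-11B-VLM preprocessing may differ.
    """
    w, h = img.size
    grid_w, grid_h = -(-w // tile), -(-h // tile)           # ceil division
    canvas = img.resize((grid_w * tile, grid_h * tile))     # pad up to the tile grid
    tiles = [canvas.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
             for y in range(grid_h) for x in range(grid_w)]
    thumbnail = img.resize((tile, tile))                    # coarse global view
    return [thumbnail] + tiles

# Example: a 1000x700 image yields one thumbnail plus a 3x3 grid of 336px tiles.
views = tile_image(Image.new("RGB", (1000, 700)))
print(len(views))  # 10
```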

Model: tiiuae/falcon-11B-vlm
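
For a quick test drive, here is a minimal inference sketch using the generic transformers Auto classes; the exact model/processor classes and the expected prompt template should be taken from the model card, so the prompt string and image URL below are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "tiiuae/falcon-11B-vlm"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any RGB image works; this URL is only a placeholder example.
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    stream=True).raw)

# The prompt format here is an assumption; check the model card for the exact template.
prompt = "User:<image>\nDescribe this image in one sentence. Falcon:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```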
