NVIDIA just dropped a gigantic multimodal model called NVLM 72B
nvidia/NVLM-D-72B
Paper page NVLM: Open Frontier-Class Multimodal LLMs (2409.11402)
The paper contains many ablation studies on various ways to use the LLM backbone:
🦩 Flamingo-like cross-attention (NVLM-X)
🌋 Llava-like concatenation of image and text embeddings to a decoder-only model (NVLM-D)
✨ a hybrid architecture (NVLM-H)
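To make the difference between the first two options concrete, here is a toy sketch in plain Python. It is not NVIDIA's implementation: the function names, the tiny 2-dimensional embeddings, and the absence of learned projections are all assumptions made for illustration. It only shows that concatenation lengthens the decoder's input sequence, while cross-attention keeps the text sequence length and pulls image information in through attention. The hybrid variant is not sketched.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(image_tokens, text_tokens):
    """Decoder-only style (hypothetical helper): image embeddings join
    the same sequence as the text embeddings."""
    return image_tokens + text_tokens


def cross_attention_fusion(image_tokens, text_tokens):
    """Cross-attention style (hypothetical helper): each text token
    attends over the image tokens; the sequence length is unchanged.
    Real models use learned query/key/value projections, omitted here."""
    dim = len(text_tokens[0])
    fused = []
    for query in text_tokens:
        # Scaled dot-product score of this text token against every image token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the image tokens.
        attended = [
            sum(w * value[i] for w, value in zip(weights, image_tokens))
            for i in range(dim)
        ]
        # Residual connection: add the attended image summary to the text token.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy inputs: 3 image tokens and 2 text tokens, each 2-dimensional.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

concat_seq = concat_fusion(image_tokens, text_tokens)
cross_seq = cross_attention_fusion(image_tokens, text_tokens)

print(len(concat_seq))  # 5: image and text tokens share one sequence
print(len(cross_seq))   # 2: only the text tokens remain in the sequence
```

The practical trade-off this hints at: with concatenation, every image token costs decoder context length, while cross-attention keeps the decoder sequence short at the price of extra attention layers.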
In the reported evaluations, NVLM-D and NVLM-H rank best or second best among the models compared
The released model is NVLM-D, based on Qwen2-72B-Instruct and aligned with InternViT-6B using a huge mixture of different datasets
You can easily use this model by loading it through transformers' AutoModel
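A minimal sketch of what that loading step could look like. The helper name `load_nvlm_d` is hypothetical, and the keyword arguments are assumptions: `trust_remote_code=True` assumes the repository ships its own modeling code, and bfloat16 is assumed to be a suitable dtype. The model card may call for extra arguments or a device map, so treat this as a starting point rather than the official recipe.

```python
MODEL_ID = "nvidia/NVLM-D-72B"  # repository name as given in the post


def load_nvlm_d(model_id: str = MODEL_ID):
    """Hypothetical helper: load the checkpoint and its tokenizer.

    Calling this downloads a 72B-parameter model, so it needs a lot of
    disk space and accelerator memory.
    """
    # Imports are deferred so that defining the helper does not require
    # torch or transformers to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        trust_remote_code=True,  # assumption: the repo provides custom code
    )
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory use
        low_cpu_mem_usage=True,      # avoid materializing weights twice on CPU
        trust_remote_code=True,      # assumption: the repo provides custom code
    )
    return model.eval(), tokenizer
```

Calling `model, tokenizer = load_nvlm_d()` triggers the actual download. How to pass images and run generation depends on the custom code in the repository, so check the model card for the inference API.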