LLaVA-NeXT has recently been merged into Hugging Face Transformers, and it outperforms many closed-source models like Gemini on various benchmarks 🤩 Let's take a look!

Demo: merve/llava-next
Notebook: https://colab.research.google.com/drive/1afNudu72SNWZCYtCVrRlb9T9Vj9CFJEK?usp=sharing

LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨

LLaVA 1.5 was released with Vicuna, while LLaVA-NeXT (1.6) comes with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B

The Mistral and Nous-Hermes-Yi-34B variants perform better and have more permissive licenses for commercial use. Moreover, according to the authors' findings, the improvements come from a more diverse and high-quality data mixture and dynamic high resolution.

The LLaVA-NeXT variant based on Nous-Hermes-Yi-34B outperforms many other models, including Gemini, on various multimodal understanding and generation benchmarks 😊
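
Since the model is now in Transformers, you can try it in a few lines of code. Below is a minimal sketch assuming the llava-hf/llava-v1.6-mistral-7b-hf checkpoint on the Hub and the [INST] ... [/INST] prompt format documented for the Mistral-based variant; swap in another checkpoint (and its prompt template) as needed.

```python
# Minimal sketch: image question answering with LLaVA-NeXT in transformers (>= 4.39).
# The checkpoint id and prompt template below are assumptions for the Mistral-7B variant.
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; here we fetch one from a URL for illustration.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Mistral-based checkpoints expect the [INST] ... [/INST] chat format,
# with an <image> token marking where the image features are inserted.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate an answer conditioned on both the image and the text prompt.
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```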