import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("Video-LLaVA")
st.success("""[Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""We have recently merged Video-LLaVA to 🤗 Transformers! 🎞️
What makes this model different? Keep reading ⇊
""")
st.markdown(""" """)
st.video("pages/Video-LLaVA/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""[Demo](https://t.co/MVP14uEj9e) | [Model](https://t.co/oqSCMUqwJo)
See below how to initialize the model and the processor and run inference ⬇️
""")
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Compared to other models that take image and video input and either project them separately or downsampling video and projecting selected frames, Video-LLaVA is converting images and videos to unified representation and project them using a shared projection layer.
""")
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders that's based on OpenCLIP, these encoders project the modalities to an unified representation before passing to projection layer.
""")
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
I feel like one of the coolest features of this model is the joint understanding which is also introduced recently with many models.
It's a relatively older model but ahead of it's time and works very well!
""")
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_4.jpg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
[Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122)
by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA)
[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava)
""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Chameleon")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("SAMv2")