import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Video-LLaVA")
st.success("""[Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""We have recently merged Video-LLaVA to 🤗 Transformers! 🎞️ | |
What makes this model different? Keep reading ⇊ | |
""") | |
st.markdown(""" """)
st.video("pages/Video-LLaVA/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""[Demo](https://t.co/MVP14uEj9e) | [Model](https://t.co/oqSCMUqwJo) | |
See below how to initialize the model and processor and infer ⬇️ | |
""") | |
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_1.jpg", use_column_width=True)
st.markdown(""" """)
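st.markdown("""
Below is a minimal sketch of the initialization and inference flow shown in the screenshot above, based on the Hugging Face documentation for the `LanguageBind/Video-LLaVA-7B-hf` checkpoint (the video path and frame-sampling details are illustrative):
""")
st.code(r'''
import av
import numpy as np
import torch
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Sample 8 evenly spaced frames from a local video (path is illustrative)
container = av.open("example_video.mp4")
total_frames = container.streams.video[0].frames
indices = np.linspace(0, total_frames - 1, num=8).astype(int)
frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in indices
]
clip = np.stack(frames)  # (num_frames, height, width, 3)

# The prompt must contain the <video> placeholder token
prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
''', language="python")
st.markdown(""" """)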
st.markdown(""" | |
Compared to other models that take image and video input and either project them separately or downsampling video and projecting selected frames, Video-LLaVA is converting images and videos to unified representation and project them using a shared projection layer. | |
""") | |
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown(""" | |
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders that's based on OpenCLIP, these encoders project the modalities to an unified representation before passing to projection layer. | |
""") | |
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown(""" | |
I feel like one of the coolest features of this model is the joint understanding which is also introduced recently with many models. | |
It's a relatively older model but ahead of it's time and works very well! | |
""") | |
st.markdown(""" """)
st.image("pages/Video-LLaVA/image_4.jpg", use_column_width=True)
st.markdown(""" """)
st.info(""" | |
Ressources: | |
[Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122) | |
by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023) | |
[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA) | |
[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava) | |
""", icon="📚") | |
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Chameleon")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("SAMv2")