import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Llava-NeXT-Interleave")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1813560292397203630) (July 17, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""The vision language model in this video is 0.5B and can take in image, video and 3D! 🤯   
Llava-NeXT-Interleave is a new vision language model trained on interleaved image, video and 3D data keep reading ⥥⥥  
""")
st.markdown(""" """)

st.video("pages/Llava-NeXT-Interleave/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.markdown("""This model comes with 0.5B, 7B and 7B-DPO variants, all can be used with Transformers 😍  
[Collection of models](https://t.co/sZsaglSXa3) | [Demo](https://t.co/FbpaMWJY8k)  
See how to use below 👇🏻  
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Authors of this paper have explored training <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a> on interleaved data where the data consists of multiple modalities, including image(s), video, 3D 📚  
They have discovered that interleaved data increases results across all benchmarks! 
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The model can do task transfer from single image tasks to multiple images 🤯  
The authors have trained the model on single images and code yet the model can solve coding with multiple images.  
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Same applies to other modalities, see below for video:
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The model also has document understanding capabilities and many real-world applications.
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_5.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
This release also comes with the dataset this model was fine-tuned on 📖 [M4-Instruct-Data](https://t.co/rutXMtNC0I)
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_6.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:  
[LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/) 
by Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)  
[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Interleave.md)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("RT-DETR")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Chameleon")