Upload 174 files
This view is limited to 50 files because it contains too many changes. See raw diff for the full change set.
- .gitattributes +3 -0
- Home.py +16 -0
- README.md +4 -4
- pages/10_Painter.py +53 -0
- pages/11_SegGPT.py +70 -0
- pages/12_Grounding_DINO.py +92 -0
- pages/13_DocOwl_1.5.py +100 -0
- pages/14_PLLaVA.py +65 -0
- pages/15_CuMo.py +61 -0
- pages/16_DenseConnector.py +69 -0
- pages/17_Depth_Anything_V2.py +74 -0
- pages/18_Florence-2.py +78 -0
- pages/19_4M-21.py +70 -0
- pages/1_MobileSAM.py +79 -0
- pages/20_RT-DETR.py +67 -0
- pages/21_Llava-NeXT-Interleave.py +86 -0
- pages/22_Chameleon.py +88 -0
- pages/23_Video-LLaVA.py +70 -0
- pages/24_SAMv2.py +88 -0
- pages/2_Oneformer.py +62 -0
- pages/3_VITMAE.py +63 -0
- pages/4M-21/4M-21.md +32 -0
- pages/4M-21/image_1.jpg +0 -0
- pages/4M-21/image_2.jpg +0 -0
- pages/4M-21/image_3.jpg +0 -0
- pages/4M-21/video_1.mp4 +3 -0
- pages/4M-21/video_2.mp4 +0 -0
- pages/4_DINOv2.py +78 -0
- pages/5_SigLIP.py +78 -0
- pages/6_OWLv2.py +87 -0
- pages/7_Backbone.py +63 -0
- pages/8_Depth_Anything.py +100 -0
- pages/9_LLaVA-NeXT.py +74 -0
- pages/Backbone/Backbone.md +31 -0
- pages/Backbone/image_1.jpeg +0 -0
- pages/Backbone/image_2.jpeg +0 -0
- pages/Backbone/image_3.jpeg +0 -0
- pages/Backbone/image_4.jpeg +0 -0
- pages/Chameleon/Chameleon.md +43 -0
- pages/Chameleon/image_1.jpg +0 -0
- pages/Chameleon/image_2.jpg +0 -0
- pages/Chameleon/image_3.jpg +0 -0
- pages/Chameleon/image_4.jpg +0 -0
- pages/Chameleon/image_5.jpg +0 -0
- pages/Chameleon/image_6.jpg +0 -0
- pages/Chameleon/image_7.jpg +0 -0
- pages/Chameleon/video_1.mp4 +0 -0
- pages/CuMo/CuMo.md +24 -0
- pages/CuMo/image_1.jpg +0 -0
- pages/CuMo/image_2.jpg +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pages/4M-21/video_1.mp4 filter=lfs diff=lfs merge=lfs -text
+pages/Depth[[:space:]]Anything/video_1.mp4 filter=lfs diff=lfs merge=lfs -text
+pages/RT-DETR/video_1.mp4 filter=lfs diff=lfs merge=lfs -text
Home.py
ADDED
@@ -0,0 +1,16 @@
import streamlit as st

st.set_page_config(page_title="Home", page_icon="🏠")

# st.image("image_of_a_Turkish_lofi_girl_sitting_at_a_desk_writing_summaries_of_scientific_publications_ghibli_anime_like_hd.jpeg", use_column_width=True)

st.write("# Vision Papers 📚")


st.markdown(
    """
    I've created a simple Streamlit App where I list summaries of papers (my browser bookmarks or Twitter bookmarks were getting messy).
    Since you're one of my sources for bibliography, I thought you might be interested in having all your summaries grouped together somewhere
    (an average of 0.73 summaries per week; I don't know what your fuel is, but that's impressive).
    """
)
README.md
CHANGED
@@ -1,11 +1,11 @@
 ---
 title: Vision Papers
-emoji:
-colorFrom:
-colorTo:
+emoji: 💻
+colorFrom: indigo
+colorTo: blue
 sdk: streamlit
 sdk_version: 1.37.0
-app_file:
+app_file: Home.py
 pinned: false
 ---
pages/10_Painter.py
ADDED
@@ -0,0 +1,53 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Painter")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1771542172946354643) (March 23, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""I read the Painter [paper](https://t.co/r3aHp29mjf) by [BAAIBeijing](https://x.com/BAAIBeijing) to convert the weights to 🤗 Transformers, and I absolutely loved the approach they took, so I wanted to take the time to unfold it here!
""")
st.markdown(""" """)

st.image("pages/Painter/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""So essentially this model takes inspiration from in-context learning: just like in LLMs, where you give an example input-output pair and then the actual input you want the model to complete (one-shot learning), they adapted this to images, thus the name "images speak in images".

This model doesn't have any multimodal parts; it just has an image encoder and a decoder head (linear layer, conv layer, another linear layer), so it's single-modality.

The magic sauce is the data: they input the task as an image and its associated transformation, plus another image they want the transformation applied to, and take a smooth L2 loss over the predictions and ground truth. This is like the T5 of image models 😀
""")
st.markdown(""" """)

st.image("pages/Painter/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""What is so cool about it is that it can actually adapt to out-of-domain tasks: in the chart below, it was trained on the tasks above the dashed line, and the authors found that it generalized to the tasks below the line; image tasks are generalized well 🤯
""")
st.markdown(""" """)

st.image("pages/Painter/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[Images Speak in Images: A Generalist Painter for In-Context Visual Learning](https://arxiv.org/abs/2212.02499)
by Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang (2022)
[GitHub](https://github.com/baaivision/Painter)""", icon="📚")


st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("LLaVA-NeXT")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("SegGPT")
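To make the "images speak in images" setup above concrete, here is a minimal PyTorch sketch of the in-context input layout: an example input/output pair is stitched together with the query image into a 2x2 grid, the model paints the missing quadrant, and a regression loss is taken against the ground truth. All module sizes and the plain MSE loss are illustrative assumptions, not the exact Painter recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPainter(nn.Module):
    """Schematic Painter-style model: an image encoder plus a light decoder head."""
    def __init__(self, dim=64):
        super().__init__()
        # stand-in for the ViT encoder: a couple of conv layers, downsampling by 4
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )
        # decoder head in the spirit of "linear, conv, linear", applied per location
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim, 1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, 3 * 16, 1),  # predict a 4x4 RGB patch per feature location
            nn.PixelShuffle(4),
        )

    def forward(self, grid):
        return self.head(self.encoder(grid))

# Build the in-context "grid": example input, example output, query input,
# and a masked slot where the model must paint the query output.
B, H, W = 2, 64, 64
example_in, example_out = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
query_in, query_gt = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
masked = torch.zeros_like(query_gt)

top = torch.cat([example_in, example_out], dim=-1)   # left: inputs, right: outputs
bottom = torch.cat([query_in, masked], dim=-1)
grid = torch.cat([top, bottom], dim=-2)              # (B, 3, 2H, 2W)

model = TinyPainter()
pred = model(grid)
# regression loss only on the masked (bottom-right) quadrant
loss = F.mse_loss(pred[..., H:, W:], query_gt)
loss.backward()
print(grid.shape, pred.shape, float(loss))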
pages/11_SegGPT.py
ADDED
@@ -0,0 +1,70 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("SegGPT")

st.success("""[Original tweet](https://x.com/mervenoyann/status/1773056450790666568) (March 27, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""SegGPT is a vision generalist for image segmentation, quite like GPT for computer vision ✨
It comes with the latest release of 🤗 Transformers 🎁
Technical details, demo and how-to's under this!
""")
st.markdown(""" """)

st.image("pages/SegGPT/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""SegGPT is an extension of the <a href='Painter' target='_self'>Painter</a> where you speak to images with images: the model takes in an image prompt, a transformed version of the image prompt, and the actual image you want the same transform applied to, and is expected to output the transformed image.

SegGPT consists of a vanilla ViT with a decoder on top (linear, conv, linear). The model is trained on diverse segmentation examples, where they provide example image-mask pairs and the actual input to be segmented, and the decoder head learns to reconstruct the mask output. 👇🏻
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/SegGPT/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
This generalizes pretty well!
The authors do not claim state-of-the-art results, as the model is mainly used for zero-shot and few-shot inference. They also do prompt tuning, where they freeze the parameters of the model and only optimize the image tensor (the input context).
""")
st.markdown(""" """)

st.image("pages/SegGPT/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Thanks to 🤗 Transformers you can use this model easily! See [here](https://t.co/U5pVpBhkfK).
""")
st.markdown(""" """)

st.image("pages/SegGPT/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
I have built an app for you to try it out. I combined SegGPT with the Depth Anything model, so you don't have to upload image mask prompts in your prompt pair 🤗
Try it [here](https://t.co/uJIwqJeYUy). Also check out the [collection](https://t.co/HvfjWkAEzP).
""")
st.markdown(""" """)

st.image("pages/SegGPT/image_5.jpeg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284)
by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang (2023)
[GitHub](https://github.com/baaivision/Painter)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Painter")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Grounding DINO")
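Since the page above points at the Transformers integration, a minimal inference sketch looks roughly like the following. The `BAAI/seggpt-vit-large` checkpoint, the processor keyword names and the post-processing helper are taken from memory of the model docs and may need adjusting to your Transformers version.

import torch
from PIL import Image
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

# prompt pair: an example image and its mask, plus the new image to segment
prompt_image = Image.open("prompt_image.png").convert("RGB")
prompt_mask = Image.open("prompt_mask.png").convert("RGB")
image = Image.open("input_image.png").convert("RGB")

inputs = processor(
    images=image,
    prompt_images=prompt_image,
    prompt_masks=prompt_mask,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# resize the predicted mask back to the input resolution
target_sizes = [image.size[::-1]]  # (height, width)
mask = processor.post_process_semantic_segmentation(outputs, target_sizes=target_sizes)[0]
print(mask.shape)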
pages/12_Grounding_DINO.py
ADDED
@@ -0,0 +1,92 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Grounding DINO")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1780558859221733563) (April 17, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""
We have merged Grounding DINO into 🤗 Transformers 🦖
It's an amazing zero-shot object detection model, here's why 🧶
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""There are two zero-shot object detection model families as of now: one is the OWL series by Google Brain and the other is Grounding DINO 🦕
Grounding DINO pays immense attention to detail ⬇️
Also [try it yourself](https://t.co/UI0CMxphE7).
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_2.jpeg", use_column_width=True)
st.image("pages/Grounding_DINO/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""I have also built another [application](https://t.co/4EHpOwEpm0) for GroundingSAM, combining GroundingDINO and Segment Anything by Meta for cutting-edge zero-shot image segmentation.
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Grounding DINO is essentially a model with a connected image encoder (Swin Transformer), a text encoder (BERT) and, on top of both, a decoder that outputs bounding boxes 🦖
This is quite similar to the <a href='OWLv2' target='_self'>OWL series</a>, which uses a ViT-based detector on CLIP.
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_5.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""The authors train Swin-L/T with BERT contrastively (not like CLIP, where images are matched to texts by means of similarity); instead, they try to align the region outputs to language phrases at the head outputs 🤩
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_6.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""The authors also form the text features on the sub-sentence level.
This means it extracts certain noun phrases from the training data to remove the influence between unrelated words while keeping fine-grained information.
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_7.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Thanks to all of this, Grounding DINO has great performance on various REC/object detection benchmarks 🏆📈
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_8.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Thanks to 🤗 Transformers, you can use Grounding DINO very easily!
You can also check out [NielsRogge](https://twitter.com/NielsRogge)'s [notebook here](https://t.co/8ADGFdVkta).
""")
st.markdown(""" """)

st.image("pages/Grounding_DINO/image_9.jpeg", use_column_width=True)


st.info("""Resources:
[Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)
by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang (2023)
[GitHub](https://github.com/IDEA-Research/GroundingDINO)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/grounding-dino)""", icon="📚")


st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("SegGPT")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("DocOwl 1.5")
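For the "use it easily with 🤗 Transformers" part above, a minimal zero-shot detection sketch looks like this. It follows the Grounding DINO documentation pattern; the `IDEA-Research/grounding-dino-tiny` checkpoint and the example image URL are assumptions you can swap out.

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# free-form queries: lowercase phrases, each ending with a dot
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# turn raw logits/boxes into detections at the original image resolution
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)
print(results[0]["boxes"])
print(results[0]["labels"])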
pages/13_DocOwl_1.5.py
ADDED
@@ -0,0 +1,100 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("DocOwl 1.5")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba, with an Apache 2.0 license 😍📝
Time to dive in and learn more 🧶
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""This model consists of a ViT-based visual encoder part that takes in crops of the image and the original image itself.
The outputs of the encoder then go through a convolution-based model; after that, the outputs are merged with text and fed to the LLM.
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Initially, the authors only train the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen.
Then, for fine-tuning (on image captioning, VQA etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs 🤓
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""They train the model on various downstream tasks including:
- document understanding (DUE benchmark and more)
- table parsing (TURL, PubTabNet)
- chart parsing (PlotQA and more)
- image parsing (OCR-CC)
- text localization (DocVQA and more)
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_5.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
They contribute a new model called DocOwl 1.5-Chat by:
1. creating a new document-chat dataset with questions from document VQA datasets
2. feeding them to ChatGPT to get long answers
3. fine-tuning the base model with it (which IMO works very well!)
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_6.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The resulting generalist model and the chat model are pretty much state-of-the-art 😍
Below you can see how it compares to fine-tuned models.
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_7.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""All the models and the datasets (also some eval datasets for the above tasks!) are in this [organization](https://t.co/sJdTw1jWTR).
The [Space](https://t.co/57E9DbNZXf).
""")
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_8.jpeg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)""", icon="📚")


st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Grounding DINO")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("PLLaVA")
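The two-stage recipe described above (first train the H-Reducer and vision encoder with the LLM frozen, then freeze the vision encoder and train the H-Reducer and LLM) boils down to toggling `requires_grad` per stage. A schematic PyTorch sketch with placeholder modules (the real H-Reducer and LLM are of course much bigger; only the freezing pattern is the point here):

import torch.nn as nn

class ToyDocVLM(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, dim, 16, stride=16)        # stand-in for the ViT
        self.h_reducer = nn.Conv2d(dim, dim, (1, 4), stride=(1, 4))   # conv-based reducer (placeholder)
        self.llm = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # stand-in LLM block

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = ToyDocVLM()

# Stage 1: structure-aware pre-training — train vision encoder + H-Reducer, freeze LLM
set_trainable(model.vision_encoder, True)
set_trainable(model.h_reducer, True)
set_trainable(model.llm, False)

# Stage 2: multi-task fine-tuning — freeze vision encoder, train H-Reducer + LLM
set_trainable(model.vision_encoder, False)
set_trainable(model.h_reducer, True)
set_trainable(model.llm, True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters in stage 2: {trainable}")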
pages/14_PLLaVA.py
ADDED
@@ -0,0 +1,65 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("PLLaVA")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1786336055425138939) (May 3, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Parameter-free LLaVA for video captioning works like magic! 🤩 Let's take a look!
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Most video captioning models work by downsampling video frames, trying to reduce computational complexity and memory requirements without losing too much information in the process.
PLLaVA, on the other hand, uses pooling! 🤩

How? 🧐
It takes in video frames, passes them through the ViT and then a projection layer, and the output then goes through average pooling, where the input shape is (# frames, width, height, text decoder input dim) 👇
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""The pooling operation surprisingly reduces the loss of spatial and temporal information. See below some examples of how it can capture the details 🤗
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""According to the authors' findings, it performs way better than many of the existing models (including proprietary VLMs) and scales very well (on the text decoder side).
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Model repositories 🤗 [7B](https://t.co/AeSdYsz1U7), [13B](https://t.co/GnI1niTxO7), [34B](https://t.co/HWAM0ZzvDc)
Spaces 🤗 [7B](https://t.co/Oms2OLkf7O), [13B](https://t.co/C2RNVNA4uR)
""")
st.markdown(""" """)

st.info("""
Resources:
[PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994)
by Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng (2024)
[GitHub](https://github.com/magic-research/PLLaVA)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("DocOwl 1.5")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("CuMo")
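The pooling step described above is easy to picture in code: vision features shaped (frames, height, width, hidden_dim) are adaptively average-pooled over the temporal and spatial axes before being handed to the text decoder. A minimal PyTorch sketch with made-up sizes (the actual pooling target shape is a PLLaVA hyperparameter):

import torch
import torch.nn as nn

# per-frame ViT features after the projection layer:
# (batch, frames, height, width, hidden_dim)
feats = torch.randn(1, 16, 24, 24, 4096)

# AdaptiveAvgPool3d expects (batch, channels, D, H, W), so move hidden_dim to channels
x = feats.permute(0, 4, 1, 2, 3)            # (1, 4096, 16, 24, 24)

# pool frames and spatial dims down to a fixed budget of visual tokens
pool = nn.AdaptiveAvgPool3d((16, 12, 12))   # (frames, height, width) after pooling
x = pool(x)                                  # (1, 4096, 16, 12, 12)

# flatten back into a token sequence for the text decoder
tokens = x.flatten(2).transpose(1, 2)        # (1, 16*12*12, 4096)
print(tokens.shape)                          # torch.Size([1, 2304, 4096])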
pages/15_CuMo.py
ADDED
@@ -0,0 +1,61 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("CuMo")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1790665706205307191) (May 15, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""
It's raining vision language models ☔️
CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part 🤓
""")
st.markdown(""" """)

st.image("pages/CuMo/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors first pre-train the MLP by freezing the image encoder and text decoder; then they warm up the whole network by unfreezing and fine-tuning, which they state stabilizes the visual instruction tuning when bringing in the experts.
""")
st.markdown(""" """)

st.image("pages/CuMo/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The mixture-of-experts MLP blocks above are simply MLP blocks initialized from the single MLP that was trained during pre-training and fine-tuned in pre-finetuning 👇
""")
st.markdown(""" """)

st.image("pages/CuMo/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
It works very well (I also tested it myself): it outperforms the previous SOTA of its size, <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a>! 😍
I wonder how it would compare to IDEFICS2-8B. You can try it yourself [here](https://t.co/MLIYKVh5Ee).
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/CuMo/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949)
by Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen (2024)
[GitHub](https://github.com/SHI-Labs/CuMo)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("PLLaVA")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("DenseConnector")
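The "co-upcycling" idea above, where each expert MLP starts from the single pre-trained MLP, can be sketched like this. The toy soft router and expert count are illustrative assumptions; the real CuMo uses top-K routing with auxiliary losses.

import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    """Turn one trained MLP into a mixture of experts, each initialized from its weights."""
    def __init__(self, trained_mlp: nn.Module, num_experts: int = 4, dim: int = 512):
        super().__init__()
        # every expert starts as an exact copy of the already-trained MLP
        self.experts = nn.ModuleList([copy.deepcopy(trained_mlp) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                                   # x: (tokens, dim)
        weights = self.router(x).softmax(dim=-1)            # (tokens, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (tokens, dim, E)
        return (expert_out * weights.unsqueeze(1)).sum(-1)  # weighted sum over experts

dim = 512
trained_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
moe = UpcycledMoE(trained_mlp, num_experts=4, dim=dim)
print(moe(torch.randn(10, dim)).shape)  # torch.Size([10, 512])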
pages/16_DenseConnector.py
ADDED
@@ -0,0 +1,69 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("DenseConnector")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1796089181988352216) (May 30, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Do we fully leverage image encoders in vision language models? 👀
A new paper built a dense connector that does it better! Let's dig in 🧶
""")
st.markdown(""" """)

st.image("pages/DenseConnector/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
VLMs consist of an image encoder block, a projection layer that projects image embeddings into the text embedding space, and a text decoder, connected sequentially 📖
This [paper](https://t.co/DPQzbj0eWm) explores using the intermediate states of the image encoder instead of a single output 🤩
""")
st.markdown(""" """)

st.image("pages/DenseConnector/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors explore three different ways of instantiating the dense connector: sparse token integration, sparse channel integration and dense channel integration (each of them just takes intermediate outputs and puts them together in different ways, see below).
""")
st.markdown(""" """)

st.image("pages/DenseConnector/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
They integrate all three of them into LLaVA 1.5 and find that each of the new models is superior to the original LLaVA 1.5.
""")
st.markdown(""" """)

st.image("pages/DenseConnector/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
I tried the [model](https://huggingface.co/spaces/HuanjinYao/DenseConnector-v1.5-8B) and it seems to work very well 🥹
The authors have released various [checkpoints](https://t.co/iF8zM2qvDa) based on different decoders (Vicuna 7/13B and Llama 3-8B).
""")
st.markdown(""" """)

st.image("pages/DenseConnector/image_5.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[Dense Connector for MLLMs](https://arxiv.org/abs/2405.13800)
by Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang (2024)
[GitHub](https://github.com/HJYao00/DenseConnector)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("CuMo")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Depth Anything v2")
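As a concrete picture of "dense channel integration", the connector concatenates hidden states from several encoder layers along the feature dimension before the projector. A minimal sketch using a CLIP vision tower from 🤗 Transformers; the chosen layer indices and projector size are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
from transformers import CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pixel_values = torch.randn(1, 3, 224, 224)

# keep hidden states from every layer, not just the final output
out = vision(pixel_values, output_hidden_states=True)
hidden_states = out.hidden_states            # tuple of (1, num_tokens, 1024)

# dense channel integration (schematic): concatenate a few intermediate layers
# along the channel dimension
selected = [hidden_states[i] for i in (8, 16, 24)]
dense = torch.cat(selected, dim=-1)          # (1, num_tokens, 3 * 1024)

# project into the text-decoder embedding space
projector = nn.Linear(dense.shape[-1], 4096)
visual_tokens = projector(dense)
print(visual_tokens.shape)                   # (1, num_tokens, 4096)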
pages/17_Depth_Anything_V2.py
ADDED
@@ -0,0 +1,74 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Depth Anything V2")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1803063120354492658) (June 18, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""
I love Depth Anything V2 😍
It's <a href='Depth_Anything' target='_self'>Depth Anything</a>, but scaled with both a larger teacher model and a gigantic dataset! Let's unpack 🤓🧶!
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/Depth_Anything_v2/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors have analyzed Marigold, a diffusion-based model, against Depth Anything and found out what's up with using synthetic images vs real images for MDE:
🔖 Real data has a lot of label noise and inaccurate depth maps (caused by depth sensors missing transparent objects etc.)
🔖 Synthetic data has more precise and detailed depth labels and is truly ground-truth, but there's a distribution shift between real and synthetic images, and it has restricted scene coverage
""")
st.markdown(""" """)

st.image("pages/Depth_Anything_v2/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors train different image encoders only on synthetic images and find that unless the encoder is very large the model can't generalize well (but large models generalize inherently anyway) 🧐
But they still fail when encountering real images that have a wide distribution in labels 🥲
""")
st.markdown(""" """)

st.image("pages/Depth_Anything_v2/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The Depth Anything V2 framework is to...
🦖 Train a teacher model based on DINOv2-G on 595K synthetic images
🏷️ Label 62M real images using the teacher model
🦕 Train a student model using the real images labelled by the teacher
Result: 10x faster and more accurate than Marigold!
""")
st.markdown(""" """)

st.image("pages/Depth_Anything_v2/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors also construct a new benchmark called DA-2K that is less noisy, highly detailed and more diverse!
I have created a [collection](https://t.co/3fAB9b2sxi) that has the models, the dataset, the demo and the CoreML-converted model 😚
""")
st.markdown(""" """)

st.info("""
Resources:
[Depth Anything V2](https://arxiv.org/abs/2406.09414)
by Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)
[GitHub](https://github.com/DepthAnything/Depth-Anything-V2)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/depth_anything_v2)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("DenseConnector")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Florence-2")
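The student models are plain monocular depth estimators, so running one through 🤗 Transformers is essentially a one-liner with the depth-estimation pipeline. The checkpoint name below is an assumption (one of the converted models from the collection linked above); swap in whichever size you need.

from PIL import Image
from transformers import pipeline

# small variant; the base/large checkpoints from the collection work the same way
depth = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("example.jpg")
result = depth(image)
result["depth"].save("depth_map.png")   # PIL image with the predicted relative depth
print(result["predicted_depth"].shape)  # raw tensor of depth values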
pages/18_Florence-2.py
ADDED
@@ -0,0 +1,78 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Florence-2")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1803769866878623819) (June 20, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Florence-2 is a new vision foundation model by Microsoft capable of a wide variety of tasks 🤯
Let's unpack! 🧶
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
This model can handle tasks that vary from document understanding to semantic segmentation 🤩
[Demo](https://t.co/7YJZvjhw84) | [Collection](https://t.co/Ub7FGazDz1)
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The difference from previous models is that the authors have compiled a dataset that consists of 126M images with 5.4B annotations labelled with their own data engine ↓↓
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The dataset also offers more variety in annotations compared to other datasets: it has region-level and image-level annotations, with more variety in semantic granularity as well!
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The model has a similar architecture to previous models: an image encoder and a multimodality encoder with a text decoder.
The authors have compiled the multitask dataset with prompts for each task, which makes the model trainable on multiple tasks 🤗
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_5.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
You can also fine-tune this model on any task of choice; the authors released results on different downstream tasks and report their results when un/freezing the vision encoder 🤓📉
They have released fine-tuned models too, you can find them in the collection above 🤗
""")
st.markdown(""" """)

st.image("pages/Florence-2/image_6.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks](https://arxiv.org/abs/2311.06242)
by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan (2023)
[Hugging Face blog post](https://huggingface.co/blog/finetune-florence2)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Depth Anything V2")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("4M-21")
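Since each Florence-2 task is driven by a prompt token, an inference sketch looks roughly like this. The model ships its own code via `trust_remote_code`, so the processor helpers and the `microsoft/Florence-2-base` checkpoint below follow the model card from memory; double-check against the blog post linked above.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
task = "<OD>"  # object detection; other tasks use prompts like "<CAPTION>" or "<OCR>"

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# parse the generated string into task-specific outputs (boxes + labels for <OD>)
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed)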
pages/19_4M-21.py
ADDED
@@ -0,0 +1,70 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("4M-21")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1804138208814309626) (June 21, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""
EPFL and Apple just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
Let's unpack 🧶
""")
st.markdown(""" """)

st.image("pages/4M-21/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""4M is a multimodal training [framework](https://t.co/jztLublfSF) introduced by Apple and EPFL.
The resulting model takes image and text and outputs image and text 🤩
[Models](https://t.co/1LC0rAohEl) | [Demo](https://t.co/Ra9qbKcWeY)
""")
st.markdown(""" """)

st.video("pages/4M-21/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.markdown("""
This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data:
input and output tokens are decoded to generate bounding boxes, generated image pixels, captions and more!
""")
st.markdown(""" """)

st.image("pages/4M-21/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
This model also learned to generate Canny maps, SAM edges and other things for steerable text-to-image generation 🖼️
The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️
""")
st.markdown(""" """)

st.image("pages/4M-21/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
On the project page you can also see the model's text-to-image and steered generation capabilities, with the model's own outputs as control masks!
""")
st.markdown(""" """)

st.video("pages/4M-21/video_2.mp4", format="video/mp4")
st.markdown(""" """)

st.info("""
Resources:
[4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities](https://arxiv.org/abs/2406.09406) by Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir (2024)
[GitHub](https://github.com/apple/ml-4m/)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Florence-2")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("RT-DETR")
pages/1_MobileSAM.py
ADDED
@@ -0,0 +1,79 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("MobileSAM")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1738959605542076863) (December 24, 2023)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Read the MobileSAM paper this weekend 📖 Sharing some insights!
The idea 💡: the SAM model consists of three parts, a heavy image encoder, a prompt encoder (the prompt can be text, bounding box, mask or point) and a mask decoder.

To make the SAM model smaller without compromising performance, the authors looked into three types of distillation.
The first one is distilling the decoder outputs directly (a more naive approach), with a completely randomly initialized small ViT and a randomly initialized mask decoder.
However, when the ViT and the decoder are both in a bad state, this doesn't work well.
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The second type of distillation is called semi-coupled, where the authors only randomly initialized the ViT image encoder and kept the mask decoder.
This is called semi-coupled because the image encoder distillation still depends on the mask decoder (see below 👇)
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The last type of distillation, [decoupled distillation](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Decoupled_Knowledge_Distillation_CVPR_2022_paper.pdf), is the most intuitive IMO.
The authors have "decoupled" the image encoder altogether, frozen the mask decoder, and didn't really distill based on generated masks.
This makes sense, as the bottleneck here is the encoder itself and, most of the time, distillation works well with encoding.
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Finally, they found that decoupled distillation performs better than coupled distillation in terms of mean IoU and requires much less compute! ♥️
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Wanted to leave some links here if you'd like to try it yourself 👇
- MobileSAM [demo](https://huggingface.co/spaces/dhkim2810/MobileSAM)
- Model [repository](https://huggingface.co/dhkim2810/MobileSAM)

If you'd like to experiment with TinyViT, the [timm library](https://huggingface.co/docs/timm/index) ([Ross Wightman](https://x.com/wightmanr)) has a bunch of [checkpoints available](https://huggingface.co/models?sort=trending&search=timm%2Ftinyvit).
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_5.jpeg", use_column_width=True)
st.markdown(""" """)


st.info("""
Resources:
[Faster Segment Anything: Towards Lightweight SAM for Mobile Applications](https://arxiv.org/abs/2306.14289)
by Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong (2023)
[GitHub](https://github.com/ChaoningZhang/MobileSAM)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Home")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("OneFormer")
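The "decoupled" distillation above amounts to matching the small encoder's image embeddings to the frozen SAM encoder's embeddings, with no mask decoder in the loop. A bare-bones PyTorch sketch with placeholder conv encoders (the real recipe distills SAM's ViT-H into a TinyViT):

import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-ins: teacher = heavy SAM image encoder (frozen), student = small ViT
teacher = nn.Conv2d(3, 256, 16, stride=16)   # placeholder for SAM's ViT-H encoder
student = nn.Conv2d(3, 256, 16, stride=16)   # placeholder for TinyViT
teacher.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(3):  # pretend training loop over image batches
    images = torch.rand(4, 3, 256, 256)
    with torch.no_grad():
        target = teacher(images)             # teacher image embeddings
    pred = student(images)                   # student image embeddings
    loss = F.mse_loss(pred, target)          # distill embeddings only, no mask decoder
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss={loss.item():.4f}")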
pages/20_RT-DETR.py
ADDED
@@ -0,0 +1,67 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("RT-DETR")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1807790959884665029) (July 1, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Real-time DEtection Transformer (RT-DETR) landed in 🤗 Transformers with an Apache 2.0 license 😍
Do DETRs beat YOLOs on real-time object detection? Keep reading 👀
""")
st.markdown(""" """)

st.video("pages/RT-DETR/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.markdown("""
Short answer: it does! 📖 [notebook](https://t.co/NNRpG9cAEa), 🔖 [models](https://t.co/ctwWQqNcEt), 🔖 [demo](https://t.co/VrmDDDjoNw)

YOLO models are known to be super fast for real-time computer vision, but they have a downside: they are sensitive to NMS 🥲
Transformer-based models, on the other hand, are not as computationally efficient 🥲
Isn't there something in between? Enter RT-DETR!

The authors combined a CNN backbone and a multi-stage hybrid encoder (combining convs and attention) with a transformer decoder ⇓
""")
st.markdown(""" """)

st.image("pages/RT-DETR/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
In the paper, the authors also claim one can adjust speed by changing the number of decoder layers without retraining altogether.
They also conduct many ablation studies and try different decoders.
""")
st.markdown(""" """)

st.image("pages/RT-DETR/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors find that the model performs better in terms of speed and accuracy compared to the previous state-of-the-art 🤩
""")
st.markdown(""" """)

st.image("pages/RT-DETR/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[DETRs Beat YOLOs on Real-time Object Detection](https://arxiv.org/abs/2304.08069)
by Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen (2023)
[GitHub](https://github.com/lyuwenyu/RT-DETR/)
[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/rt_detr)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("4M-21")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Llava-NeXT-Interleave")
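For the Transformers integration mentioned above, an inference sketch looks like this. The `PekingU/rtdetr_r50vd` checkpoint and the input file name are assumptions; the post-processing call mirrors the usual object-detection image processors.

import torch
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

checkpoint = "PekingU/rtdetr_r50vd"
processor = RTDetrImageProcessor.from_pretrained(checkpoint)
model = RTDetrForObjectDetection.from_pretrained(checkpoint)

image = Image.open("street.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# convert logits/boxes into detections at the original resolution
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())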
pages/21_Llava-NeXT-Interleave.py
ADDED
@@ -0,0 +1,86 @@
import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Llava-NeXT-Interleave")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1813560292397203630) (July 17, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""The vision language model in this video is 0.5B and can take in image, video and 3D! 🤯
Llava-NeXT-Interleave is a new vision language model trained on interleaved image, video and 3D data. Keep reading ⥥⥥
""")
st.markdown(""" """)

st.video("pages/Llava-NeXT-Interleave/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.markdown("""This model comes with 0.5B, 7B and 7B-DPO variants, all of which can be used with Transformers 😍
[Collection of models](https://t.co/sZsaglSXa3) | [Demo](https://t.co/FbpaMWJY8k)
See how to use it below 👇🏻
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The authors of this paper have explored training <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a> on interleaved data, where the data consists of multiple modalities, including image(s), video and 3D 📚
They have discovered that interleaved data improves results across all benchmarks!
""", unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The model can do task transfer from single-image tasks to multiple images 🤯
The authors trained the model on single images and code, yet the model can solve coding tasks with multiple images.
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The same applies to other modalities; see below for video:
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The model also has document understanding capabilities and many real-world application areas.
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_5.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
This release also comes with the dataset this model was fine-tuned on 📖 [M4-Instruct-Data](https://t.co/rutXMtNC0I)
""")
st.markdown(""" """)

st.image("pages/Llava-NeXT-Interleave/image_6.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:
[LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/)
by Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)
[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Interleave.md)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("RT-DETR")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Chameleon")
pages/22_Chameleon.py
ADDED
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("Chameleon")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1814278511785312320) (July 19, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""Chameleon 🦎 by Meta is now available in 🤗 Transformers.
|
10 |
+
A multimodal model that comes in 7B and 34B sizes 🤩
|
11 |
+
But what makes this model so special? Keep reading ⇣
|
12 |
+
""")
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.video("pages/Chameleon/video_1.mp4", format="video/mp4")
|
16 |
+
st.markdown(""" """)
|
17 |
+
|
18 |
+
st.markdown("""
|
19 |
+
[Demo](https://t.co/GsGE17fSdI) | [Models](https://t.co/cWUiVbsRz6)
|
20 |
+
Find below the API to load this model locally use it ⬇️
|
21 |
+
""")
|
22 |
+
st.markdown(""" """)
|
23 |
+
|
24 |
+
st.image("pages/Chameleon/image_1.jpg", use_column_width=True)
|
25 |
+
st.markdown(""" """)
|
26 |
+
|
27 |
+
st.markdown("""Chameleon is a unique model: it attempts to scale early fusion 🤨
|
28 |
+
But what is early fusion?
|
29 |
+
Modern vision language models use a vision encoder with a projection layer to project image embeddings so it can be promptable to text decoder.""")
|
30 |
+
st.markdown(""" """)
|
31 |
+
|
32 |
+
st.image("pages/Chameleon/image_2.jpg", use_column_width=True)
|
33 |
+
st.markdown(""" """)
|
34 |
+
|
35 |
+
st.markdown("""
|
36 |
+
Early fusion on the other hand attempts to fuse all features together (image patches and text) by using an image tokenizer and all tokens are projected into a shared space, which enables seamless generation 😏
|
37 |
+
""")
|
38 |
+
st.markdown(""" """)
|
39 |
+
|
40 |
+
st.image("pages/Chameleon/image_3.jpg", use_column_width=True)
|
41 |
+
st.markdown(""" """)
|
42 |
+
|
43 |
+
st.markdown("""
|
44 |
+
Authors have also introduced different architectural improvements (QK norm and revise placement of layer norms) for scalable and stable training.
|
45 |
+
This way they were able to increase the token count (5x tokens compared to Llama 3 which is a must with early-fusion IMO) .
|
46 |
+
""")
|
47 |
+
st.markdown(""" """)
|
48 |
+
|
49 |
+
st.image("pages/Chameleon/image_4.jpg", use_column_width=True)
|
50 |
+
st.markdown(""" """)
|
51 |
+
|
52 |
+
st.markdown("""
|
53 |
+
This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation are disabled to prevent malicious use.
|
54 |
+
""")
|
55 |
+
st.markdown(""" """)
|
56 |
+
|
57 |
+
st.image("pages/Chameleon/image_5.jpg", use_column_width=True)
|
58 |
+
st.markdown(""" """)
|
59 |
+
|
60 |
+
st.markdown("""
|
61 |
+
One can also do text-only prompting: the authors note that the model catches up with larger LLMs, and you can also see how it compares to VLMs with image-text prompting.
|
62 |
+
""")
|
63 |
+
st.markdown(""" """)
|
64 |
+
|
65 |
+
st.image("pages/Chameleon/image_6.jpg", use_column_width=True)
|
66 |
+
st.image("pages/Chameleon/image_6.jpg", use_column_width=True)
|
67 |
+
st.markdown(""" """)
|
68 |
+
|
69 |
+
st.info("""
|
70 |
+
Resources:
|
71 |
+
[Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://arxiv.org/abs/2405.09818)
|
72 |
+
by Chameleon Team (2024)
|
73 |
+
[GitHub](https://github.com/facebookresearch/chameleon)
|
74 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/chameleon)""", icon="📚")
|
75 |
+
|
76 |
+
st.markdown(""" """)
|
77 |
+
st.markdown(""" """)
|
78 |
+
st.markdown(""" """)
|
79 |
+
col1, col2, col3 = st.columns(3)
|
80 |
+
with col1:
|
81 |
+
if st.button('Previous paper', use_container_width=True):
|
82 |
+
switch_page("Llava-NeXT-Interleave")
|
83 |
+
with col2:
|
84 |
+
if st.button('Home', use_container_width=True):
|
85 |
+
switch_page("Home")
|
86 |
+
with col3:
|
87 |
+
if st.button('Next paper', use_container_width=True):
|
88 |
+
switch_page("Video-LLaVA")
|
pages/23_Video-LLaVA.py
ADDED
@@ -0,0 +1,70 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("Video-LLaVA")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""We have recently merged Video-LLaVA to 🤗 Transformers! 🎞️
|
10 |
+
What makes this model different? Keep reading ⇊
|
11 |
+
""")
|
12 |
+
st.markdown(""" """)
|
13 |
+
|
14 |
+
st.video("pages/Video-LLaVA/video_1.mp4", format="video/mp4")
|
15 |
+
st.markdown(""" """)
|
16 |
+
|
17 |
+
st.markdown("""[Demo](https://t.co/MVP14uEj9e) | [Model](https://t.co/oqSCMUqwJo)
|
18 |
+
See below how to initialize the model and processor and infer ⬇️
|
19 |
+
""")
|
20 |
+
st.markdown(""" """)
|
21 |
+
|
22 |
+
st.image("pages/Video-LLaVA/image_1.jpg", use_column_width=True)
|
23 |
+
st.markdown(""" """)
|
24 |
+
|
25 |
+
st.markdown("""
|
26 |
+
Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos to a unified representation and projects them using a shared projection layer.
|
27 |
+
""")
|
28 |
+
st.markdown(""" """)
|
29 |
+
|
30 |
+
st.image("pages/Video-LLaVA/image_2.jpg", use_column_width=True)
|
31 |
+
st.markdown(""" """)
|
32 |
+
|
33 |
+
st.markdown("""
|
34 |
+
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP; these encoders project the modalities to a unified representation before passing them to the projection layer.
|
35 |
+
""")
|
36 |
+
st.markdown(""" """)
|
37 |
+
|
38 |
+
st.image("pages/Video-LLaVA/image_3.jpg", use_column_width=True)
|
39 |
+
st.markdown(""" """)
|
40 |
+
|
41 |
+
st.markdown("""
|
42 |
+
I feel like one of the coolest features of this model is joint image-video understanding, which has also been introduced recently in many models.
|
43 |
+
It's a relatively older model but ahead of its time and works very well!
|
44 |
+
""")
|
45 |
+
st.markdown(""" """)
|
46 |
+
|
47 |
+
st.image("pages/Video-LLaVA/image_4.jpg", use_column_width=True)
|
48 |
+
st.markdown(""" """)
|
49 |
+
|
50 |
+
st.info("""
|
51 |
+
Resources:
|
52 |
+
[Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122)
|
53 |
+
by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
|
54 |
+
[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA)
|
55 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava)
|
56 |
+
""", icon="📚")
|
57 |
+
|
58 |
+
st.markdown(""" """)
|
59 |
+
st.markdown(""" """)
|
60 |
+
st.markdown(""" """)
|
61 |
+
col1, col2, col3 = st.columns(3)
|
62 |
+
with col1:
|
63 |
+
if st.button('Previous paper', use_container_width=True):
|
64 |
+
switch_page("Chameleon")
|
65 |
+
with col2:
|
66 |
+
if st.button('Home', use_container_width=True):
|
67 |
+
switch_page("Home")
|
68 |
+
with col3:
|
69 |
+
if st.button('Next paper', use_container_width=True):
|
70 |
+
switch_page("SAMv2")
|
pages/24_SAMv2.py
ADDED
@@ -0,0 +1,88 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("SAMv2")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1818675981634109701) (July 31, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""SAMv2 is just mindblowingly good 😍
|
10 |
+
Learn what makes this model so good at video segmentation, keep reading 🦆⇓
|
11 |
+
""")
|
12 |
+
st.markdown(""" """)
|
13 |
+
|
14 |
+
col1, col2, col3 = st.columns(3)
|
15 |
+
with col2:
|
16 |
+
st.video("pages/SAMv2/video_1.mp4", format="video/mp4")
|
17 |
+
st.markdown(""" """)
|
18 |
+
|
19 |
+
st.markdown("""
|
20 |
+
Check out the [demo](https://t.co/35ixEZgPaf) by [skalskip92](https://x.com/skalskip92) to see how to use the model locally.
|
21 |
+
Check out Meta's [demo](https://t.co/Bcbli9Cfim) where you can edit segmented instances too!
|
22 |
+
|
23 |
+
Segment Anything Model by Meta was released as a universal segmentation model in which you could give a box or point prompt to segment the object of interest.
|
24 |
+
SAM consists of an image encoder to encode images, a prompt encoder to encode prompts, then outputs of these two are given to a mask decoder to generate masks.
|
25 |
+
""")
|
26 |
+
st.markdown(""" """)
|
27 |
+
|
28 |
+
st.image("pages/SAMv2/image_1.jpg", use_column_width=True)
|
29 |
+
st.markdown(""" """)
|
30 |
+
|
31 |
+
st.markdown("""
|
32 |
+
However, SAM doesn't naturally track object instances in videos: one would need to give the same mask or point prompt for that instance in every single frame and feed in each frame, which is infeasible 😔
|
33 |
+
But don't fret, that is where SAMv2 comes in with a memory module!
|
34 |
+
|
35 |
+
SAMv2 defines a new task called "masklet prediction", where a masklet refers to the same mask instance throughout the frames 🎞️
|
36 |
+
Unlike SAM, the SAM 2 decoder is not fed the image embedding directly from the image encoder, but rather attention over memories of prompted frames and object pointers.
|
37 |
+
""")
|
38 |
+
st.markdown(""" """)
|
39 |
+
|
40 |
+
st.image("pages/SAMv2/image_2.jpg", use_column_width=True)
|
41 |
+
st.markdown(""" """)
|
42 |
+
|
43 |
+
st.markdown("""
|
44 |
+
🖼️ These "memories" are essentially past predictions of object of interest up to a number of recent frames,
|
45 |
+
and are in the form of feature maps with location info (spatial feature maps).
|
46 |
+
👉🏻 The object pointers are high-level semantic information about the object of interest.
|
47 |
+
|
48 |
+
Just like the SAM paper, SAMv2 depends on a data engine, and the dataset it generated comes with the release: SA-V 🤯
|
49 |
+
This dataset is gigantic, it has 190.9K manual masklet annotations and 451.7K automatic masklets!
|
50 |
+
""")
|
51 |
+
st.markdown(""" """)
|
52 |
+
|
53 |
+
st.image("pages/SAMv2/image_3.jpg", use_column_width=True)
|
54 |
+
st.markdown(""" """)
|
55 |
+
|
56 |
+
st.markdown("""
|
57 |
+
Initially they apply SAM to each frame to assist human annotators in annotating a video at six FPS for high-quality data;
|
58 |
+
in the second phase they add SAM and SAM 2 to generate masklets consistently across time. Finally, they use SAM 2 to refine the masklets.
|
59 |
+
|
60 |
+
They have evaluated this model on J&F score (Jaccard Index + F-measure for contour acc) which is used to evaluate zero-shot
|
61 |
+
video segmentation benchmarks.
|
62 |
+
SAMv2 seems to outperform two previous state-of-the-art models that are built on top of SAM! 🥹
|
63 |
+
""")
|
64 |
+
st.markdown(""" """)
|
65 |
+
|
66 |
+
st.image("pages/SAMv2/image_4.jpg", use_column_width=True)
|
67 |
+
st.markdown(""" """)
|
68 |
+
|
69 |
+
st.info("""
|
70 |
+
Resources:
|
71 |
+
[SAM 2: Segment Anything in Images and Videos]()
|
72 |
+
by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer (2024)
|
73 |
+
[GitHub](https://github.com/facebookresearch/segment-anything-2)
|
74 |
+
[Hugging Face documentation]()""", icon="📚")
|
75 |
+
|
76 |
+
st.markdown(""" """)
|
77 |
+
st.markdown(""" """)
|
78 |
+
st.markdown(""" """)
|
79 |
+
col1, col2, col3 = st.columns(3)
|
80 |
+
with col1:
|
81 |
+
if st.button('Previous paper', use_container_width=True):
|
82 |
+
switch_page("Video-LLaVA")
|
83 |
+
with col2:
|
84 |
+
if st.button('Home', use_container_width=True):
|
85 |
+
switch_page("Home")
|
86 |
+
with col3:
|
87 |
+
if st.button('Next paper', use_container_width=True):
|
88 |
+
switch_page("Home")
|
pages/2_Oneformer.py
ADDED
@@ -0,0 +1,62 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("OneFormer")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1739707076501221608) (December 26, 2023)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""
|
10 |
+
OneFormer: one model to segment them all? 🤯
|
11 |
+
I was looking into paperswithcode leaderboards when I came across OneFormer for the first time so it was time to dig in!
|
12 |
+
""")
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.image("pages/OneFormer/image_1.jpeg", use_column_width=True)
|
16 |
+
st.markdown(""" """)
|
17 |
+
|
18 |
+
st.markdown("""OneFormer is a "truly universal" model for semantic, instance and panoptic segmentation tasks ⚔️
|
19 |
+
What makes it truly universal is that it's a single model that is trained only once and can be used across all tasks 👇
|
20 |
+
""")
|
21 |
+
st.markdown(""" """)
|
22 |
+
|
23 |
+
st.image("pages/OneFormer/image_2.jpeg", use_column_width=True)
|
24 |
+
st.markdown(""" """)
|
25 |
+
|
26 |
+
st.markdown("""
|
27 |
+
The enabler here is the text conditioning, i.e. the model is given a text query that states the task type along with the appropriate input and, using contrastive loss, learns the difference between different task types 👇
|
28 |
+
""")
|
29 |
+
st.markdown(""" """)
|
30 |
+
|
31 |
+
st.image("pages/OneFormer/image_3.jpeg", use_column_width=True)
|
32 |
+
st.markdown(""" """)
|
33 |
+
|
34 |
+
st.markdown("""Thanks to 🤗 Transformers, you can easily use the model!
|
35 |
+
I have drafted a [notebook](https://t.co/cBylk1Uv20) for you to try right away 😊
|
36 |
+
You can also check out the [Space](https://t.co/31GxlVo1W5) without checking out the code itself.
|
37 |
+
""")
|
38 |
+
st.markdown(""" """)
|
39 |
+
|
40 |
+
st.image("pages/OneFormer/image_4.jpeg", use_column_width=True)
|
41 |
+
st.markdown(""" """)
|
42 |
+
|
43 |
+
st.info("""
|
44 |
+
Resources:
|
45 |
+
[OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220)
|
46 |
+
by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi (2022)
|
47 |
+
[GitHub](https://github.com/SHI-Labs/OneFormer)
|
48 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/oneformer)""", icon="📚")
|
49 |
+
|
50 |
+
st.markdown(""" """)
|
51 |
+
st.markdown(""" """)
|
52 |
+
st.markdown(""" """)
|
53 |
+
col1, col2, col3 = st.columns(3)
|
54 |
+
with col1:
|
55 |
+
if st.button('Previous paper', use_container_width=True):
|
56 |
+
switch_page("MobileSAM")
|
57 |
+
with col2:
|
58 |
+
if st.button('Home', use_container_width=True):
|
59 |
+
switch_page("Home")
|
60 |
+
with col3:
|
61 |
+
if st.button('Next paper', use_container_width=True):
|
62 |
+
switch_page("VITMAE")
|
pages/3_VITMAE.py
ADDED
@@ -0,0 +1,63 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("VITMAE")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1740688304784183664) (December 29, 2023)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""Just read VitMAE paper, sharing some highlights 🧶
|
10 |
+
ViTMAE is a simple yet effective self-supervised pre-training technique, where the authors combined a vision transformer with a masked autoencoder.
|
11 |
+
The images are first masked (75 percent of the image!) and then the model learns the features by trying to reconstruct the original image!
|
12 |
+
""")
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.image("pages/VITMAE/image_1.jpeg", use_column_width=True)
|
16 |
+
st.markdown(""" """)
|
17 |
+
|
18 |
+
st.markdown("""The image is not masked, but rather only the visible patches are fed to the encoder (and that is the only thing encoder sees!).
|
19 |
+
Next, mask tokens are added where the masked patches were (a bit like BERT, if you will) and the mask tokens and encoded patches are fed to the decoder.
|
20 |
+
The decoder then tries to reconstruct the original image.
|
21 |
+
""")
|
22 |
+
st.markdown(""" """)
|
23 |
+
|
24 |
+
st.image("pages/VITMAE/image_2.jpeg", use_column_width=True)
|
25 |
+
st.markdown(""" """)
|
26 |
+
|
27 |
+
st.markdown("""As a result, the authors found out that high masking ratio works well in fine-tuning for downstream tasks and linear probing 🤯🤯
|
28 |
+
""")
|
29 |
+
st.markdown(""" """)
|
30 |
+
|
31 |
+
st.image("pages/VITMAE/image_3.jpeg", use_column_width=True)
|
32 |
+
st.markdown(""" """)
|
33 |
+
|
34 |
+
st.markdown("""If you want to try the model or fine-tune, all the pre-trained VITMAE models released released by Meta are available on [Huggingface](https://t.co/didvTL9Zkm).
|
35 |
+
We've built a [demo](https://t.co/PkuACJiKrB) for you to see the intermediate outputs and reconstruction by VITMAE.
|
36 |
+
|
37 |
+
Also there's a nice [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) by [@NielsRogge](https://twitter.com/NielsRogge).
|
38 |
+
""")
|
39 |
+
st.markdown(""" """)
|
40 |
+
|
41 |
+
st.image("pages/VITMAE/image_4.jpeg", use_column_width=True)
|
42 |
+
st.markdown(""" """)
|
43 |
+
|
44 |
+
st.info("""
|
45 |
+
Resources:
|
46 |
+
[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v3)
|
47 |
+
by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021)
|
48 |
+
[GitHub](https://github.com/facebookresearch/mae)
|
49 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/vit_mae)""", icon="📚")
|
50 |
+
|
51 |
+
st.markdown(""" """)
|
52 |
+
st.markdown(""" """)
|
53 |
+
st.markdown(""" """)
|
54 |
+
col1, col2, col3 = st.columns(3)
|
55 |
+
with col1:
|
56 |
+
if st.button('Previous paper', use_container_width=True):
|
57 |
+
switch_page("OneFormer")
|
58 |
+
with col2:
|
59 |
+
if st.button('Home', use_container_width=True):
|
60 |
+
switch_page("Home")
|
61 |
+
with col3:
|
62 |
+
if st.button('Next paper', use_container_width=True):
|
63 |
+
switch_page("DINOV2")
|
pages/4M-21/4M-21.md
ADDED
@@ -0,0 +1,32 @@
1 |
+
EPFL and Apple just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀 Let's unpack 🧶
|
2 |
+
|
3 |
+
![image_1](image_1.jpg)
|
4 |
+
|
5 |
+
4M is a multimodal training [framework](https://t.co/jztLublfSF) introduced by Apple and EPFL.
|
6 |
+
The resulting model takes image and text and outputs image and text 🤩
|
7 |
+
[Models](https://t.co/1LC0rAohEl) | [Demo](https://t.co/Ra9qbKcWeY)
|
8 |
+
|
9 |
+
![video_1](video_1.mp4)
|
10 |
+
|
11 |
+
This model consists of transformer encoder and decoder, where the key to multimodality lies in input and output data: input and output tokens are decoded to generate bounding boxes, generated image's pixels, captions and more!
|
12 |
+
|
13 |
+
![image_2](image_2.jpg)
|
14 |
+
|
15 |
+
This model also learnt to generate canny maps, SAM edges and other things for steerable text-to-image generation 🖼️
|
16 |
+
The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️
|
17 |
+
|
18 |
+
![image_3](image_3.jpg)
|
19 |
+
|
20 |
+
In the project page you can also see the model's text-to-image and steered generation capabilities with model's own outputs as control masks!
|
21 |
+
|
22 |
+
![video_2](video_2.mp4)
|
23 |
+
|
24 |
+
|
25 |
+
> [!TIP]
|
26 |
+
Resources:
|
27 |
+
[4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities](https://arxiv.org/abs/2406.09406)
|
28 |
+
by Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir (2024)
|
29 |
+
[GitHub](https://github.com/apple/ml-4m/)
|
30 |
+
|
31 |
+
> [!NOTE]
|
32 |
+
[Original tweet](https://twitter.com/mervenoyann/status/1804138208814309626) (June 21, 2024)
|
pages/4M-21/image_1.jpg
ADDED
pages/4M-21/image_2.jpg
ADDED
pages/4M-21/image_3.jpg
ADDED
pages/4M-21/video_1.mp4
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:9cd40cb677314a9384da8e644ad3bb9eba3e23a39e776f5ce8c1437ebf3d06d8
|
3 |
+
size 1073547
|
pages/4M-21/video_2.mp4
ADDED
Binary file (461 kB). View file
|
|
pages/4_DINOv2.py
ADDED
@@ -0,0 +1,78 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("DINOv2")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1743290724672495827) (January 5, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""DINOv2 is the king for self-supervised learning in images 🦖🦕
|
10 |
+
But how does it work? I've tried to explain how it works but let's expand on it 🧶
|
11 |
+
""")
|
12 |
+
st.markdown(""" """)
|
13 |
+
|
14 |
+
st.image("pages/DINOv2/image_1.jpeg", use_column_width=True)
|
15 |
+
st.markdown(""" """)
|
16 |
+
|
17 |
+
st.markdown("""
|
18 |
+
DINOv2 is essentially DINO on steroids, so let's talk about DINOv1 first 🦕
|
19 |
+
It's essentially a pre-training technique to train ViTs with self-supervision, that uses an unusual way of distillation 🧟♂️👨🏻🏫.
|
20 |
+
Distillation is a technique where there's a large pre-trained model (teacher), and you have a smaller model (student) initialized randomly.
|
21 |
+
Then, while training the student, you take both models' outputs, calculate the divergence between them and update the loss accordingly.
|
22 |
+
In this case, we have no labels! And the teacher is not pretrained!!!! 🤯
|
23 |
+
Well, the outputs here are the distributions, and the teacher is iteratively updated from the student, which is called an exponential moving average.
|
24 |
+
""")
|
25 |
+
st.markdown(""" """)
|
26 |
+
|
27 |
+
st.image("pages/DINOv2/image_2.jpg", use_column_width=True)
|
28 |
+
st.markdown(""" """)
|
29 |
+
|
30 |
+
st.markdown("""
|
31 |
+
DINO doesn't use any contrastive loss or clustering but only cross entropy loss (again, what a paper) which leads the model to collapse.
|
32 |
+
This can be avoided by normalizing the teacher output multiple times, but the authors instead center (to squish the logits) and sharpen (through temperature) the teacher outputs.
|
33 |
+
Finally, local and global crops are given to the student while only global crops are given to the teacher, and this pushes the student to identify context from small parts of the image.
|
34 |
+
""")
|
35 |
+
st.markdown(""" """)
|
36 |
+
|
37 |
+
st.image("pages/DINOv2/image_3.jpeg", use_column_width=True)
|
38 |
+
st.markdown(""" """)
|
39 |
+
|
40 |
+
st.markdown("""How does DINOv2 improve DINO?
|
41 |
+
⚡️ More efficient thanks to FSDP and Flash Attention
|
42 |
+
🦖 Has a very efficient data augmentation technique that apparently scales to 100M+ images (put below)
|
43 |
+
👨🏻🏫 Uses ViT-g instead of training from scratch
|
44 |
+
""")
|
45 |
+
st.markdown(""" """)
|
46 |
+
|
47 |
+
st.image("pages/DINOv2/image_4.jpeg", use_column_width=True)
|
48 |
+
st.markdown(""" """)
|
49 |
+
|
50 |
+
st.markdown("""
|
51 |
+
The model is so powerful that you can use DINOv2 even with kNN or linear classifiers without any need for fine-tuning!
|
52 |
+
But if you'd like DINOv2 to work even better, [NielsRogge](https://twitter.com/NielsRogge) has built a [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DINOv2/Fine\_tune\_DINOv2\_for\_image\_classification\_%5Bminimal%5D.ipynb) to fine-tune it using Trainer 📖
|
53 |
+
He also has a [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DINOv2/Train\_a\_linear\_classifier\_on\_top\_of\_DINOv2\_for\_semantic\_segmentation.ipynb) if you feel like training a linear classifier only 📔
|
54 |
+
All the different DINO/v2 model checkpoints are [here](https://huggingface.co/models?search=dino).
|
55 |
+
Lastly, special thanks to [ykilcher](https://twitter.com/ykilcher) as I couldn't make sense of certain things in the paper and watched his awesome [tutorial](https://youtube.com/watch?v=h3ij3F) 🤩
|
56 |
+
""")
|
57 |
+
st.markdown(""" """)
|
58 |
+
|
59 |
+
st.info("""
|
60 |
+
Resources:
|
61 |
+
[DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
|
62 |
+
by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski (2023)
|
63 |
+
[GitHub](https://github.com/facebookresearch/dinov2)
|
64 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/dinov2)""", icon="📚")
|
65 |
+
|
66 |
+
st.markdown(""" """)
|
67 |
+
st.markdown(""" """)
|
68 |
+
st.markdown(""" """)
|
69 |
+
col1, col2, col3 = st.columns(3)
|
70 |
+
with col1:
|
71 |
+
if st.button('Previous paper', use_container_width=True):
|
72 |
+
switch_page("VITMAE")
|
73 |
+
with col2:
|
74 |
+
if st.button('Home', use_container_width=True):
|
75 |
+
switch_page("Home")
|
76 |
+
with col3:
|
77 |
+
if st.button('Next paper', use_container_width=True):
|
78 |
+
switch_page("SigLIP")
|
pages/5_SigLIP.py
ADDED
@@ -0,0 +1,78 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("SigLIP")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1745476609686089800) (January 11. 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""SigLIP just got merged to 🤗 Transformers and it's super easy to use!
|
10 |
+
To celebrate this, I have created a repository on various SigLIP based projects!
|
11 |
+
But what is it and how does it work?
|
12 |
+
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is largest for matching text-image pairs.
|
13 |
+
The image below is taken from CLIP, where this contrastive pre-training takes place with softmax, but SigLIP replaces softmax with sigmoid. 📎
|
14 |
+
""")
|
15 |
+
st.markdown(""" """)
|
16 |
+
|
17 |
+
st.image("pages/SigLIP/image_1.jpg", use_column_width=True)
|
18 |
+
st.markdown(""" """)
|
19 |
+
|
20 |
+
st.markdown("""
|
21 |
+
Highlights✨
|
22 |
+
🖼️📝 Authors used medium sized B/16 ViT for image encoder and B-sized transformer for text encoder
|
23 |
+
😍 More performant than CLIP on zero-shot
|
24 |
+
🗣️ Authors trained a multilingual model too!
|
25 |
+
⚡️ Super efficient: the sigmoid loss enables up to 1M items per batch, but the authors chose 32k (see the performance saturation below)
|
26 |
+
""")
|
27 |
+
st.markdown(""" """)
|
28 |
+
|
29 |
+
st.image("pages/SigLIP/image_2.jpg", use_column_width=True)
|
30 |
+
st.markdown(""" """)
|
31 |
+
|
32 |
+
st.markdown("""
|
33 |
+
Below you can find prior CLIP models and SigLIP across different image encoder sizes and their performance on different datasets 👇🏻
|
34 |
+
""")
|
35 |
+
st.markdown(""" """)
|
36 |
+
|
37 |
+
st.image("pages/SigLIP/image_3.jpg", use_column_width=True)
|
38 |
+
st.markdown(""" """)
|
39 |
+
|
40 |
+
st.markdown("""
|
41 |
+
With the 🤗 Transformers integration comes the zero-shot-image-classification pipeline, which makes SigLIP super easy to use!
|
42 |
+
""")
|
43 |
+
st.markdown(""" """)
|
44 |
+
|
45 |
+
st.image("pages/SigLIP/image_4.jpg", use_column_width=True)
|
46 |
+
st.markdown(""" """)
|
47 |
+
|
48 |
+
st.markdown("""
|
49 |
+
What to use SigLIP for? 🧐
|
50 |
+
Honestly the possibilities are endless, but you can use it for image/text retrieval, zero-shot classification, training multimodal models!
|
51 |
+
I have made a repository with notebooks and applications that are also hosted on [Spaces](https://t.co/Ah1CrHVuPY).
|
52 |
+
I have built ["Draw to Search Art"](https://t.co/DcmQWMc1qd) where you can input image (upload one or draw) and search among 10k images in wikiart!
|
53 |
+
I've also built apps to [compare](https://t.co/m699TMvuW9) CLIP and SigLIP outputs.
|
54 |
+
""")
|
55 |
+
st.markdown(""" """)
|
56 |
+
|
57 |
+
st.image("pages/SigLIP/image_5.jpg", use_column_width=True)
|
58 |
+
st.markdown(""" """)
|
59 |
+
|
60 |
+
st.info("""
|
61 |
+
Resources:
|
62 |
+
[Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343)
|
63 |
+
by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (2023)
|
64 |
+
[GitHub](https://github.com/google-research/big_vision)
|
65 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/siglip)""", icon="📚")
|
66 |
+
st.markdown(""" """)
|
67 |
+
st.markdown(""" """)
|
68 |
+
st.markdown(""" """)
|
69 |
+
col1, col2, col3 = st.columns(3)
|
70 |
+
with col1:
|
71 |
+
if st.button('Previous paper', use_container_width=True):
|
72 |
+
switch_page("DINOv2")
|
73 |
+
with col2:
|
74 |
+
if st.button('Home', use_container_width=True):
|
75 |
+
switch_page("Home")
|
76 |
+
with col3:
|
77 |
+
if st.button('Next paper', use_container_width=True):
|
78 |
+
switch_page("OWLv2")
|
pages/6_OWLv2.py
ADDED
@@ -0,0 +1,87 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("OWLv2")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1748411972675150040) (January 19, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉🧶""")
|
10 |
+
st.markdown(""" """)
|
11 |
+
|
12 |
+
st.image("pages/OWLv2/image_1.jpeg", use_column_width=True)
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.markdown("""
|
16 |
+
OWLv2 is a scaled version of a model called OWL-ViT, so let's take a look at that first 📝
|
17 |
+
OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training 👀
|
18 |
+
What's cool is that it can take both image and text queries! This is thanks to how the image and text features aren't fused together.
|
19 |
+
""")
|
20 |
+
st.markdown(""" """)
|
21 |
+
|
22 |
+
st.image("pages/OWLv2/image_2.jpeg", use_column_width=True)
|
23 |
+
st.markdown(""" """)
|
24 |
+
|
25 |
+
st.markdown("""Taking a look at the architecture, the authors firstly do contrastive pre-training of a vision and a text encoder (just like CLIP).
|
26 |
+
They take that model, remove the final pooling layer and attach a lightweight classification and box detection head and fine-tune.
|
27 |
+
""")
|
28 |
+
st.markdown(""" """)
|
29 |
+
|
30 |
+
st.image("pages/OWLv2/image_3.jpeg", use_column_width=True)
|
31 |
+
st.markdown(""" """)
|
32 |
+
|
33 |
+
st.markdown("""During fine-tuning for object detection, they calculate the loss over bipartite matches.
|
34 |
+
Simply put, loss is calculated over the predicted objects against ground truth objects and the goal is to find a perfect match of these two sets where each object is matched to one object in ground truth.
|
35 |
+
|
36 |
+
OWL-ViT is very scalable.
|
37 |
+
One can easily scale most language models or vision-language models because they require no supervision, but this isn't the case for object detection: you still need supervision.
|
38 |
+
Moreover, only scaling the encoders creates a bottleneck after a while.
|
39 |
+
""")
|
40 |
+
st.markdown(""" """)
|
41 |
+
|
42 |
+
st.image("pages/OWLv2/image_1.jpeg", use_column_width=True)
|
43 |
+
st.markdown(""" """)
|
44 |
+
|
45 |
+
st.markdown("""
|
46 |
+
The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT for labelling to train a better detector, "self-train" a new detector on the labels, and fine-tune the model on human-annotated data.
|
47 |
+
""")
|
48 |
+
st.markdown(""" """)
|
49 |
+
|
50 |
+
st.image("pages/OWLv2/image_4.jpeg", use_column_width=True)
|
51 |
+
st.markdown(""" """)
|
52 |
+
|
53 |
+
st.markdown("""
|
54 |
+
Thanks to this, OWLv2 scaled very well and tops the leaderboards on open-vocabulary object detection 👑
|
55 |
+
""")
|
56 |
+
st.markdown(""" """)
|
57 |
+
|
58 |
+
st.image("pages/OWLv2/image_5.jpeg", use_column_width=True)
|
59 |
+
st.markdown(""" """)
|
60 |
+
|
61 |
+
st.markdown("""
|
62 |
+
Want to try OWL models?
|
63 |
+
I've created a [notebook](https://t.co/ick5tA6nyx) for you to see how to use it with 🤗 Transformers.
|
64 |
+
If you want to play with it directly, you can use this [Space](https://t.co/oghdLOtoa5).
|
65 |
+
All the models and applications of the OWL series are in this [collection](https://huggingface.co/collections/merve/owl-series-65aaac3114e6582c300544df).
|
66 |
+
""")
|
67 |
+
st.markdown(""" """)
|
68 |
+
|
69 |
+
st.info("""
|
70 |
+
Resources:
|
71 |
+
[Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683)
|
72 |
+
by Matthias Minderer, Alexey Gritsenko, Neil Houlsby (2023)
|
73 |
+
[GitHub](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
|
74 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/owlv2)""", icon="📚")
|
75 |
+
st.markdown(""" """)
|
76 |
+
st.markdown(""" """)
|
77 |
+
st.markdown(""" """)
|
78 |
+
col1, col2, col3 = st.columns(3)
|
79 |
+
with col1:
|
80 |
+
if st.button('Previous paper', use_container_width=True):
|
81 |
+
switch_page("SigLIP")
|
82 |
+
with col2:
|
83 |
+
if st.button('Home', use_container_width=True):
|
84 |
+
switch_page("Home")
|
85 |
+
with col3:
|
86 |
+
if st.button('Next paper', use_container_width=True):
|
87 |
+
switch_page("Backbone")
|
pages/7_Backbone.py
ADDED
@@ -0,0 +1,63 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("Backbone")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://x.com/mervenoyann/status/1749841426177810502) (January 23, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""Many cutting-edge computer vision models consist of multiple stages:
|
10 |
+
➰ backbone extracts the features,
|
11 |
+
➰ neck refines the features,
|
12 |
+
➰ head makes the detection for the task.
|
13 |
+
Implementing this is cumbersome, so 🤗 Transformers has an API for this: Backbone!
|
14 |
+
""")
|
15 |
+
st.markdown(""" """)
|
16 |
+
|
17 |
+
st.image("pages/Backbone/image_1.jpeg", use_column_width=True)
|
18 |
+
st.markdown(""" """)
|
19 |
+
|
20 |
+
st.markdown("""
|
21 |
+
Let's see an example of such model.
|
22 |
+
Assuming we would like to initialize a multi-stage instance segmentation model with a ResNet backbone, a MaskFormer neck and a head, you can use the backbone API as follows (comments left for clarity) 👇
|
23 |
+
""")
|
24 |
+
st.markdown(""" """)
|
25 |
+
|
26 |
+
st.image("pages/Backbone/image_2.jpeg", use_column_width=True)
|
27 |
+
st.markdown(""" """)
|
28 |
+
|
29 |
+
st.markdown("""One can also use a backbone just to get features from any stage. You can initialize any backbone with `AutoBackbone` class.
|
30 |
+
See below how to initialize a backbone and get the feature maps at any stage 👇
|
31 |
+
""")
|
32 |
+
st.markdown(""" """)
|
33 |
+
|
34 |
+
st.image("pages/Backbone/image_3.jpeg", use_column_width=True)
|
35 |
+
st.markdown(""" """)
|
36 |
+
|
37 |
+
st.markdown("""
|
38 |
+
The Backbone API also supports any timm backbone of your choice! Check out a variety of timm backbones [here](https://t.co/Voiv0QCPB3).
|
39 |
+
""")
|
40 |
+
st.markdown(""" """)
|
41 |
+
|
42 |
+
st.image("pages/Backbone/image_4.jpeg", use_column_width=True)
|
43 |
+
st.markdown(""" """)
|
44 |
+
|
45 |
+
st.markdown("""
|
46 |
+
Leaving some links 🔗
|
47 |
+
📖 I've created a [notebook](https://t.co/PNfmBvdrtt) for you to play with it
|
48 |
+
📒 [Backbone API docs](https://t.co/Yi9F8qAigO)
|
49 |
+
📓 [AutoBackbone docs](https://t.co/PGo9oILHDw) (all written with love by me!💜)""")
|
50 |
+
|
51 |
+
st.markdown(""" """)
|
52 |
+
st.markdown(""" """)
|
53 |
+
st.markdown(""" """)
|
54 |
+
col1, col2, col3 = st.columns(3)
|
55 |
+
with col1:
|
56 |
+
if st.button('Previous paper', use_container_width=True):
|
57 |
+
switch_page("OWLv2")
|
58 |
+
with col2:
|
59 |
+
if st.button('Home', use_container_width=True):
|
60 |
+
switch_page("Home")
|
61 |
+
with col3:
|
62 |
+
if st.button('Next paper', use_container_width=True):
|
63 |
+
switch_page("Depth Anything")
|
pages/8_Depth_Anything.py
ADDED
@@ -0,0 +1,100 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("Depth Anything")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1750531698008498431) (January 25, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨🧶
|
10 |
+
It has just been integrated in transformers for super-easy use.
|
11 |
+
We compared it against DPTs and benchmarked it as well! You can find the usage, benchmark, demos and more below 👇
|
12 |
+
""")
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.video("pages/Depth_Anything/video_1.mp4", format="video/mp4")
|
16 |
+
st.markdown(""" """)
|
17 |
+
|
18 |
+
st.markdown("""
|
19 |
+
The paper starts with highlighting previous depth estimation methods and the limitations regarding the data coverage. 👀
|
20 |
+
The model's success heavily depends on unlocking the use of unlabeled datasets, although initially the authors used self-training and failed.
|
21 |
+
|
22 |
+
What the authors have done:
|
23 |
+
➰ Train a teacher model on labelled dataset
|
24 |
+
➰ Guide the student using the teacher and also use unlabelled datasets pseudo-labelled by the teacher. However, this was the cause of the failure: as both architectures were similar, the outputs were the same.
|
25 |
+
""")
|
26 |
+
st.markdown(""" """)
|
27 |
+
|
28 |
+
st.image("pages/Depth_Anything/image_1.jpg", use_column_width=True)
|
29 |
+
st.markdown(""" """)
|
30 |
+
|
31 |
+
st.markdown("""
|
32 |
+
So the authors have added a more difficult optimization target for student to learn additional knowledge on unlabeled images that went through color jittering, distortions, Gaussian blurring and spatial distortion, so it can learn more invariant representations from them.
|
33 |
+
|
34 |
+
The architecture consists of a <a href='DINOv2' target='_self'>DINOv2</a> encoder to extract the features, followed by a DPT decoder. At first, they train the teacher model on labelled images, and then they jointly train the student model, adding in the dataset pseudo-labelled by ViT-L.
|
35 |
+
""", unsafe_allow_html=True)
|
36 |
+
|
37 |
+
st.markdown(""" """)
|
38 |
+
|
39 |
+
st.image("pages/Depth_Anything/image_1.jpg", use_column_width=True)
|
40 |
+
st.markdown(""" """)
|
41 |
+
|
42 |
+
st.markdown("""Thanks to this, Depth Anything performs very well! I have also benchmarked the inference duration of the model against different models here. I also ran `torch.compile` benchmarks across them and got nice speed-ups 🚀
|
43 |
+
|
44 |
+
On T4 GPU, mean of 30 inferences for each. Inferred using `pipeline` (pre-processing and post-processing included with model inference).
|
45 |
+
|
46 |
+
| Model/Batch Size | 16 | 4 | 1 |
|
47 |
+
| ----------------------------- | --------- | -------- | ------- |
|
48 |
+
| intel/dpt-large | 2709.652 | 667.799 | 172.617 |
|
49 |
+
| facebook/dpt-dinov2-small-nyu | 2534.854 | 654.822 | 159.754 |
|
50 |
+
| facebook/dpt-dinov2-base-nyu | 4316.8733 | 1090.824 | 266.699 |
|
51 |
+
| Intel/dpt-beit-large-512 | 7961.386 | 2036.743 | 497.656 |
|
52 |
+
| depth-anything-small | 1692.368 | 415.915 | 143.379 |
|
53 |
+
|
54 |
+
`torch.compile`’s benchmarks with reduce-overhead mode: we have compiled the model and loaded it to the pipeline for the benchmarks to be fair.
|
55 |
+
|
56 |
+
| Model/Batch Size | 16 | 4 | 1 |
|
57 |
+
| ----------------------------- | -------- | -------- | ------- |
|
58 |
+
| intel/dpt-large | 2556.668 | 645.750 | 155.153 |
|
59 |
+
| facebook/dpt-dinov2-small-nyu | 2415.25 | 610.967 | 148.526 |
|
60 |
+
| facebook/dpt-dinov2-base-nyu | 4057.909 | 1035.672 | 245.692 |
|
61 |
+
| Intel/dpt-beit-large-512 | 7417.388 | 1795.882 | 426.546 |
|
62 |
+
| depth-anything-small | 1664.025 | 384.688 | 97.865 |
|
63 |
+
|
64 |
+
""")
|
65 |
+
st.markdown(""" """)
|
66 |
+
|
67 |
+
st.image("pages/Depth_Anything/image_2.jpg", use_column_width=True)
|
68 |
+
st.markdown(""" """)
|
69 |
+
|
70 |
+
st.markdown("""
|
71 |
+
You can use Depth Anything easily thanks to 🤗 Transformers with three lines of code! ✨
|
72 |
+
We have also built an app for you to [compare different depth estimation models](https://t.co/6uq4osdwWG) 🐝 🌸
|
73 |
+
See all the available Depth Anything checkpoints [here](https://t.co/Ex0IIyx7XC).
|
74 |
+
""")
|
75 |
+
st.markdown(""" """)
|
76 |
+
|
77 |
+
st.image("pages/Depth_Anything/image_3.jpg", use_column_width=True)
|
78 |
+
st.markdown(""" """)
|
79 |
+
|
80 |
+
st.info("""
|
81 |
+
Resources:
|
82 |
+
[Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891)
|
83 |
+
by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)
|
84 |
+
[GitHub](https://github.com/LiheYoung/Depth-Anything)
|
85 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/depth_anything)""", icon="📚")
|
86 |
+
|
87 |
+
|
88 |
+
st.markdown(""" """)
|
89 |
+
st.markdown(""" """)
|
90 |
+
st.markdown(""" """)
|
91 |
+
col1, col2, col3 = st.columns(3)
|
92 |
+
with col1:
|
93 |
+
if st.button('Previous paper', use_container_width=True):
|
94 |
+
switch_page("Backbone")
|
95 |
+
with col2:
|
96 |
+
if st.button('Home', use_container_width=True):
|
97 |
+
switch_page("Home")
|
98 |
+
with col3:
|
99 |
+
if st.button('Next paper', use_container_width=True):
|
100 |
+
switch_page("LLaVA-NeXT")
|
pages/9_LLaVA-NeXT.py
ADDED
@@ -0,0 +1,74 @@
1 |
+
import streamlit as st
|
2 |
+
from streamlit_extras.switch_page_button import switch_page
|
3 |
+
|
4 |
+
st.title("LLaVA-NeXT")
|
5 |
+
|
6 |
+
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1770832875551682563) (March 21, 2024)""", icon="ℹ️")
|
7 |
+
st.markdown(""" """)
|
8 |
+
|
9 |
+
st.markdown("""LLaVA-NeXT is recently merged to 🤗 Transformers and it outperforms many of the proprietary models like Gemini on various benchmarks!🤩
|
10 |
+
For those who don't know LLaVA, it's a language model that can also take image input 💬
|
11 |
+
Let's take a look: demo and more below.
|
12 |
+
""")
|
13 |
+
st.markdown(""" """)
|
14 |
+
|
15 |
+
st.image("pages/LLaVA-NeXT/image_1.jpeg", use_column_width=True)
|
16 |
+
st.markdown(""" """)
|
17 |
+
|
18 |
+
st.markdown("""
|
19 |
+
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
|
20 |
+
LLaVA 1.5 was released with Vicuna, but LLaVA NeXT (1.6) is released with four different LLMs:
|
21 |
+
- Nous-Hermes-Yi-34B
|
22 |
+
- Mistral-7B
|
23 |
+
- Vicuna 7B & 13B
|
24 |
+
""")
|
25 |
+
st.markdown(""" """)
|
26 |
+
|
27 |
+
st.image("pages/LLaVA-NeXT/image_2.jpeg", use_column_width=True)
|
28 |
+
st.markdown(""" """)
|
29 |
+
|
30 |
+
st.markdown("""
|
31 |
+
Thanks to Transformers integration, it is very easy to use LLaVA NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜
|
32 |
+
See below on standalone usage 👇
|
33 |
+
""")
|
34 |
+
st.markdown(""" """)
|
35 |
+
|
36 |
+
st.image("pages/LLaVA-NeXT/image_3.jpeg", use_column_width=True)
|
37 |
+
st.markdown(""" """)
|
38 |
+
|
39 |
+
st.markdown("""To fit large models and make it even faster and memory efficient, you can enable Flash Attention 2 and load model into 4-bit using bitsandbytes ⚡️ transformers makes it very easy to do this! See below 👇
|
40 |
+
""")
|
41 |
+
st.markdown(""" """)
|
42 |
+
|
43 |
+
st.image("pages/LLaVA-NeXT/image_4.jpeg", use_column_width=True)
|
44 |
+
st.markdown(""" """)
|
45 |
+
|
46 |
+
st.markdown("""If you want to try the code right away, here's the [notebook](https://t.co/NvoxvY9z1u).
|
47 |
+
Lastly, you can directly play with the LLaVA-NeXT based on Mistral-7B through the demo [here](https://t.co/JTDlqMUwEh) 🤗
|
48 |
+
""")
|
49 |
+
st.markdown(""" """)
|
50 |
+
|
51 |
+
st.video("pages/LLaVA-NeXT/video_1.mp4", format="video/mp4")
|
52 |
+
st.markdown(""" """)
|
53 |
+
|
54 |
+
st.info("""
|
55 |
+
Resources:
|
56 |
+
[LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
|
57 |
+
by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)
|
58 |
+
[GitHub](https://github.com/haotian-liu/LLaVA/tree/main)
|
59 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llava_next)""", icon="📚")
|
60 |
+
|
61 |
+
|
62 |
+
st.markdown(""" """)
|
63 |
+
st.markdown(""" """)
|
64 |
+
st.markdown(""" """)
|
65 |
+
col1, col2, col3 = st.columns(3)
|
66 |
+
with col1:
|
67 |
+
if st.button('Previous paper', use_container_width=True):
|
68 |
+
switch_page("Depth Anything")
|
69 |
+
with col2:
|
70 |
+
if st.button('Home', use_container_width=True):
|
71 |
+
switch_page("Home")
|
72 |
+
with col3:
|
73 |
+
if st.button('Next paper', use_container_width=True):
|
74 |
+
switch_page("Painter")
|
pages/Backbone/Backbone.md
ADDED
@@ -0,0 +1,31 @@
1 |
+
Many cutting-edge computer vision models consist of multiple stages:
|
2 |
+
➰ backbone extracts the features,
|
3 |
+
➰ neck refines the features,
|
4 |
+
➰ head makes the detection for the task.
|
5 |
+
Implementing this is cumbersome, so 🤗 transformers has an API for this: Backbone!
|
6 |
+
|
7 |
+
![image_1](image_1.jpg)
|
8 |
+
|
9 |
+
Let's see an example of such model.
|
10 |
+
Assuming we would like to initialize a multi-stage instance segmentation model with ResNet backbone and MaskFormer neck and a head, you can use the backbone API like following (left comments for clarity) 👇
|
11 |
+
|
12 |
+
![image_2](image_2.jpg)
|
13 |
+
|
14 |
+
One can also use a backbone just to get features from any stage. You can initialize any backbone with `AutoBackbone` class.
|
15 |
+
See below how to initialize a backbone and getting the feature maps at any stage 👇
|
16 |
+
|
17 |
+
![image_3](image_3.jpg)
|
18 |
+
|
19 |
+
Backbone API also supports any timm backbone of your choice! Check out a variation of timm backbones [here](https://t.co/Voiv0QCPB3).
|
20 |
+
|
21 |
+
![image_4](image_4.jpg)
|
22 |
+
|
23 |
+
Leaving some links 🔗:
|
24 |
+
📖 I've created a [notebook](https://t.co/PNfmBvdrtt) for you to play with it
|
25 |
+
📒 [Backbone API docs](https://t.co/Yi9F8qAigO)
|
26 |
+
📓 [AutoBackbone docs](https://t.co/PGo9oILHDw) 💜
|
27 |
+
(all written with love by me!)
|
28 |
+
|
29 |
+
|
30 |
+
> [!NOTE]
|
31 |
+
[Original tweet](https://twitter.com/mervenoyann/status/1749841426177810502) (January 23, 2024)
|
pages/Backbone/image_1.jpeg
ADDED
pages/Backbone/image_2.jpeg
ADDED
pages/Backbone/image_3.jpeg
ADDED
pages/Backbone/image_4.jpeg
ADDED
pages/Chameleon/Chameleon.md
ADDED
@@ -0,0 +1,43 @@
1 |
+
Chameleon 🦎 by Meta is now available in @huggingface transformers 😍
|
2 |
+
A multimodal model that comes in 7B and 34B sizes 🤩
|
3 |
+
But what makes this model so special? keep reading ⇣
|
4 |
+
|
5 |
+
![video_1](video_1.mp4)
|
6 |
+
|
7 |
+
[Demo](https://t.co/GsGE17fSdI) | [Models](https://t.co/cWUiVbsRz6)
|
8 |
+
Find below the API to load this model locally and use it ⬇️
|
9 |
+
|
10 |
+
![image_1](image_1.jpg)
|
11 |
+
|
12 |
+
Chameleon is a unique model: it attempts to scale early fusion 🤨 But what is early fusion?
|
13 |
+
Modern vision language models use a vision encoder with a projection layer to project image embeddings so it can be promptable to text decoder.
|
14 |
+
|
15 |
+
![image_2](image_2.jpg)
|
16 |
+
|
17 |
+
Early fusion on the other hand attempts to fuse all features together (image patches and text) by using an image tokenizer and all tokens are projected into a shared space, which enables seamless generation 😏
|
18 |
+
|
19 |
+
![image_3](image_3.jpg)
|
20 |
+
|
21 |
+
Authors have also introduced different architectural improvements (QK norm and revised placement of layer norms) for scalable and stable training. This way they were able to increase the token count (5x tokens compared to Llama 3, which is a must with early fusion IMO).
|
22 |
+
|
23 |
+
![image_4](image_4.jpg)
|
24 |
+
|
25 |
+
This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled to prevent malicious use.
|
26 |
+
|
27 |
+
![image_5](image_5.jpg)
|
28 |
+
|
29 |
+
One can also do text-only prompting, authors noted the model catches up with larger LLMs, and you can also see how it compares to VLMs with image-text prompting.
|
30 |
+
|
31 |
+
![image_6](image_6.jpg)
|
32 |
+
|
33 |
+
![image_7](image_7.jpg)
|
34 |
+
|
35 |
+
> [!TIP]
|
36 |
+
Resources:
|
37 |
+
[Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://arxiv.org/abs/2405.09818)
|
38 |
+
by Chameleon Team (2024)
|
39 |
+
[GitHub](https://github.com/facebookresearch/chameleon)
|
40 |
+
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/chameleon)
|
41 |
+
|
42 |
+
> [!NOTE]
|
43 |
+
[Original tweet](https://twitter.com/mervenoyann/status/1814278511785312320) (July 19, 2024)
|
pages/Chameleon/image_1.jpg
ADDED
pages/Chameleon/image_2.jpg
ADDED
pages/Chameleon/image_3.jpg
ADDED
pages/Chameleon/image_4.jpg
ADDED
pages/Chameleon/image_5.jpg
ADDED
pages/Chameleon/image_6.jpg
ADDED
pages/Chameleon/image_7.jpg
ADDED
pages/Chameleon/video_1.mp4
ADDED
Binary file (866 kB). View file
|
|
pages/CuMo/CuMo.md
ADDED
@@ -0,0 +1,24 @@
1 |
+
It's raining vision language models ☔️ CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part 🤓
|
2 |
+
|
3 |
+
![image_1](image_1.jpg)
|
4 |
+
|
5 |
+
The authors first pre-trained the MLP by freezing the image encoder and text decoder, then warmed up the whole network by unfreezing and fine-tuning, which they state stabilizes the visual instruction tuning when bringing in the experts.
|
6 |
+
|
7 |
+
![image_2](image_2.jpg)
|
8 |
+
|
9 |
+
The mixture of experts MLP blocks above are simply the same MLP blocks initialized from the single MLP that was trained during pre-training and fine-tuned in pre-finetuning 👇
|
10 |
+
|
11 |
+
![image_3](image_3.jpg)
|
12 |
+
|
13 |
+
It works very well (I also tested it myself): it outperforms the previous SotA of its size, LLaVA-NeXT! 😍 I wonder how it would compare to IDEFICS2-8B. You can try it yourself [here](https://t.co/MLIYKVh5Ee).
|
14 |
+
|
15 |
+
![image_4](image_4.jpg)
|
16 |
+
|
17 |
+
> [!TIP]
|
18 |
+
Resources:
|
19 |
+
[CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949)
|
20 |
+
by Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen (2024)
|
21 |
+
[GitHub](https://github.com/SHI-Labs/CuMo)
|
22 |
+
|
23 |
+
> [!NOTE]
|
24 |
+
[Original tweet](https://twitter.com/mervenoyann/status/1790665706205307191) (May 15, 2024)
|
pages/CuMo/image_1.jpg
ADDED
pages/CuMo/image_2.jpg
ADDED