shoubin committed
Commit 7e8784c
1 Parent(s): 0054d8c

upload_demo

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
.DS_Store ADDED
Binary file (10.2 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zst filter=lfs diff=lfs merge=lfs -text
34
  *tfevents* filter=lfs diff=lfs merge=lfs -text
35
  demo4.mp4 filter=lfs diff=lfs merge=lfs -text
36
+ videos/*.mp4 filter=lfs diff=lfs merge=lfs -text
LICENSE.txt ADDED
@@ -0,0 +1,14 @@
1
+ BSD 3-Clause License
2
+
3
+ Copyright (c) 2022 Salesforce, Inc.
4
+ All rights reserved.
5
+
6
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
7
+
8
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
9
+
10
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
11
+
12
+ 3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
13
+
14
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
MANIFEST.in ADDED
@@ -0,0 +1,7 @@
1
+ recursive-include lavis/configs *.yaml *.json
2
+ recursive-include lavis/projects *.yaml *.json
3
+
4
+ recursive-exclude lavis/datasets/download_scripts *
5
+ recursive-exclude lavis/output *
6
+
7
+ include requirements.txt
README.md CHANGED
@@ -1,13 +1,112 @@
1
- ---
2
- title: SeViLA
3
- emoji: 📉
4
- colorFrom: pink
5
- colorTo: yellow
6
- sdk: gradio
7
- sdk_version: 3.29.0
8
- app_file: app.py
9
- pinned: false
10
- license: openrail
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Self-Chained Image-Language Model for Video Localization and Question Answering
2
+
3
+ * Authors: [Shoubin Yu](https://yui010206.github.io/), [Jaemin Cho](https://j-min.io), [Prateek Yadav](https://prateek-yadav.github.io/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)
4
+ * [arXiv](https://arxiv.org/abs/2305.06988)
5
+ <img src="./assets/teaser.png" alt="teaser image" width="800"/>
6
+
7
+ <img src="./assets/model.png" alt="model overview image" width="800"/>
8
+
9
+ <img src="./assets/chain.png" alt="chain image" width="800"/>
10
+
11
+
12
+ # Code structure
13
+ ```bash
14
+
15
+ # Data & Data Preprocessing
16
+ ./sevila_data
17
+
18
+ # Pretrained Checkpoints
19
+ ./sevila_checkpoints
20
+
21
+ # SeViLA code
22
+ ./lavis/
23
+
24
+ # running scripts for SeViLA localizer/answerer training/inference
25
+ ./run_scripts
26
+
27
+ ```
28
+
29
+ # Setup
30
+
31
+ ## Install Dependencies
32
+
33
+ 1. (Optional) Create a conda environment
34
+
35
+ ```bash
36
+ conda create -n sevila python=3.8
37
+ conda activate sevila
38
+ ```
39
+
40
+ 2. Build from source
41
+
42
+ ```bash
43
+ pip install -e .
44
+ ```
45
+
46
+ ## Download Pretrained Models
47
+ We pre-train the SeViLA localizer on QVHighlights and host the checkpoint on [Huggingface](https://huggingface.co/Shoubin/SeViLA/resolve/main/sevila_pretrained.pth).
48
+ Download the checkpoint and put it under ./sevila_checkpoints.
49
+ The checkpoint (814.55M) contains the pre-trained localizer and the zero-shot answerer; a download sketch is shown below.
50
+
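+ For instance, a minimal sketch of fetching the checkpoint from the command line (assuming `wget` is available and the command is run from the repository root; the URL is the Huggingface link above):
+
+ ```bash
+ # create the expected checkpoint folder and download the pre-trained weights into it
+ mkdir -p sevila_checkpoints
+ wget -P sevila_checkpoints https://huggingface.co/Shoubin/SeViLA/resolve/main/sevila_pretrained.pth
+ ```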
51
+
52
+
53
+ # Dataset Preparation
54
+ We test our model on:
55
+ + [NExT-QA](https://doc-doc.github.io/docs/nextqa.html)
56
+
57
+ + [STAR](https://star.csail.mit.edu/)
58
+
59
+ + [How2QA](https://value-benchmark.github.io/index.html)
60
+
61
+ + [TVQA](https://tvqa.cs.unc.edu/)
62
+
63
+ + [VLEP](https://value-benchmark.github.io/index.html)
64
+
65
+ + [QVHighlights](https://github.com/jayleicn/moment_detr)
66
+
67
+ Please download the original data and preprocess it via our [scripts](sevila_data/) under ./sevila_data/.
68
+
69
+
70
+ # Training and Inference
71
+ We provide SeViLA training and inference script examples as follows:
72
+ ## 1) Localizer Pre-training
73
+ ```bash
74
+ sh run_scripts/sevila/pre-train/pretrain_qvh.sh
75
+ ```
76
+
77
+ ## 2) Localizer Self-refinement
78
+
79
+ ```bash
80
+ sh run_scripts/sevila/refinement/nextqa_sr.sh
81
+ ```
82
+
83
+ ## 3) Answerer Fine-tuning
84
+
85
+ ```bash
86
+ sh run_scripts/sevila/finetune/nextqa_ft.sh
87
+ ```
88
+
89
+ ## 4) Inference
90
+
91
+ ```bash
92
+ sh run_scripts/sevila/inference/nextqa_infer.sh
93
+ ```
94
+
95
+
96
+ # Acknowledgments
97
+ We thank the developers of [LAVIS](https://github.com/salesforce/LAVIS), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [CLIP](https://github.com/openai/CLIP), and [All-in-one](https://github.com/showlab/all-in-one) for their public code releases.
98
+
99
+
100
+ # Reference
101
+ Please cite our paper if you use our models in your work:
102
+
103
+
104
+ ```bibtex
105
+ @misc{yu2023selfchained,
106
+ title={Self-Chained Image-Language Model for Video Localization and Question Answering},
107
+ author={Shoubin Yu and Jaemin Cho and Prateek Yadav and Mohit Bansal},
108
+ year={2023},
109
+ eprint={2305.06988},
110
+ archivePrefix={arXiv},
111
+ primaryClass={cs.CV}
112
+ }
+ ```
app.py ADDED
@@ -0,0 +1,206 @@
1
+ import gradio as gr
2
+ import os
3
+ import torch
4
+ from torchvision import transforms
5
+ from lavis.processors import transforms_video
6
+ from lavis.datasets.data_utils import load_video_demo
7
+ from lavis.processors.blip_processors import ToUint8, ToTHWC
8
+ from lavis.models.sevila_models.sevila import SeViLA
9
+ from typing import Optional
10
+ import warnings
11
+ # model config
12
+ img_size = 224
13
+ num_query_token = 32
14
+ t5_model = 'google/flan-t5-xl'
15
+ drop_path_rate = 0
16
+ use_grad_checkpoint = False
17
+ vit_precision = "fp16"
18
+ freeze_vit = True
19
+ prompt = ''
20
+ max_txt_len = 77
21
+ answer_num = 5
22
+ apply_lemmatizer = False
23
+ task = 'freeze_loc_freeze_qa_vid'
24
+
25
+ # prompt
26
+ LOC_prompt = 'Does the information within the frame provide the necessary details to accurately answer the given question?'
27
+ QA_prompt = 'Considering the information presented in the frame, select the correct answer from the options.'
28
+
29
+ # processors config
30
+ mean = (0.48145466, 0.4578275, 0.40821073)
31
+ std = (0.26862954, 0.26130258, 0.27577711)
32
+ normalize = transforms.Normalize(mean, std)
33
+ image_size = img_size
34
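+ # video preprocessing: ToUint8 -> THWC -> float video tensor -> normalize with the mean/std above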
+ transform = transforms.Compose([ToUint8(), ToTHWC(), transforms_video.ToTensorVideo(), normalize])
35
+
36
+ print('model loading')
37
+ sevila = SeViLA(
38
+ img_size=img_size,
39
+ drop_path_rate=drop_path_rate,
40
+ use_grad_checkpoint=use_grad_checkpoint,
41
+ vit_precision=vit_precision,
42
+ freeze_vit=freeze_vit,
43
+ num_query_token=num_query_token,
44
+ t5_model=t5_model,
45
+ prompt=prompt,
46
+ max_txt_len=max_txt_len,
47
+ apply_lemmatizer=apply_lemmatizer,
48
+ frame_num=4,
49
+ answer_num=answer_num,
50
+ task=task,
51
+ )
52
+
53
+ sevila.load_checkpoint(url_or_filename='https://huggingface.co/Shoubin/SeViLA/resolve/main/sevila_pretrained.pth')
54
+ print('model loaded')
55
+
56
+ ANS_MAPPING = {0 : 'A', 1 : 'B', 2 : 'C', 3 : 'D', 4 : 'E'}
57
+
58
+ # os.mkdir('video')
59
+
60
+ def sevila_demo(video,
61
+ question,
62
+ option1, option2, option3,
63
+ video_frame_num,
64
+ keyframe_num):
65
+
66
+ if torch.cuda.is_available():
67
+ device = 0
68
+ else:
69
+ device = 'cpu'
70
+
71
+ global sevila
72
+ if device == "cpu":
73
+ sevila = sevila.float()
74
+ else:
75
+ sevila = sevila.to(int(device))
76
+
77
+ vpath = video
78
+ raw_clip, indice, fps, vlen = load_video_demo(
79
+ video_path=vpath,
80
+ n_frms=int(video_frame_num),
81
+ height=image_size,
82
+ width=image_size,
83
+ sampling="uniform",
84
+ clip_proposal=None
85
+ )
86
+ clip = transform(raw_clip.permute(1,0,2,3))
87
+ clip = clip.float().to(int(device))
88
+ clip = clip.unsqueeze(0)
89
+ # check
90
+ if option1[-1] != '.':
91
+ option1 += '.'
92
+ if option2[-1] != '.':
93
+ option2 += '.'
94
+ if option3[-1] != '.':
95
+ option3 += '.'
96
+ option_dict = {0:option1, 1:option2, 2:option3}
97
+ options = 'Option A:{} Option B:{} Option C:{}'.format(option1, option2, option3)
98
+ text_input_qa = 'Question: ' + question + ' ' + options + ' ' + QA_prompt
99
+ text_input_loc = 'Question: ' + question + ' ' + options + ' ' + LOC_prompt
100
+
101
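+ # generate_demo returns the predicted option index ('output_text') and the selected keyframe indices ('frame_idx')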
+ out = sevila.generate_demo(clip, text_input_qa, text_input_loc, int(keyframe_num))
102
+ # print(out)
103
+ answer_id = out['output_text'][0]
104
+ answer = option_dict[answer_id]
105
+ select_index = out['frame_idx'][0]
106
+ # images = []
107
+ keyframes = []
108
+ timestamps =[]
109
+
110
+ # print('raw_clip', len(raw_clip))
111
+ # for j in range(int(video_frame_num)):
112
+ # image = raw_clip[:, j, :, :].int()
113
+ # image = image.permute(1, 2, 0).numpy()
114
+ # images.append(image)
115
+
116
+ video_len = vlen/fps # seconds
117
+
118
+ for i in select_index:
119
+ image = raw_clip[:, i, :, :].int()
120
+ image = image.permute(1, 2, 0).numpy()
121
+ keyframes.append(image)
122
+ select_i = indice[i]
123
+ time = round((select_i / vlen) * video_len, 2)
124
+ timestamps.append(str(time)+'s')
125
+
126
+ gr.components.Gallery(keyframes)
127
+ #gr.components.Gallery(images)
128
+ timestamps_des = ''
129
+ for i in range(len(select_index)):
130
+ timestamps_des += 'Keyframe {}: {} \n'.format(str(i+1), timestamps[i])
131
+
132
+ return keyframes, timestamps_des, answer
133
+
134
+ with gr.Blocks(title="SeViLA demo") as demo:
135
+ description = """<p style="text-align: center; font-weight: bold;">
136
+ <span style="font-size: 28px">Self-Chained Image-Language Model for Video Localization and Question Answering</span>
137
+ <br>
138
+ <span style="font-size: 18px" id="author-info">
139
+ <a href="https://yui010206.github.io/" target="_blank">Shoubin Yu</a>,
140
+ <a href="https://j-min.io/" target="_blank">Jaemin Cho</a>,
141
+ <a href="https://prateek-yadav.github.io/" target="_blank">Prateek Yadav</a>,
142
+ <a href="https://www.cs.unc.edu/~mbansal/" target="_blank">Mohit Bansal</a>
143
+ </span>
144
+ <br>
145
+ <span style="font-size: 18px" id="paper-info">
146
+ [<a href="https://github.com/Yui010206/SeViLA" target="_blank">GitHub</a>]
147
+ [<a href="https://arxiv.org/abs/2305.06988" target="_blank">Paper</a>]
148
+ </span>
149
+ </p>
150
+ <p>
151
+ To locate keyframes in a video and answer a question, please:
152
+ <br>
153
+ (1) upload your video; (2) write your question/options and set # video frames / # keyframes; (3) click Locate and Answer!
154
+ <br>
155
+ Just a heads up - loading the SeViLA model can take a few minutes (typically 2-3), and running examples requires about 12GB of memory.
156
+ <br>
157
+ We've got you covered! We've provided some example videos and questions below to help you get started. Feel free to try out SeViLA with these!
158
+ </p>
159
+ """
160
+ gr.HTML(description)
161
+ with gr.Row():
162
+ with gr.Column(scale=1, min_width=600):
163
+ video = gr.Video(label='Video')
164
+ question = gr.Textbox(placeholder="Why did the two ladies put their hands above their eyes while staring out?", label='Question')
165
+ with gr.Row():
166
+ option1 = gr.Textbox(placeholder="practicing cheer", label='Option 1')
167
+ option2 = gr.Textbox(placeholder="posing for photo", label='Option 2')
168
+ option3 = gr.Textbox(placeholder="to see better", label='Option 3')
169
+ video_frame_num = gr.Textbox(placeholder=32, label='# Video Frame')
170
+ keyframe_num = gr.Textbox(placeholder=4, label='# Keyframe')
171
+ # device = gr.Textbox(placeholder=0, label='Device')
172
+ gen_btn = gr.Button(value='Locate and Answer!')
173
+ with gr.Column(scale=2, min_width=600):
174
+ keyframes = gr.Gallery(
175
+ label="Keyframes", show_label=False, elem_id="gallery"
176
+ ).style(columns=[4], rows=[1], object_fit="contain", height="auto")
177
+ #keyframes = gr.Gallery(label='Keyframes')
178
+ timestamps = gr.outputs.Textbox(label="Keyframe Timestamps")
179
+ answer = gr.outputs.Textbox(label="Output Answer")
180
+
181
+ gen_btn.click(
182
+ sevila_demo,
183
+ inputs=[video, question, option1, option2, option3, video_frame_num, keyframe_num],
184
+ outputs=[keyframes, timestamps, answer],
185
+ queue=True
186
+ )
187
+ #demo = gr.Interface(sevila_demo,
188
+ # inputs=[gr.Video(), question, option1, option2, option3, video_frame_num, keyframe_num, device],
189
+ # outputs=['gallery', timestamps, answer],
190
+ # examples=[['videos/demo1.mp4', 'Why did the two ladies put their hands above their eyes while staring out?', 'practicing cheer.', 'play ball.', 'to see better.', 32, 4, 0],
191
+ # ['videos/demo2.mp4', 'What did both of them do after completing skiing?', 'jump and pose.' , 'bend down.','raised their hands.', 32, 4, 0],
192
+ # ['videos/demo3.mp4', 'What room was Wilson breaking into when House found him?', 'the kitchen.' , 'the dining room.','the bathroom.', 32, 4, 0]]
193
+ # )
194
+ with gr.Column():
195
+ gr.Examples(
196
+ inputs=[video, question, option1, option2, option3, video_frame_num, keyframe_num],
197
+ outputs=[keyframes, timestamps, answer],
198
+ fn=sevila_demo,
199
+ examples=[['videos/demo1.mp4', 'Why did the two ladies put their hands above their eyes while staring out?', 'practicing cheer', 'play ball', 'to see better', 32, 4],
200
+ ['videos/demo2.mp4', 'What did both of them do after completing skiing?', 'jump and pose' , 'bend down','raised their hands', 32, 4],
201
+ ['videos/demo3.mp4', 'What room was Wilson breaking into when House found him?', 'the kitchen' , 'the dining room','the bathroom', 32, 4],
202
+ ['videos/demo4.mp4', 'what kind of bird is it?', 'chickadee' , 'eagle','seagull', 32, 1]],
203
+ cache_examples=False,
204
+ )
205
+ demo.queue(concurrency_count=1, api_open=False)
206
+ demo.launch(share=False)
app/__init__.py ADDED
@@ -0,0 +1,26 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+
11
+ import streamlit as st
12
+ import torch
13
+
14
+
15
+ @st.cache()
16
+ def load_demo_image():
17
+ img_url = (
18
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
19
+ )
20
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
21
+ return raw_image
22
+
23
+
24
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
25
+
26
+ cache_root = "/export/home/.cache/lavis/"
app/calculate_coco_features.py ADDED
@@ -0,0 +1,87 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+ import torch
11
+
12
+ import os
13
+
14
+ from lavis.common.registry import registry
15
+ from lavis.processors import *
16
+ from lavis.models import *
17
+ from lavis.common.utils import build_default_model
18
+
19
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
20
+
21
+
22
+ def load_demo_image():
23
+ img_url = (
24
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
25
+ )
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+
28
+ return raw_image
29
+
30
+
31
+ def read_img(filepath):
32
+ raw_image = Image.open(filepath).convert("RGB")
33
+
34
+ return raw_image
35
+
36
+
37
+ # model
38
+ model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
39
+ feature_extractor = BlipFeatureExtractor(pretrained=model_url)
40
+
41
+ feature_extractor.eval()
42
+ feature_extractor = feature_extractor.to(device)
43
+
44
+ # preprocessors
45
+ vis_processor = BlipImageEvalProcessor(image_size=224)
46
+ text_processor = BlipCaptionProcessor()
47
+
48
+ # files to process
49
+ # file_root = "/export/home/.cache/lavis/coco/images/val2014"
50
+ file_root = "/export/home/.cache/lavis/coco/images/train2014"
51
+ filepaths = os.listdir(file_root)
52
+
53
+ print(len(filepaths))
54
+
55
+ caption = "dummy"
56
+
57
+ path2feat = dict()
58
+ bsz = 256
59
+
60
+ images_in_batch = []
61
+ filepaths_in_batch = []
62
+
63
+ for i, filename in enumerate(filepaths):
64
+ if i % bsz == 0 and i > 0:
65
+ images_in_batch = torch.cat(images_in_batch, dim=0).to(device)
66
+ with torch.no_grad():
67
+ image_features = feature_extractor(
68
+ images_in_batch, caption, mode="image", normalized=True
69
+ )[:, 0]
70
+
71
+ for filepath, image_feat in zip(filepaths_in_batch, image_features):
72
+ path2feat[os.path.basename(filepath)] = image_feat.detach().cpu()
73
+
74
+ images_in_batch = []
75
+ filepaths_in_batch = []
76
+
77
+ print(len(path2feat), image_features.shape)
78
+ else:
79
+ filepath = os.path.join(file_root, filename)
80
+
81
+ image = read_img(filepath)
82
+ image = vis_processor(image).unsqueeze(0)
83
+
84
+ images_in_batch.append(image)
85
+ filepaths_in_batch.append(filepath)
86
+
87
+ torch.save(path2feat, "path2feat_coco_train2014.pth")
app/caption.py ADDED
@@ -0,0 +1,98 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import streamlit as st
9
+ from app import device, load_demo_image
10
+ from app.utils import load_model_cache
11
+ from lavis.processors import load_processor
12
+ from PIL import Image
13
+
14
+
15
+ def app():
16
+ # ===== layout =====
17
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
18
+
19
+ sampling_method = st.sidebar.selectbox(
20
+ "Sampling method:", ["Beam search", "Nucleus sampling"]
21
+ )
22
+
23
+ st.markdown(
24
+ "<h1 style='text-align: center;'>Image Description Generation</h1>",
25
+ unsafe_allow_html=True,
26
+ )
27
+
28
+ instructions = """Try the provided image or upload your own:"""
29
+ file = st.file_uploader(instructions)
30
+
31
+ use_beam = sampling_method == "Beam search"
32
+
33
+ col1, col2 = st.columns(2)
34
+
35
+ if file:
36
+ raw_img = Image.open(file).convert("RGB")
37
+ else:
38
+ raw_img = load_demo_image()
39
+
40
+ col1.header("Image")
41
+
42
+ w, h = raw_img.size
43
+ scaling_factor = 720 / w
44
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
45
+
46
+ col1.image(resized_image, use_column_width=True)
47
+ col2.header("Description")
48
+
49
+ cap_button = st.button("Generate")
50
+
51
+ # ==== event ====
52
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
53
+
54
+ if cap_button:
55
+ if model_type.startswith("BLIP"):
56
+ blip_type = model_type.split("_")[1].lower()
57
+ model = load_model_cache(
58
+ "blip_caption",
59
+ model_type=f"{blip_type}_coco",
60
+ is_eval=True,
61
+ device=device,
62
+ )
63
+
64
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
65
+ captions = generate_caption(
66
+ model=model, image=img, use_nucleus_sampling=not use_beam
67
+ )
68
+
69
+ col2.write("\n\n".join(captions), use_column_width=True)
70
+
71
+
72
+ def generate_caption(
73
+ model, image, use_nucleus_sampling=False, num_beams=3, max_length=40, min_length=5
74
+ ):
75
+ samples = {"image": image}
76
+
77
+ captions = []
78
+ if use_nucleus_sampling:
79
+ for _ in range(5):
80
+ caption = model.generate(
81
+ samples,
82
+ use_nucleus_sampling=True,
83
+ max_length=max_length,
84
+ min_length=min_length,
85
+ top_p=0.9,
86
+ )
87
+ captions.append(caption[0])
88
+ else:
89
+ caption = model.generate(
90
+ samples,
91
+ use_nucleus_sampling=False,
92
+ num_beams=num_beams,
93
+ max_length=max_length,
94
+ min_length=min_length,
95
+ )
96
+ captions.append(caption[0])
97
+
98
+ return captions
app/classification.py ADDED
@@ -0,0 +1,216 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import plotly.graph_objects as go
9
+ import requests
10
+ import streamlit as st
11
+ import torch
12
+ from lavis.models import load_model
13
+ from lavis.processors import load_processor
14
+ from lavis.processors.blip_processors import BlipCaptionProcessor
15
+ from PIL import Image
16
+
17
+ from app import device, load_demo_image
18
+ from app.utils import load_blip_itm_model
19
+ from lavis.processors.clip_processors import ClipImageEvalProcessor
20
+
21
+
22
+ @st.cache()
23
+ def load_demo_image(img_url=None):
24
+ if not img_url:
25
+ img_url = "https://img.atlasobscura.com/yDJ86L8Ou6aIjBsxnlAy5f164w1rjTgcHZcx2yUs4mo/rt:fit/w:1200/q:81/sm:1/scp:1/ar:1/aHR0cHM6Ly9hdGxh/cy1kZXYuczMuYW1h/em9uYXdzLmNvbS91/cGxvYWRzL3BsYWNl/X2ltYWdlcy85MDll/MDRjOS00NTJjLTQx/NzQtYTY4MS02NmQw/MzI2YWIzNjk1ZGVk/MGZhMTJiMTM5MmZi/NGFfUmVhcl92aWV3/X29mX3RoZV9NZXJs/aW9uX3N0YXR1ZV9h/dF9NZXJsaW9uX1Bh/cmssX1NpbmdhcG9y/ZSxfd2l0aF9NYXJp/bmFfQmF5X1NhbmRz/X2luX3RoZV9kaXN0/YW5jZV8tXzIwMTQw/MzA3LmpwZw.jpg"
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+ return raw_image
28
+
29
+
30
+ @st.cache(
31
+ hash_funcs={
32
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
33
+ .cpu()
34
+ .numpy()
35
+ },
36
+ allow_output_mutation=True,
37
+ )
38
+ def load_model_cache(model_type, device):
39
+ if model_type == "blip":
40
+ model = load_model(
41
+ "blip_feature_extractor", model_type="base", is_eval=True, device=device
42
+ )
43
+ elif model_type == "albef":
44
+ model = load_model(
45
+ "albef_feature_extractor", model_type="base", is_eval=True, device=device
46
+ )
47
+ elif model_type == "CLIP_ViT-B-32":
48
+ model = load_model(
49
+ "clip_feature_extractor", "ViT-B-32", is_eval=True, device=device
50
+ )
51
+ elif model_type == "CLIP_ViT-B-16":
52
+ model = load_model(
53
+ "clip_feature_extractor", "ViT-B-16", is_eval=True, device=device
54
+ )
55
+ elif model_type == "CLIP_ViT-L-14":
56
+ model = load_model(
57
+ "clip_feature_extractor", "ViT-L-14", is_eval=True, device=device
58
+ )
59
+
60
+ return model
61
+
62
+
63
+ def app():
64
+ model_type = st.sidebar.selectbox(
65
+ "Model:",
66
+ ["ALBEF", "BLIP_Base", "CLIP_ViT-B-32", "CLIP_ViT-B-16", "CLIP_ViT-L-14"],
67
+ )
68
+ score_type = st.sidebar.selectbox("Score type:", ["Cosine", "Multimodal"])
69
+
70
+ # ===== layout =====
71
+ st.markdown(
72
+ "<h1 style='text-align: center;'>Zero-shot Classification</h1>",
73
+ unsafe_allow_html=True,
74
+ )
75
+
76
+ instructions = """Try the provided image or upload your own:"""
77
+ file = st.file_uploader(instructions)
78
+
79
+ st.header("Image")
80
+ if file:
81
+ raw_img = Image.open(file).convert("RGB")
82
+ else:
83
+ raw_img = load_demo_image()
84
+
85
+ st.image(raw_img) # , use_column_width=True)
86
+
87
+ col1, col2 = st.columns(2)
88
+
89
+ col1.header("Categories")
90
+
91
+ cls_0 = col1.text_input("category 1", value="merlion")
92
+ cls_1 = col1.text_input("category 2", value="sky")
93
+ cls_2 = col1.text_input("category 3", value="giraffe")
94
+ cls_3 = col1.text_input("category 4", value="fountain")
95
+ cls_4 = col1.text_input("category 5", value="marina bay")
96
+
97
+ cls_names = [cls_0, cls_1, cls_2, cls_3, cls_4]
98
+ cls_names = [cls_nm for cls_nm in cls_names if len(cls_nm) > 0]
99
+
100
+ if len(cls_names) != len(set(cls_names)):
101
+ st.error("Please provide unique class names")
102
+ return
103
+
104
+ button = st.button("Submit")
105
+
106
+ col2.header("Prediction")
107
+
108
+ # ===== event =====
109
+
110
+ if button:
111
+ if model_type.startswith("BLIP"):
112
+ text_processor = BlipCaptionProcessor(prompt="A picture of ")
113
+ cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
114
+
115
+ if score_type == "Cosine":
116
+ vis_processor = load_processor("blip_image_eval").build(image_size=224)
117
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
118
+
119
+ feature_extractor = load_model_cache(model_type="blip", device=device)
120
+
121
+ sample = {"image": img, "text_input": cls_prompt}
122
+
123
+ with torch.no_grad():
124
+ image_features = feature_extractor.extract_features(
125
+ sample, mode="image"
126
+ ).image_embeds_proj[:, 0]
127
+ text_features = feature_extractor.extract_features(
128
+ sample, mode="text"
129
+ ).text_embeds_proj[:, 0]
130
+ sims = (image_features @ text_features.t())[
131
+ 0
132
+ ] / feature_extractor.temp
133
+
134
+ else:
135
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
136
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
137
+
138
+ model = load_blip_itm_model(device)
139
+
140
+ output = model(img, cls_prompt, match_head="itm")
141
+ sims = output[:, 1]
142
+
143
+ sims = torch.nn.Softmax(dim=0)(sims)
144
+ inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
145
+
146
+ elif model_type.startswith("ALBEF"):
147
+ vis_processor = load_processor("blip_image_eval").build(image_size=224)
148
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
149
+
150
+ text_processor = BlipCaptionProcessor(prompt="A picture of ")
151
+ cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
152
+
153
+ feature_extractor = load_model_cache(model_type="albef", device=device)
154
+
155
+ sample = {"image": img, "text_input": cls_prompt}
156
+
157
+ with torch.no_grad():
158
+ image_features = feature_extractor.extract_features(
159
+ sample, mode="image"
160
+ ).image_embeds_proj[:, 0]
161
+ text_features = feature_extractor.extract_features(
162
+ sample, mode="text"
163
+ ).text_embeds_proj[:, 0]
164
+
165
+ st.write(image_features.shape)
166
+ st.write(text_features.shape)
167
+
168
+ sims = (image_features @ text_features.t())[0] / feature_extractor.temp
169
+
170
+ sims = torch.nn.Softmax(dim=0)(sims)
171
+ inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
172
+
173
+ elif model_type.startswith("CLIP"):
174
+ if model_type == "CLIP_ViT-B-32":
175
+ model = load_model_cache(model_type="CLIP_ViT-B-32", device=device)
176
+ elif model_type == "CLIP_ViT-B-16":
177
+ model = load_model_cache(model_type="CLIP_ViT-B-16", device=device)
178
+ elif model_type == "CLIP_ViT-L-14":
179
+ model = load_model_cache(model_type="CLIP_ViT-L-14", device=device)
180
+ else:
181
+ raise ValueError(f"Unknown model type {model_type}")
182
+
183
+ if score_type == "Cosine":
184
+ # image_preprocess = ClipImageEvalProcessor(image_size=336)
185
+ image_preprocess = ClipImageEvalProcessor(image_size=224)
186
+ img = image_preprocess(raw_img).unsqueeze(0).to(device)
187
+
188
+ sample = {"image": img, "text_input": cls_names}
189
+
190
+ with torch.no_grad():
191
+ clip_features = model.extract_features(sample)
192
+
193
+ image_features = clip_features.image_embeds_proj
194
+ text_features = clip_features.text_embeds_proj
195
+
196
+ sims = (100.0 * image_features @ text_features.T)[0].softmax(dim=-1)
197
+ inv_sims = sims.tolist()[::-1]
198
+ else:
199
+ st.warning("CLIP does not support multimodal scoring.")
200
+ return
201
+
202
+ fig = go.Figure(
203
+ go.Bar(
204
+ x=inv_sims,
205
+ y=cls_names[::-1],
206
+ text=["{:.2f}".format(s) for s in inv_sims],
207
+ orientation="h",
208
+ )
209
+ )
210
+ fig.update_traces(
211
+ textfont_size=12,
212
+ textangle=0,
213
+ textposition="outside",
214
+ cliponaxis=False,
215
+ )
216
+ col2.plotly_chart(fig, use_container_width=True)
app/dataset_browser.py ADDED
@@ -0,0 +1,240 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import random
9
+ from collections import OrderedDict
10
+ from functools import reduce
11
+ from tkinter import N
12
+
13
+ import streamlit as st
14
+ from lavis.common.registry import registry
15
+ from lavis.datasets.builders import dataset_zoo, load_dataset
16
+ from lavis.datasets.builders.base_dataset_builder import load_dataset_config
17
+ from PIL import Image
18
+
19
+ IMAGE_LAYOUT = 3, 4
20
+ VIDEO_LAYOUT = 1, 2
21
+
22
+ PREV_STR = "Prev"
23
+ NEXT_STR = "Next"
24
+
25
+
26
+ def sample_dataset(dataset, indices):
27
+ samples = [dataset.displ_item(idx) for idx in indices]
28
+
29
+ return samples
30
+
31
+
32
+ def get_concat_v(im1, im2):
33
+ margin = 5
34
+
35
+ canvas_size = (im1.width + im2.width + margin, max(im1.height, im2.height))
36
+ canvas = Image.new("RGB", canvas_size, "White")
37
+ canvas.paste(im1, (0, 0))
38
+ canvas.paste(im2, (im1.width + margin, 0))
39
+
40
+ return canvas
41
+
42
+
43
+ def resize_img_w(raw_img, new_w=224):
44
+ if isinstance(raw_img, list):
45
+ resized_imgs = [resize_img_w(img, 196) for img in raw_img]
46
+ # concatenate images
47
+ resized_image = reduce(get_concat_v, resized_imgs)
48
+ else:
49
+ w, h = raw_img.size
50
+ scaling_factor = new_w / w
51
+ resized_image = raw_img.resize(
52
+ (int(w * scaling_factor), int(h * scaling_factor))
53
+ )
54
+
55
+ return resized_image
56
+
57
+
58
+ def get_visual_key(dataset):
59
+ if "image" in dataset[0]:
60
+ return "image"
61
+ elif "image0" in dataset[0]: # NLVR2 dataset
62
+ return "image"
63
+ elif "video" in dataset[0]:
64
+ return "video"
65
+ else:
66
+ raise ValueError("Visual key not found.")
67
+
68
+
69
+ def gather_items(samples, exclude=[]):
70
+ gathered = []
71
+
72
+ for s in samples:
73
+ ns = OrderedDict()
74
+ for k in s.keys():
75
+ if k not in exclude:
76
+ ns[k] = s[k]
77
+
78
+ gathered.append(ns)
79
+
80
+ return gathered
81
+
82
+
83
+ @st.cache(allow_output_mutation=True)
84
+ def load_dataset_cache(name):
85
+ return load_dataset(name)
86
+
87
+
88
+ def format_text(text):
89
+ md = "\n\n".join([f"**{k}**: {v}" for k, v in text.items()])
90
+
91
+ return md
92
+
93
+
94
+ def show_samples(dataset, offset=0, is_next=False):
95
+ visual_key = get_visual_key(dataset)
96
+
97
+ num_rows, num_cols = IMAGE_LAYOUT if visual_key == "image" else VIDEO_LAYOUT
98
+ n_samples = num_rows * num_cols
99
+
100
+ if not shuffle:
101
+ if is_next:
102
+ start = min(int(start_idx) + offset + n_samples, len(dataset) - n_samples)
103
+ else:
104
+ start = max(0, int(start_idx) + offset - n_samples)
105
+
106
+ st.session_state.last_start = start
107
+ end = min(start + n_samples, len(dataset))
108
+
109
+ indices = list(range(start, end))
110
+ else:
111
+ indices = random.sample(range(len(dataset)), n_samples)
112
+ samples = sample_dataset(dataset, indices)
113
+
114
+ visual_info = (
115
+ iter([resize_img_w(s[visual_key]) for s in samples])
116
+ if visual_key == "image"
117
+ # else iter([s[visual_key] for s in samples])
118
+ else iter([s["file"] for s in samples])
119
+ )
120
+ text_info = gather_items(samples, exclude=["image", "video"])
121
+ text_info = iter([format_text(s) for s in text_info])
122
+
123
+ st.markdown(
124
+ """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
125
+ unsafe_allow_html=True,
126
+ )
127
+ for _ in range(num_rows):
128
+ with st.container():
129
+ for col in st.columns(num_cols):
130
+ # col.text(next(text_info))
131
+ # col.caption(next(text_info))
132
+ try:
133
+ col.markdown(next(text_info))
134
+ if visual_key == "image":
135
+ col.image(next(visual_info), use_column_width=True, clamp=True)
136
+ elif visual_key == "video":
137
+ col.markdown(
138
+ "![Alt Text](https://media.giphy.com/media/vFKqnCdLPNOKc/giphy.gif)"
139
+ )
140
+ except StopIteration:
141
+ break
142
+
143
+ st.markdown(
144
+ """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
145
+ unsafe_allow_html=True,
146
+ )
147
+
148
+ st.session_state.n_display = n_samples
149
+
150
+
151
+ if __name__ == "__main__":
152
+ st.set_page_config(
153
+ page_title="LAVIS Dataset Explorer",
154
+ # layout="wide",
155
+ initial_sidebar_state="expanded",
156
+ )
157
+
158
+ dataset_name = st.sidebar.selectbox("Dataset:", dataset_zoo.get_names())
159
+
160
+ function = st.sidebar.selectbox("Function:", ["Browser"], index=0)
161
+
162
+ if function == "Browser":
163
+ shuffle = st.sidebar.selectbox("Shuffled:", [True, False], index=0)
164
+
165
+ dataset = load_dataset_cache(dataset_name)
166
+ split = st.sidebar.selectbox("Split:", dataset.keys())
167
+
168
+ dataset_len = len(dataset[split])
169
+ st.success(
170
+ f"Loaded {dataset_name}/{split} with **{dataset_len}** records. **Image/video directory**: {dataset[split].vis_root}"
171
+ )
172
+
173
+ if "last_dataset" not in st.session_state:
174
+ st.session_state.last_dataset = dataset_name
175
+ st.session_state.last_split = split
176
+
177
+ if "last_start" not in st.session_state:
178
+ st.session_state.last_start = 0
179
+
180
+ if "start_idx" not in st.session_state:
181
+ st.session_state.start_idx = 0
182
+
183
+ if "shuffle" not in st.session_state:
184
+ st.session_state.shuffle = shuffle
185
+
186
+ if "first_run" not in st.session_state:
187
+ st.session_state.first_run = True
188
+ elif (
189
+ st.session_state.last_dataset != dataset_name
190
+ or st.session_state.last_split != split
191
+ ):
192
+ st.session_state.first_run = True
193
+
194
+ st.session_state.last_dataset = dataset_name
195
+ st.session_state.last_split = split
196
+ elif st.session_state.shuffle != shuffle:
197
+ st.session_state.shuffle = shuffle
198
+ st.session_state.first_run = True
199
+
200
+ if not shuffle:
201
+ n_col, p_col = st.columns([0.05, 1])
202
+
203
+ prev_button = n_col.button(PREV_STR)
204
+ next_button = p_col.button(NEXT_STR)
205
+
206
+ else:
207
+ next_button = st.button(NEXT_STR)
208
+
209
+ if not shuffle:
210
+ start_idx = st.sidebar.text_input(f"Begin from (total {dataset_len})", 0)
211
+
212
+ if not start_idx.isdigit():
213
+ st.error(f"Input to 'Begin from' must be digits, found {start_idx}.")
214
+ else:
215
+ if int(start_idx) != st.session_state.start_idx:
216
+ st.session_state.start_idx = int(start_idx)
217
+ st.session_state.last_start = int(start_idx)
218
+
219
+ if prev_button:
220
+ show_samples(
221
+ dataset[split],
222
+ offset=st.session_state.last_start - st.session_state.start_idx,
223
+ is_next=False,
224
+ )
225
+
226
+ if next_button:
227
+ show_samples(
228
+ dataset[split],
229
+ offset=st.session_state.last_start - st.session_state.start_idx,
230
+ is_next=True,
231
+ )
232
+
233
+ if st.session_state.first_run:
234
+ st.session_state.first_run = False
235
+
236
+ show_samples(
237
+ dataset[split],
238
+ offset=st.session_state.last_start - st.session_state.start_idx,
239
+ is_next=True,
240
+ )
app/image_text_match.py ADDED
@@ -0,0 +1,87 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import numpy as np
9
+ import streamlit as st
10
+ import torch
11
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
12
+ from lavis.processors import load_processor
13
+ from PIL import Image
14
+
15
+ from app import device, load_demo_image
16
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
17
+
18
+
19
+ def app():
20
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
21
+
22
+ if model_type.startswith("BLIP"):
23
+ blip_type = model_type.split("_")[1]
24
+ model = load_blip_itm_model(device, model_type=blip_type)
25
+
26
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
27
+
28
+ st.markdown(
29
+ "<h1 style='text-align: center;'>Image Text Matching</h1>",
30
+ unsafe_allow_html=True,
31
+ )
32
+
33
+ values = list(range(1, 12))
34
+ default_layer_num = values.index(7)
35
+ layer_num = (
36
+ st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
37
+ )
38
+
39
+ instructions = """Try the provided image or upload your own:"""
40
+ file = st.file_uploader(instructions)
41
+
42
+ col1, col2 = st.columns(2)
43
+ col1.header("Image")
44
+ col2.header("GradCam")
45
+ if file:
46
+ raw_img = Image.open(file).convert("RGB")
47
+ else:
48
+ raw_img = load_demo_image()
49
+
50
+ w, h = raw_img.size
51
+ scaling_factor = 720 / w
52
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
53
+ col1.image(resized_image, use_column_width=True)
54
+
55
+ col3, col4 = st.columns(2)
56
+ col3.header("Text")
57
+ user_question = col3.text_input(
58
+ "Input your sentence!", "a woman sitting on the beach with a dog"
59
+ )
60
+ submit_button = col3.button("Submit")
61
+
62
+ col4.header("Matching score")
63
+
64
+ if submit_button:
65
+ tokenizer = init_bert_tokenizer()
66
+
67
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
68
+ text_processor = load_processor("blip_caption").build()
69
+
70
+ qry = text_processor(user_question)
71
+
72
+ norm_img = np.float32(resized_image) / 255
73
+
74
+ qry_tok = tokenizer(qry, return_tensors="pt").to(device)
75
+ gradcam, output = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
76
+
77
+ avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
78
+
79
+ col2.image(avg_gradcam, use_column_width=True, clamp=True)
80
+ # output = model(img, question)
81
+ itm_score = torch.nn.functional.softmax(output, dim=1)
82
+ new_title = (
83
+ '<p style="text-align: left; font-size: 25px;">\n{:.3f}%</p>'.format(
84
+ itm_score[0][1].item() * 100
85
+ )
86
+ )
87
+ col4.markdown(new_title, unsafe_allow_html=True)
app/main.py ADDED
@@ -0,0 +1,25 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from app.multipage import MultiPage
9
+ from app import vqa, caption
10
+ from app import image_text_match as itm
11
+ from app import text_localization as tl
12
+ from app import multimodal_search as ms
13
+ from app import classification as cl
14
+
15
+
16
+ if __name__ == "__main__":
17
+ app = MultiPage()
18
+
19
+ app.add_page("Image Description Generation", caption.app)
20
+ app.add_page("Multimodal Search", ms.app)
21
+ app.add_page("Visual Question Answering", vqa.app)
22
+ app.add_page("Image Text Matching", itm.app)
23
+ app.add_page("Text Localization", tl.app)
24
+ app.add_page("Classification", cl.app)
25
+ app.run()
app/multimodal_search.py ADDED
@@ -0,0 +1,230 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import os
9
+
10
+ import numpy as np
11
+ import streamlit as st
12
+ import torch
13
+ import torch.nn.functional as F
14
+ from app import cache_root, device
15
+ from app.utils import (
16
+ getAttMap,
17
+ init_bert_tokenizer,
18
+ load_blip_itm_model,
19
+ read_img,
20
+ resize_img,
21
+ )
22
+ from lavis.models import load_model
23
+ from lavis.processors import load_processor
24
+
25
+
26
+ @st.cache(
27
+ hash_funcs={
28
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
29
+ .cpu()
30
+ .numpy()
31
+ },
32
+ allow_output_mutation=True,
33
+ )
34
+ def load_feat():
35
+ from lavis.common.utils import download_url
36
+
37
+ dirname = os.path.join(os.path.dirname(__file__), "assets")
38
+ filename = "path2feat_coco_train2014.pth"
39
+ filepath = os.path.join(dirname, filename)
40
+ url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/path2feat_coco_train2014.pth"
41
+
42
+ if not os.path.exists(filepath):
43
+ download_url(url=url, root=dirname, filename="path2feat_coco_train2014.pth")
44
+
45
+ path2feat = torch.load(filepath)
46
+ paths = sorted(path2feat.keys())
47
+
48
+ all_img_feats = torch.stack([path2feat[k] for k in paths], dim=0).to(device)
49
+
50
+ return path2feat, paths, all_img_feats
51
+
52
+
53
+ @st.cache(
54
+ hash_funcs={
55
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
56
+ .cpu()
57
+ .numpy()
58
+ },
59
+ allow_output_mutation=True,
60
+ )
61
+ def load_feature_extractor_model(device):
62
+ model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
63
+
64
+ model = load_model(
65
+ "blip_feature_extractor", model_type="base", is_eval=True, device=device
66
+ )
67
+ model.load_from_pretrained(model_url)
68
+
69
+ return model
70
+
71
+
72
+ def app():
73
+ # === layout ===
74
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
75
+ file_root = os.path.join(cache_root, "coco/images/train2014/")
76
+
77
+ values = [12, 24, 48]
78
+ default_layer_num = values.index(24)
79
+ num_display = st.sidebar.selectbox(
80
+ "Number of images:", values, index=default_layer_num
81
+ )
82
+ show_gradcam = st.sidebar.selectbox("Show GradCam:", [True, False], index=1)
83
+ itm_ranking = st.sidebar.selectbox("Multimodal re-ranking:", [True, False], index=0)
84
+
85
+ # st.title('Multimodal Search')
86
+ st.markdown(
87
+ "<h1 style='text-align: center;'>Multimodal Search</h1>", unsafe_allow_html=True
88
+ )
89
+
90
+ # === event ===
91
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
92
+ text_processor = load_processor("blip_caption")
93
+
94
+ user_question = st.text_input(
95
+ "Search query", "A dog running on the grass.", help="Type something to search."
96
+ )
97
+ user_question = text_processor(user_question)
98
+ feature_extractor = load_feature_extractor_model(device)
99
+
100
+ # ======= ITC =========
101
+ sample = {"text_input": user_question}
102
+
103
+ with torch.no_grad():
104
+ text_feature = feature_extractor.extract_features(
105
+ sample, mode="text"
106
+ ).text_embeds_proj[0, 0]
107
+
108
+ path2feat, paths, all_img_feats = load_feat()
109
+ all_img_feats.to(device)
110
+ all_img_feats = F.normalize(all_img_feats, dim=1)
111
+
112
+ num_cols = 4
113
+ num_rows = int(num_display / num_cols)
114
+
115
+ similarities = text_feature @ all_img_feats.T
116
+ indices = torch.argsort(similarities, descending=True)[:num_display]
117
+
118
+ top_paths = [paths[ind.detach().cpu().item()] for ind in indices]
119
+ sorted_similarities = [similarities[idx] for idx in indices]
120
+ filenames = [os.path.join(file_root, p) for p in top_paths]
121
+
122
+ # ========= ITM and GradCam ==========
123
+ bsz = 4 # max number of images to avoid cuda oom
124
+ if model_type.startswith("BLIP"):
125
+ blip_type = model_type.split("_")[1]
126
+
127
+ itm_model = load_blip_itm_model(device, model_type=blip_type)
128
+
129
+ tokenizer = init_bert_tokenizer()
130
+ queries_batch = [user_question] * bsz
131
+ queries_tok_batch = tokenizer(queries_batch, return_tensors="pt").to(device)
132
+
133
+ num_batches = int(num_display / bsz)
134
+
135
+ avg_gradcams = []
136
+ all_raw_images = []
137
+ itm_scores = []
138
+
139
+ for i in range(num_batches):
140
+ filenames_in_batch = filenames[i * bsz : (i + 1) * bsz]
141
+ raw_images, images = read_and_process_images(filenames_in_batch, vis_processor)
142
+ gradcam, itm_output = compute_gradcam_batch(
143
+ itm_model, images, queries_batch, queries_tok_batch
144
+ )
145
+
146
+ all_raw_images.extend([resize_img(r_img) for r_img in raw_images])
147
+ norm_imgs = [np.float32(r_img) / 255 for r_img in raw_images]
148
+
149
+ for norm_img, grad_cam in zip(norm_imgs, gradcam):
150
+ avg_gradcam = getAttMap(norm_img, grad_cam[0], blur=True)
151
+ avg_gradcams.append(avg_gradcam)
152
+
153
+ with torch.no_grad():
154
+ itm_score = torch.nn.functional.softmax(itm_output, dim=1)
155
+
156
+ itm_scores.append(itm_score)
157
+
158
+ # ========= ITM re-ranking =========
159
+ itm_scores = torch.cat(itm_scores)[:, 1]
160
+ if itm_ranking:
161
+ itm_scores_sorted, indices = torch.sort(itm_scores, descending=True)
162
+
163
+ avg_gradcams_sorted = []
164
+ all_raw_images_sorted = []
165
+ for idx in indices:
166
+ avg_gradcams_sorted.append(avg_gradcams[idx])
167
+ all_raw_images_sorted.append(all_raw_images[idx])
168
+
169
+ avg_gradcams = avg_gradcams_sorted
170
+ all_raw_images = all_raw_images_sorted
171
+
172
+ if show_gradcam:
173
+ images_to_show = iter(avg_gradcams)
174
+ else:
175
+ images_to_show = iter(all_raw_images)
176
+
177
+ for _ in range(num_rows):
178
+ with st.container():
179
+ for col in st.columns(num_cols):
180
+ col.image(next(images_to_show), use_column_width=True, clamp=True)
181
+
182
+
183
+ def read_and_process_images(image_paths, vis_processor):
184
+ raw_images = [read_img(path) for path in image_paths]
185
+ images = [vis_processor(r_img) for r_img in raw_images]
186
+ images_tensors = torch.stack(images).to(device)
187
+
188
+ return raw_images, images_tensors
189
+
190
+
191
+ def compute_gradcam_batch(model, visual_input, text_input, tokenized_text, block_num=6):
192
+ model.text_encoder.base_model.base_model.encoder.layer[
193
+ block_num
194
+ ].crossattention.self.save_attention = True
195
+
196
+ output = model({"image": visual_input, "text_input": text_input}, match_head="itm")
197
+ loss = output[:, 1].sum()
198
+
199
+ model.zero_grad()
200
+ loss.backward()
201
+ with torch.no_grad():
202
+ mask = tokenized_text.attention_mask.view(
203
+ tokenized_text.attention_mask.size(0), 1, -1, 1, 1
204
+ ) # (bsz,1,token_len, 1,1)
205
+ token_length = mask.sum() - 2
206
+ token_length = token_length.cpu()
207
+ # grads and cams [bsz, num_head, seq_len, image_patch]
208
+ grads = model.text_encoder.base_model.base_model.encoder.layer[
209
+ block_num
210
+ ].crossattention.self.get_attn_gradients()
211
+ cams = model.text_encoder.base_model.base_model.encoder.layer[
212
+ block_num
213
+ ].crossattention.self.get_attention_map()
214
+
215
+ # assume using vit large with 576 num image patch
216
+ cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
217
+ grads = (
218
+ grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
219
+ * mask
220
+ )
221
+
222
+ gradcam = cams * grads
223
+ # [enc token gradcam, average gradcam across token, gradcam for individual token]
224
+ # gradcam = torch.cat((gradcam[0:1,:], gradcam[1:token_length+1, :].sum(dim=0, keepdim=True)/token_length, gradcam[1:, :]))
225
+ gradcam = gradcam.mean(1).cpu().detach()
226
+ gradcam = (
227
+ gradcam[:, 1 : token_length + 1, :].sum(dim=1, keepdim=True) / token_length
228
+ )
229
+
230
+ return gradcam, output
app/multipage.py ADDED
@@ -0,0 +1,41 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ """
9
+ This file is the framework for generating multiple Streamlit applications
10
+ through an object oriented framework.
11
+ """
12
+
13
+ # Import necessary libraries
14
+ import streamlit as st
15
+
16
+ # Define the multipage class to manage the multiple apps in our program
17
+ class MultiPage:
18
+ """Framework for combining multiple streamlit applications."""
19
+
20
+ def __init__(self) -> None:
21
+ """Constructor class to generate a list which will store all our applications as an instance variable."""
22
+ self.pages = []
23
+
24
+ def add_page(self, title, func) -> None:
25
+ """Class Method to Add pages to the project
26
+ Args:
27
+ title ([str]): The title of page which we are adding to the list of apps
28
+
29
+ func: Python function to render this page in Streamlit
30
+ """
31
+
32
+ self.pages.append({"title": title, "function": func})
33
+
34
+ def run(self):
35
+ # Drodown to select the page to run
36
+ page = st.sidebar.selectbox(
37
+ "Navigation", self.pages, format_func=lambda page: page["title"]
38
+ )
39
+
40
+ # run the app function
41
+ page["function"]()
app/text_localization.py ADDED
@@ -0,0 +1,105 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import math
9
+
10
+ import numpy as np
11
+ import streamlit as st
12
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
13
+ from lavis.processors import load_processor
14
+ from PIL import Image
15
+
16
+ from app import device, load_demo_image
17
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
18
+
19
+
20
+ def app():
21
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
22
+
23
+ values = list(range(1, 12))
24
+ default_layer_num = values.index(7)
25
+ layer_num = (
26
+ st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
27
+ )
28
+
29
+ st.markdown(
30
+ "<h1 style='text-align: center;'>Text Localization</h1>", unsafe_allow_html=True
31
+ )
32
+
33
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
34
+ text_processor = load_processor("blip_caption")
35
+
36
+ tokenizer = init_bert_tokenizer()
37
+
38
+ instructions = "Try the provided image and text or use your own ones."
39
+ file = st.file_uploader(instructions)
40
+
41
+ query = st.text_input(
42
+ "Try a different input.", "A girl playing with her dog on the beach."
43
+ )
44
+
45
+ submit_button = st.button("Submit")
46
+
47
+ col1, col2 = st.columns(2)
48
+
49
+ if file:
50
+ raw_img = Image.open(file).convert("RGB")
51
+ else:
52
+ raw_img = load_demo_image()
53
+
54
+ col1.header("Image")
55
+ w, h = raw_img.size
56
+ scaling_factor = 720 / w
57
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
58
+ col1.image(resized_image, use_column_width=True)
59
+
60
+ col2.header("GradCam")
61
+
62
+ if submit_button:
63
+ if model_type.startswith("BLIP"):
64
+ blip_type = model_type.split("_")[1]
65
+ model = load_blip_itm_model(device, model_type=blip_type)
66
+
67
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
68
+ qry = text_processor(query)
69
+
70
+ qry_tok = tokenizer(qry, return_tensors="pt").to(device)
71
+
72
+ norm_img = np.float32(resized_image) / 255
73
+
74
+ gradcam, _ = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
75
+
76
+ avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
77
+ col2.image(avg_gradcam, use_column_width=True, clamp=True)
78
+
79
+ num_cols = 4.0
80
+ num_tokens = len(qry_tok.input_ids[0]) - 2
81
+
82
+ num_rows = int(math.ceil(num_tokens / num_cols))
83
+
84
+ gradcam_iter = iter(gradcam[0][2:-1])
85
+ token_id_iter = iter(qry_tok.input_ids[0][1:-1])
86
+
87
+ for _ in range(num_rows):
88
+ with st.container():
89
+ for col in st.columns(int(num_cols)):
90
+ token_id = next(token_id_iter, None)
91
+ if not token_id:
92
+ break
93
+ gradcam_img = next(gradcam_iter)
94
+
95
+ word = tokenizer.decode([token_id])
96
+ gradcam_todraw = getAttMap(norm_img, gradcam_img, blur=True)
97
+
98
+ new_title = (
99
+ '<p style="text-align: center; font-size: 25px;">{}</p>'.format(
100
+ word
101
+ )
102
+ )
103
+ col.markdown(new_title, unsafe_allow_html=True)
104
+ # st.image(image, channels="BGR")
105
+ col.image(gradcam_todraw, use_column_width=True, clamp=True)
app/utils.py ADDED
@@ -0,0 +1,81 @@
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import numpy as np
9
+ import streamlit as st
10
+ import torch
11
+ from lavis.models import BlipBase, load_model
12
+ from matplotlib import pyplot as plt
13
+ from PIL import Image
14
+ from scipy.ndimage import filters
15
+ from skimage import transform as skimage_transform
16
+
17
+
18
+ def resize_img(raw_img):
19
+ w, h = raw_img.size
20
+ scaling_factor = 240 / w
21
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
22
+ return resized_image
23
+
24
+
25
+ def read_img(filepath):
26
+ raw_image = Image.open(filepath).convert("RGB")
27
+
28
+ return raw_image
29
+
30
+
31
+ @st.cache(
32
+ hash_funcs={
33
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
34
+ .cpu()
35
+ .numpy()
36
+ },
37
+ allow_output_mutation=True,
38
+ )
39
+ def load_model_cache(name, model_type, is_eval, device):
40
+ return load_model(name, model_type, is_eval, device)
41
+
42
+
43
+ @st.cache(allow_output_mutation=True)
44
+ def init_bert_tokenizer():
45
+ tokenizer = BlipBase.init_tokenizer()
46
+ return tokenizer
47
+
48
+
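+ # overlay a normalized (optionally blurred) attention map on the image using the jet colormap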
49
+ def getAttMap(img, attMap, blur=True, overlap=True):
50
+ attMap -= attMap.min()
51
+ if attMap.max() > 0:
52
+ attMap /= attMap.max()
53
+ attMap = skimage_transform.resize(attMap, (img.shape[:2]), order=3, mode="constant")
54
+ if blur:
55
+ attMap = filters.gaussian_filter(attMap, 0.02 * max(img.shape[:2]))
56
+ attMap -= attMap.min()
57
+ attMap /= attMap.max()
58
+ cmap = plt.get_cmap("jet")
59
+ attMapV = cmap(attMap)
60
+ attMapV = np.delete(attMapV, 3, 2)
61
+ if overlap:
62
+ attMap = (
63
+ 1 * (1 - attMap**0.7).reshape(attMap.shape + (1,)) * img
64
+ + (attMap**0.7).reshape(attMap.shape + (1,)) * attMapV
65
+ )
66
+ return attMap
67
+
68
+
69
+ @st.cache(
70
+ hash_funcs={
71
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
72
+ .cpu()
73
+ .numpy()
74
+ },
75
+ allow_output_mutation=True,
76
+ )
77
+ def load_blip_itm_model(device, model_type="base"):
78
+ model = load_model(
79
+ "blip_image_text_matching", model_type, is_eval=True, device=device
80
+ )
81
+ return model
app/vqa.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import streamlit as st
9
+ from app import load_demo_image, device
10
+ from app.utils import load_model_cache
11
+ from lavis.processors import load_processor
12
+ from PIL import Image
13
+
14
+
15
+ def app():
16
+ model_type = st.sidebar.selectbox("Model:", ["BLIP"])
17
+
18
+ # ===== layout =====
19
+ st.markdown(
20
+ "<h1 style='text-align: center;'>Visual Question Answering</h1>",
21
+ unsafe_allow_html=True,
22
+ )
23
+
24
+ instructions = """Try the provided image or upload your own:"""
25
+ file = st.file_uploader(instructions)
26
+
27
+ col1, col2 = st.columns(2)
28
+
29
+ col1.header("Image")
30
+ if file:
31
+ raw_img = Image.open(file).convert("RGB")
32
+ else:
33
+ raw_img = load_demo_image()
34
+
35
+ w, h = raw_img.size
36
+ scaling_factor = 720 / w
37
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
38
+
39
+ col1.image(resized_image, use_column_width=True)
40
+ col2.header("Question")
41
+
42
+ user_question = col2.text_input("Input your question!", "What objects are there?")
43
+ qa_button = st.button("Submit")
44
+
45
+ col2.header("Answer")
46
+
47
+ # ===== event =====
48
+ vis_processor = load_processor("blip_image_eval").build(image_size=480)
49
+ text_processor = load_processor("blip_question").build()
50
+
51
+ if qa_button:
52
+ if model_type.startswith("BLIP"):
53
+ model = load_model_cache(
54
+ "blip_vqa", model_type="vqav2", is_eval=True, device=device
55
+ )
56
+
57
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
58
+ question = text_processor(user_question)
59
+
60
+ vqa_samples = {"image": img, "text_input": [question]}
61
+ answers = model.predict_answers(vqa_samples, inference_method="generate")
62
+
63
+ col2.write("\n".join(answers), use_column_width=True)
assets/.DS_Store ADDED
Binary file (6.15 kB). View file
 
assets/chain.png ADDED
assets/model.png ADDED
assets/teaser.png ADDED
docs/.DS_Store ADDED
Binary file (8.2 kB). View file
 
docs/Makefile ADDED
@@ -0,0 +1,20 @@
 
 
 
 
1
+ # Minimal makefile for Sphinx documentation
2
+ #
3
+
4
+ # You can set these variables from the command line, and also
5
+ # from the environment for the first two.
6
+ SPHINXOPTS ?=
7
+ SPHINXBUILD ?= sphinx-build
8
+ SOURCEDIR = source
9
+ BUILDDIR = build
10
+
11
+ # Put it first so that "make" without argument is like "make help".
12
+ help:
13
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14
+
15
+ .PHONY: help Makefile
16
+
17
+ # Catch-all target: route all unknown targets to Sphinx using the new
18
+ # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19
+ %: Makefile
20
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs/_static/.DS_Store ADDED
Binary file (6.15 kB). View file
 
docs/_static/architecture.png ADDED
docs/_static/logo_final.png ADDED
docs/benchmark.rst ADDED
@@ -0,0 +1,348 @@
 
 
 
 
1
+ Benchmark
2
+ ############
3
+
4
+ We provide scripts for evaluating and training models on task datasets. The following benchmark results are included for reference.
5
+
6
+
7
+ ALBEF
8
+ *******
9
+ .. list-table::
10
+ :widths: 30 80 20
11
+
12
+ * - **Pretraining**
13
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
14
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/pretrain.sh>`__
15
+ * -
16
+ - Visual Genome (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_vg.py>`__)
17
+ -
18
+ * -
19
+ - SBU (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_sbu.py>`__)
20
+ -
21
+ * -
22
+ - CC3M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc3m.py>`__)
23
+ -
24
+ * -
25
+ - CC12M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc12m.py>`__)
26
+ -
27
+
28
+ .. list-table::
29
+ :widths: 30 40 20 20 20 30 30
30
+ :header-rows: 1
31
+
32
+ * - **Tasks**
33
+ - **Retrieval**
34
+ - **R1**
35
+ - **R5**
36
+ - **R10**
37
+ - **Training**
38
+ - **Evaluation**
39
+ * - TR
40
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
41
+ - 77.6
42
+ - 94.1
43
+ - 97.2
44
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_coco_retrieval_albef.sh>`__
45
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_coco_retrieval.sh>`__
46
+ * - IR
47
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
48
+ - 61.0
49
+ - 84.5
50
+ - 90.7
51
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_coco_retrieval_albef.sh>`__
52
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_coco_retrieval.sh>`__
53
+ * - TR
54
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
55
+ - 77.6
56
+ - 94.1
57
+ - 97.2
58
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_flickr30k_retrieval_albef.sh>`__
59
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_flickr30k_retrieval.sh>`__
60
+ * - IR
61
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
62
+ - 61.0
63
+ - 84.5
64
+ - 90.7
65
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_flickr30k_retrieval_albef.sh>`__
66
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_flickr30k_retrieval.sh>`__
67
+
68
+
69
+ .. list-table::
70
+ :widths: 20 20 20 20 20
71
+ :header-rows: 1
72
+
73
+ * - **VQA**
74
+ - **test-dev**
75
+ - **test-std/test**
76
+ - **Training**
77
+ - **Evaluation**
78
+ * - VQAv2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
79
+ - 76.35
80
+ - 76.54
81
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_vqa_albef.sh>`__
82
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/test_albef_vqa.sh>`__
83
+ * - OKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
84
+ - NA
85
+ - 54.7
86
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_okvqa_albef.sh>`__
87
+ - NA
88
+ * - AOKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
89
+ - 54.5
90
+ - NA
91
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_aokvqa_albef.sh>`__
92
+ - NA
93
+
94
+
95
+ .. list-table::
96
+ :widths: 20 20 20 20 20
97
+ :header-rows: 1
98
+
99
+ * - **Multimodal Classification**
100
+ - **val**
101
+ - **test**
102
+ - **Training**
103
+ - **Evaluation**
104
+ * - SNLI-VE (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
105
+ - 80.60
106
+ - 81.04
107
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_ve_albef.sh>`__
108
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_albef_ve.sh>`__
109
+ * - NLVR2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
110
+ - 82.47
111
+ - 82.91
112
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_nlvr_albef.sh>`__
113
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_albef_nlvr.sh>`__
114
+
115
+ BLIP
116
+ *******
117
+ .. list-table::
118
+ :widths: 30 80 20
119
+
120
+ * - **Pretraining (14M)**
121
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
122
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/pretrain.sh>`__
123
+ * -
124
+ - Visual Genome (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_vg.py>`__)
125
+ -
126
+ * -
127
+ - SBU (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_sbu.py>`__)
128
+ -
129
+ * -
130
+ - CC3M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc3m.py>`__)
131
+ -
132
+ * -
133
+ - CC12M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc12m.py>`__)
134
+ -
135
+
136
+ .. list-table::
137
+ :widths: 30 40 20 20 20 30 30
138
+ :header-rows: 1
139
+
140
+ * - **Tasks**
141
+ - **Retrieval**
142
+ - **R1**
143
+ - **R5**
144
+ - **R10**
145
+ - **Training**
146
+ - **Evaluation**
147
+ * - TR
148
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
149
+ - 82.0
150
+ - 95.8
151
+ - 98.1
152
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_coco.sh>`__
153
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_coco.sh>`__
154
+ * - IR
155
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
156
+ - 64.5
157
+ - 86.0
158
+ - 91.7
159
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_coco.sh>`__
160
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_coco.sh>`__
161
+ * - TR
162
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
163
+ - 96.9
164
+ - 99.9
165
+ - 100.0
166
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_flickr.sh>`__
167
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_flickr.sh>`__
168
+ * - IR
169
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
170
+ - 87.5
171
+ - 97.6
172
+ - 98.9
173
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_flickr.sh>`__
174
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_flickr.sh>`__
175
+
176
+
177
+ .. list-table::
178
+ :widths: 20 20 20 20 20
179
+ :header-rows: 1
180
+
181
+ * - **VQA**
182
+ - **test-dev**
183
+ - **test-std/test**
184
+ - **Training**
185
+ - **Evaluation**
186
+ * - VQAv2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
187
+ - 78.23
188
+ - 78.29
189
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_vqa_albef.sh>`__
190
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/test_albef_vqa.sh>`__
191
+ * - OKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
192
+ - NA
193
+ - 55.4
194
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_okvqa.sh>`__
195
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_okvqa.sh>`__
196
+ * - AOKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
197
+ - 56.2
198
+ - 50.1
199
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_aokvqa.sh>`__
200
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_aokvqa.sh>`__
201
+
202
+
203
+ .. list-table::
204
+ :widths: 20 20 20 20 20 20
205
+ :header-rows: 1
206
+
207
+ * - **Image Captioning**
208
+ - **BLEU@4**
209
+ - **CIDEr**
210
+ - **SPICE**
211
+ - **Training**
212
+ - **Evaluation**
213
+ * - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
214
+ - 39.9
215
+ - 133.5
216
+ - 23.7
217
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_caption_coco.sh>`__
218
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_coco_cap.sh>`__
219
+ * - NoCaps (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_nocaps.py>`__)
220
+ - 31.9
221
+ - 109.1
222
+ - 14.7
223
+ - NA
224
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_nocaps.sh>`__
225
+
226
+
227
+ .. list-table::
228
+ :widths: 20 20 20 20 20
229
+ :header-rows: 1
230
+
231
+ * - **Multimodal Classification**
232
+ - **val**
233
+ - **test**
234
+ - **Training**
235
+ - **Evaluation**
236
+ * - NLVR2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
237
+ - 82.48
238
+ - 83.25
239
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_nlvr.sh>`__
240
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_nlvr.sh>`__
241
+
242
+ CLIP
243
+ *******
244
+ .. list-table::
245
+ :widths: 30 40 20 20 20 30
246
+ :header-rows: 1
247
+
248
+ * - **Tasks**
249
+ - **Retrieval (Zero-shot)**
250
+ - **R1**
251
+ - **R5**
252
+ - **R10**
253
+ - **Evaluation**
254
+ * - TR
255
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
256
+ - 57.2
257
+ - 80.5
258
+ - 87.8
259
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_coco.sh>`__
260
+ * - IR
261
+ - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
262
+ - 36.5
263
+ - 60.8
264
+ - 71.0
265
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_coco.sh>`__
266
+ * - TR
267
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
268
+ - 86.5
269
+ - 98.0
270
+ - 99.1
271
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_flickr.sh>`__
272
+ * - IR
273
+ - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
274
+ - 67.0
275
+ - 88.9
276
+ - 93.3
277
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_flickr.sh>`__
278
+
279
+ .. list-table::
280
+ :widths: 20 20 20
281
+ :header-rows: 1
282
+
283
+ * - **Multimodal Classification**
284
+ - **val**
285
+ - **Evaluation**
286
+ * - ImageNet
287
+ - 76.5
288
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_zs_imnet.sh>`__
289
+
290
+
291
+ ALPRO
292
+ *******
293
+ .. list-table::
294
+ :widths: 30 40 20 20 20 20 30
295
+ :header-rows: 1
296
+
297
+ * - **Tasks**
298
+ - **Retrieval**
299
+ - **R1**
300
+ - **R5**
301
+ - **R10**
302
+ - **Training**
303
+ - **Evaluation**
304
+ * - TR
305
+ - MSRVTT (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_msrvtt.py>`__)
306
+ - 33.2
307
+ - 60.5
308
+ - 71.7
309
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_ret.sh>`__
310
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_ret.sh>`__
311
+ * - VR
312
+ - MSRVTT (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_msrvtt.py>`__)
313
+ - 33.8
314
+ - 61.4
315
+ - 72.7
316
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_ret.sh>`__
317
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_ret.sh>`__
318
+ * - TR
319
+ - DiDeMo (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_didemo.py>`__)
320
+ - 38.8
321
+ - 66.4
322
+ - 76.8
323
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_didemo_ret.sh>`__
324
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_didemo_ret.sh>`__
325
+ * - VR
326
+ - DiDeMo (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_didemo.py>`__)
327
+ - 36.6
328
+ - 67.5
329
+ - 77.9
330
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_didemo_ret.sh>`__
331
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_didemo_ret.sh>`__
332
+
333
+ .. list-table::
334
+ :widths: 20 20 20 20
335
+ :header-rows: 1
336
+
337
+ * - **Video QA**
338
+ - **test**
339
+ - **Training**
340
+ - **Evaluation**
341
+ * - MSRVTT
342
+ - 42.1
343
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_qa.sh>`__
344
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_qa.sh>`__
345
+ * - MSVD
346
+ - 46.0
347
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msvd_qa.sh>`__
348
+ - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msvd_qa.sh>`__
docs/build_docs.sh ADDED
@@ -0,0 +1,101 @@
 
 
 
 
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+
4
+ # Change to root directory of repo
5
+ DIRNAME=$(cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
6
+ cd "${DIRNAME}/.."
7
+
8
+ # # Set up virtual environment
9
+ pip3 install setuptools wheel virtualenv
10
+ if [ ! -d venv ]; then
11
+ rm -f venv
12
+ virtualenv venv
13
+ fi
14
+ source venv/bin/activate
15
+
16
+ # # Get current git branch & stash unsaved changes
17
+ GIT_BRANCH=$(git branch --show-current)
18
+ if [ -z "${GIT_BRANCH}" ]; then
19
+ GIT_BRANCH="main"
20
+ fi
21
+ git stash
22
+
23
+ # Set up exit handler to restore git state & delete temp branches
24
+ # function exit_handler {
25
+ # git reset --hard
26
+ # git checkout "${GIT_BRANCH}" --
27
+ # git stash pop || true
28
+ # for version in $(git tag --list 'v[0-9]*'); do
29
+ # branch="${version}_local_docs_only"
30
+ # if git show-ref --verify --quiet "refs/heads/$branch"; then
31
+ # git branch -D "$branch"
32
+ # fi
33
+ # done
34
+ # }
35
+ # trap exit_handler EXIT
36
+
37
+ # Clean up build directory and install Sphinx requirements
38
+ pip3 install -r "${DIRNAME}/requirements.txt"
39
+ sphinx-build -M clean "${DIRNAME}" "${DIRNAME}/_build"
40
+
41
+ # Build API docs for current head
42
+ export current_version="latest"
43
+ pip3 install "."
44
+ sphinx-build -b html "${DIRNAME}" "${DIRNAME}/_build/html/${current_version}" -W --keep-going
45
+ rm -rf "${DIRNAME}/_build/html/${current_version}/.doctrees"
46
+ #pip3 uninstall -y omnixai
47
+
48
+ # Install all previous released versions
49
+ # and use them to build the appropriate API docs.
50
+ # Uninstall after we're done with each one.
51
+ # versions=()
52
+ # checkout_files=("${DIRNAME}/*.rst" "lavis" "tutorials" "setup.py")
53
+ # for version in $(git tag --list 'v[0-9]*'); do
54
+ # versions+=("$version")
55
+ # git checkout -b "${version}_local_docs_only"
56
+ # for f in $(git diff --name-only --diff-filter=A "tags/${version}" "${DIRNAME}/*.rst"); do
57
+ # git rm "$f"
58
+ # done
59
+ # git checkout "tags/${version}" -- "${checkout_files[@]}"
60
+ # export current_version=${version}
61
+ # pip3 install ".[all]"
62
+ # sphinx-build -b html "${DIRNAME}" "${DIRNAME}/_build/html/${current_version}" -W --keep-going
63
+ # rm -rf "${DIRNAME}/_build/html/${current_version}/.doctrees"
64
+ # #pip3 uninstall -y omnixai
65
+ # git reset --hard
66
+ # git checkout "${GIT_BRANCH}" --
67
+ # done
68
+
69
+ # Determine the latest stable version if there is one
70
+ # if (( ${#versions[@]} > 0 )); then
71
+ # stable_hash=$(git rev-list --tags --max-count=1)
72
+ # stable_version=$(git describe --tags "$stable_hash")
73
+ # export stable_version
74
+ # else
75
+ export stable_version="latest"
76
+ # fi
77
+
78
+ # Create dummy HTML's for the stable version in the base directory
79
+ while read -r filename; do
80
+ filename=$(echo "$filename" | sed "s/\.\///")
81
+ n_sub=$(echo "$filename" | (grep -o "/" || true) | wc -l)
82
+ prefix=""
83
+ for (( i=0; i<n_sub; i++ )); do
84
+ prefix+="../"
85
+ done
86
+ url="${prefix}${stable_version}/$filename"
87
+ mkdir -p "${DIRNAME}/_build/html/$(dirname "$filename")"
88
+ cat > "${DIRNAME}/_build/html/$filename" <<EOF
89
+ <!DOCTYPE html>
90
+ <html>
91
+ <head>
92
+ <title>LAVIS Documentation</title>
93
+ <meta http-equiv = "refresh" content="0; url='$url'" />
94
+ </head>
95
+ <body>
96
+ <p>Please wait while you're redirected to our <a href="$url">documentation</a>.</p>
97
+ </body>
98
+ </html>
99
+ EOF
100
+ done < <(cd "${DIRNAME}/_build/html/$stable_version" && find . -name "*.html")
101
+ echo "Finished writing to _build/html."
docs/conf.py ADDED
@@ -0,0 +1,56 @@
 
 
 
 
1
+ # Configuration file for the Sphinx documentation builder.
2
+ #
3
+ # This file only contains a selection of the most common options. For a full
4
+ # list see the documentation:
5
+ # https://www.sphinx-doc.org/en/master/usage/configuration.html
6
+
7
+ # -- Path setup --------------------------------------------------------------
8
+
9
+ # If extensions (or modules to document with autodoc) are in another directory,
10
+ # add these directories to sys.path here. If the directory is relative to the
11
+ # documentation root, use os.path.abspath to make it absolute, like shown here.
12
+ #
13
+ # import os
14
+ # import sys
15
+ # sys.path.insert(0, os.path.abspath('.'))
16
+
17
+
18
+ # -- Project information -----------------------------------------------------
19
+
20
+ project = "LAVIS"
21
+ copyright = "2022, salesforce.com inc."
22
+ author = (
23
+ "Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C.H. Hoi"
24
+ )
25
+
26
+
27
+ # -- General configuration ---------------------------------------------------
28
+
29
+ # Add any Sphinx extension module names here, as strings. They can be
30
+ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
31
+ # ones.
32
+ extensions = ["nbsphinx"]
33
+
34
+ # Add any paths that contain templates here, relative to this directory.
35
+ templates_path = ["_templates"]
36
+
37
+ # List of patterns, relative to source directory, that match files and
38
+ # directories to ignore when looking for source files.
39
+ # This pattern also affects html_static_path and html_extra_path.
40
+ exclude_patterns = []
41
+
42
+
43
+ # -- Options for HTML output -------------------------------------------------
44
+
45
+ # The theme to use for HTML and HTML Help pages. See the documentation for
46
+ # a list of builtin themes.
47
+ #
48
+ # html_theme = "alabaster"
49
+ html_theme = "sphinx_rtd_theme"
50
+
51
+ # Add any paths that contain custom static files (such as style sheets) here,
52
+ # relative to this directory. They are copied after the builtin static files,
53
+ # so a file named "default.css" will overwrite the builtin "default.css".
54
+ html_static_path = ["_static"]
55
+
56
+ # pygments_style = "sphinx"
docs/getting_started.rst ADDED
@@ -0,0 +1,233 @@
 
 
 
 
 
1
+ Dataset Zoo
2
+ ##################
3
+ LAVIS inherently supports a wide variety of common language-vision datasets by providing automatic download scripts to help download and organize these datasets;
4
+ and implements PyTorch datasets for these datasets. To view supported datasets, use the following code:
5
+
6
+ .. code-block:: python
7
+
8
+ from lavis.datasets.builders import dataset_zoo
9
+ dataset_names = dataset_zoo.get_names()
10
+ print(dataset_names)
11
+ # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
12
+ # 'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
13
+ # 'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
14
+ # 'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
15
+ print(len(dataset_names))
16
+ # 23
17
+
18
+
19
+ Auto-Downloading and Loading Datasets
20
+ ######################################
21
+ We now take COCO caption dataset as an example to demonstrate how to download and prepare the dataset.
22
+
23
+ In ``lavis/datasets/download_scripts/``, we provide tools to download most common public language-vision datasets supported by LAVIS.
24
+ The COCO caption dataset uses images from COCO dataset. Therefore, we first download COCO images via:
25
+
26
+ .. code-block:: bash
27
+
28
+ cd lavis/datasets/download_scripts/ && python download_coco.py
29
+
30
+ This will automatically download and extract COCO images to the default LAVIS cache location.
31
+ The default cache location is ``~/.cache/lavis``, defined in ``lavis/configs/default.yaml``.
32
+
33
+ After downloading the images, we can use ``load_dataset()`` to obtain the dataset. On the first run, this will automatically download and cache annotation files.
34
+
35
+ .. code-block:: python
36
+
37
+ from lavis.datasets.builders import load_dataset
38
+ coco_dataset = load_dataset("coco_caption")
39
+
40
+ print(coco_dataset.keys())
41
+ # dict_keys(['train', 'val', 'test'])
42
+
43
+ print(len(coco_dataset["train"]))
44
+ # 566747
45
+
46
+ print(coco_dataset["train"][0])
47
+ # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
48
+ # 'text_input': 'A woman wearing a net on her head cutting a cake. ',
49
+ # 'image_id': 0}
50
+
51
+ If you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location to load images.
52
+
53
+ .. code-block:: python
54
+
55
+ coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
56
+
57
+
58
+ Model Zoo
59
+ ####################################
60
+ LAVIS supports a growing list of pre-trained models for different tasks,
61
+ datatsets and of varying sizes. Let's get started by viewing the supported models.
62
+
63
+ .. code-block:: python
64
+
65
+ from lavis.models import model_zoo
66
+ print(model_zoo)
67
+ # ==================================================
68
+ # Architectures Types
69
+ # ==================================================
70
+ # albef_classification base, ve
71
+ # albef_nlvr base
72
+ # albef_pretrain base
73
+ # albef_retrieval base, coco, flickr
74
+ # albef_vqa base, vqav2
75
+ # alpro_qa base, msrvtt, msvd
76
+ # alpro_retrieval base, msrvtt, didemo
77
+ # blip_caption base, base_coco, large, large_coco
78
+ # blip_classification base
79
+ # blip_feature_extractor base
80
+ # blip_nlvr base
81
+ # blip_pretrain base
82
+ # blip_retrieval base, coco, flickr
83
+ # blip_vqa base, vqav2
84
+ # clip ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
85
+
86
+ # show total number of support model variants
87
+ len(model_zoo)
88
+ # 33
89
+
90
+
91
+ Inference with Pre-trained Models
92
+ ####################################
93
+
94
+ Now let's see how to use models in LAVIS to perform inference on example data. We first
95
+ load a sample image from local.
96
+
97
+ .. code-block:: python
98
+
99
+ from PIL import Image
100
+
101
+ # setup device to use
102
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
103
+
104
+ # load sample image
105
+ raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
106
+
107
+ This example image shows `Merlion park <https://en.wikipedia.org/wiki/Merlion>`_ (`image credit <https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/>`_), a landmark in Singapore.
108
+
109
+ .. image:: _static/merlion.png
110
+
111
+ Image Captioning
112
+ *******************************
113
+ We now use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
114
+ pre-trained model with its preprocessors (transforms), we use ``load_model_and_preprocess()`` with the following arguments:
115
+
116
+ - ``name``: The name of the model to load. This could be a pre-trained model, task model, or feature extractor. See ``model_zoo`` for a full list of model names.
117
+ - ``model_type``: Each architecture has variants trained on different datasets and at different scale. See Types column in ``model_zoo`` for a full list of model types.
118
+ - ``is_eval``: if `True`, set the model to evaluation mode. This is desired for inference or feature extraction.
119
+ - ``devce``: device to load the model to.
120
+
121
+ .. code-block:: python
122
+
123
+ from lavis.models import load_model_and_preprocess
124
+ # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
125
+ # this also loads the associated image processors
126
+ model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
127
+
128
+ # preprocess the image
129
+ # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
130
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
131
+
132
+ # generate caption
133
+ model.generate({"image": image})
134
+ # ['a large fountain spewing water into the air']
135
+
136
+
137
+ You may also load models and their preprocessors separately via ``load_model()`` and ``load_processor()``.
138
+ In BLIP, you can also generate diverse captions by turning nucleus sampling on.
139
+
140
+ .. code-block:: python
141
+
142
+ from lavis.processors import load_processor
143
+ from lavis.models import load_model
144
+
145
+ # load image preprocesser used for BLIP
146
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
147
+ model = load_model(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
148
+
149
+ image = vis_processor(image).unsqueeze(0).to(device)
150
+ model.generate({"image": raw_image}, use_nucleus_sampling=True)
151
+ # one generated random sample: ['some very pretty buildings and some water jets']
152
+
153
+
154
+ Visual question answering (VQA)
155
+ *******************************
156
+ BLIP model is able to answer free-form questions about images in natural language.
157
+ To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
158
+ passed to ``load_model_and_preprocess()``.
159
+
160
+ .. code-block:: python
161
+
162
+ from lavis.models import load_model_and_preprocess
163
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
164
+
165
+ # ask a random question.
166
+ question = "Which city is this photo taken?"
167
+
168
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
169
+ question = txt_processors["eval"](question)
170
+
171
+ model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
172
+ # ['singapore']
173
+
174
+
175
+ Unified Feature Extraction Interface
176
+ ####################################
177
+
178
+ LAVIS provides a unified interface to extract multimodal features from each architecture.
179
+ To extract features, we load the feature extractor variants of each model.
180
+ The multimodal feature can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.
181
+
182
+ .. code-block:: python
183
+
184
+ from lavis.models import load_model_and_preprocess
185
+
186
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
187
+ caption = "a large fountain spewing water into the air"
188
+
189
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
190
+ text_input = txt_processors["eval"](caption)
191
+
192
+ sample = {"image": image, "text_input": [text_input]}
193
+
194
+ features_multimodal = model.extract_features(sample)
195
+ print(features_multimodal.keys())
196
+ # odict_keys(['image_embeds', 'multimodal_embeds'])
197
+ print(features_multimodal.multimodal_embeds.shape)
198
+ # torch.Size([1, 12, 768]), use features_multimodal[:, 0, :] for multimodal classification tasks
199
+
200
+ features_image = model.extract_features(sample, mode="image")
201
+ print(features_image.keys())
202
+ # odict_keys(['image_embeds', 'image_embeds_proj'])
203
+ print(features_image.image_embeds.shape)
204
+ # torch.Size([1, 197, 768])
205
+ print(features_image.image_embeds_proj.shape)
206
+ # torch.Size([1, 197, 256])
207
+
208
+ features_text = model.extract_features(sample, mode="text")
209
+ print(features_text.keys())
210
+ # odict_keys(['text_embeds', 'text_embeds_proj'])
211
+ print(features_text.text_embeds.shape)
212
+ # torch.Size([1, 12, 768])
213
+ print(features_text.text_embeds_proj.shape)
214
+ # torch.Size([1, 12, 256])
215
+
216
+ similarity = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()
217
+ print(similarity)
218
+ # tensor([[0.2622]])
219
+
220
+ Since LAVIS supports a unified feature extraction interface, minimal changes are necessary to use a different model as feature extractor. For example,
221
+ to use ALBEF as the feature extractor, one only needs to change the following line:
222
+
223
+ .. code-block:: python
224
+
225
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)
226
+
227
+ Similarly, to use CLIP as feature extractor:
228
+
229
+ .. code-block:: python
230
+
231
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="base", is_eval=True, device=device)
232
+ # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="RN50", is_eval=True, device=device)
233
+ # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="ViT-L-14", is_eval=True, device=device)
docs/index.rst ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
1
+ .. LAVIS documentation master file, created by
2
+ sphinx-quickstart on Sun Jul 31 10:32:27 2022.
3
+ You can adapt this file completely to your liking, but it should at least
4
+ contain the root `toctree` directive.
5
+
6
+ Welcome to LAVIS's documentation!
7
+ =================================
8
+
9
+ .. toctree::
10
+ :maxdepth: 1
11
+ :caption: Introduction
12
+
13
+ intro
14
+
15
+
16
+ .. toctree::
17
+ :maxdepth: 1
18
+ :caption: Getting Started
19
+
20
+ getting_started
21
+
22
+
23
+ .. :maxdepth: 1
24
+ .. :caption: Advanced Training
25
+
26
+ .. advanced_training
27
+
28
+
29
+ .. toctree::
30
+ :maxdepth: 2
31
+ :caption: Advanced Usage
32
+
33
+ benchmark
34
+ tutorial
35
+
36
+
37
+ .. Documentations
38
+ .. ===================
39
+
40
+
41
+ Indices and tables
42
+ ==================
43
+
44
+ * :ref:`genindex`
45
+ * :ref:`modindex`
46
+ * :ref:`search`
docs/intro.rst ADDED
@@ -0,0 +1,99 @@
 
 
 
 
1
+ What is LAVIS?
2
+ ####################################
3
+
4
+ LAVIS is a Python deep learning library for LAnguage-and-VISion research and applications.
5
+ It features a unified design to access state-of-the-art foundation language-vision models (`ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
6
+ `BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_), common tasks
7
+ (retrieval, captioning, visual question answering, multimodal classification etc.) and datasets (COCO, Flickr, Nocaps, Conceptual
8
+ Commons, SBU, etc.).
9
+
10
+ This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal
11
+ scenarios, and benchmark them across standard and customized datasets.
12
+
13
+ Key features of LAVIS include:
14
+
15
+ - **Modular and Extensible Library Design**: makes it easy to utilize and repurpose existing modules (datasets, models, preprocessors), and to add new modules.
16
+
17
+ - **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
18
+
19
+ - **Reproducible Model Zoo**: provides training/pre-training recipes to easily replicate and extend state-of-the-art models.
20
+
21
+ - **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.
22
+
23
+ Other features include:
24
+
25
+ - **Distributed Training** using multiple GPUs on one machine or across multiple machines.
26
+
27
+ - **Web Demo**: try supported models on your own pictures, questions etc.
28
+
29
+ - **Leaderboard**: comparing state-of-the-art models across standard datasets.
30
+
31
+ - **Dataset Explorer**: help browse and understand language-vision datasets.
32
+
33
+ Supported Tasks, Models and Datasets
34
+ ####################################
35
+
36
+ The following table shows the language-vision tasks and models currently supported by LAVIS. Adapting existing models to more tasks is possible and will come in future releases.
37
+
38
+ ======================================== =========================== ============================================= ============
39
+ Tasks Supported Models Supported Datasets Modalities
40
+ ======================================== =========================== ============================================= ============
41
+ Image-text Pre-training ALBEF, BLIP COCO, VisualGenome, SBU, ConceptualCaptions image, text
42
+ Image-text Retrieval ALBEF, BLIP, CLIP COCO, Flickr30k image, text
43
+ Text-image Retrieval ALBEF, BLIP, CLIP COCO, Flickr30k image, text
44
+ Visual Question Answering ALBEF, BLIP VQAv2, OKVQA, A-OKVQA image, text
45
+ Image Captioning BLIP COCO, NoCaps image, text
46
+ Image Classification CLIP ImageNet image
47
+ Natural Language Visual Reasoning (NLVR) ALBEF, BLIP NLVR2 image, text
48
+ Visual Entailment (VE) ALBEF SNLI-VE image, text
49
+ Visual Dialogue BLIP VisDial image, text
50
+ Video-text Retrieval BLIP, ALPRO MSRVTT, DiDeMo video, text
51
+ Text-video Retrieval BLIP, ALPRO MSRVTT, DiDeMo video, text
52
+ Video Question Answering (VideoQA) BLIP, ALPRO MSRVTT, MSVD video, text
53
+ Video Dialogue VGD-GPT AVSD video, text
54
+ Multimodal Feature Extraction ALBEF, CLIP, BLIP, ALPRO customized image, text
55
+ ======================================== =========================== ============================================= ============
56
+
57
+ Library Design
58
+ ####################################
59
+
60
+ .. image:: _static/architecture.png
61
+ :width: 550
62
+
63
+ LAVIS has six key modules.
64
+
65
+ - ``lavis.runners`` manages the overall training and evaluation lifecycle. It is also responsible for creating required components lazily on demand, such as optimizers, learning rate schedulers and dataloaders. Currently ``RunnerBase`` implements epoch-based training and ``RunnerIter`` implements iteration-based training.
66
+ - ``lavis.tasks`` implements concrete training and evaluation logic per task. A task could be, for example, retrieval, captioning, or pre-training. The rationale for the task abstraction is to accommodate task-specific training and evaluation; for example, evaluating a retrieval model is different from evaluating a classification model.
67
+ - ``lavis.datasets`` is responsible for creating datasets, where ``lavis.datasets.builders`` loads dataset configurations, downloads annotations and returns a dataset object; ``lavis.datasets.datasets`` defines the supported datasets, each is a ``torch.utils.data.Dataset`` instance. We also provide `automatic dataset downloading tools` in ``datasets/download_scripts`` to help prepare common public datasets.
68
+ - ``lavis.models`` holds definition for the supported models and shared model layers.
69
+ - ``lavis.processors`` handles preprocessing of text and images/videos before feeding them to the model. For images and videos, a processor can be thought of as transforms in torchvision; for text input, this may include lowercasing, truncation, etc.
70
+ - ``lavis.common`` module contains shared classes and methods used by multiple other modules. For example,
71
+
72
+ - ``lavis.common.config`` contains classes to store and manipulate configuration files used by LAVIS. In particular, we use a hierarchical configuration design, to allow highly customizable training and evaluation.
73
+ - ``lavis.common.registry`` serves as a centralized place to manage modules that share the same functionalities. It allows building datasets, models, tasks, and learning rate schedulers at runtime by specifying their names as strings in the configuration file (a usage sketch follows this list).
74
+ - ``lavis.common.optims`` contains definitions of learning rate schedulers.
75
+ - ``lavis.common.dist_utils`` contains utilities for distributed training and evaluation.
76
+ - ``lavis.common.utils`` contains miscellaneous utilities, mostly IO-related helper functions.
77
+
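+ As a rough illustration of how the registry ties these pieces together (a minimal sketch; the exact getter names on ``registry`` are an assumption and may differ between LAVIS versions):
+
+ .. code-block:: python
+
+ from lavis.common.registry import registry
+
+ # look up the model class registered under the string name used in configuration files
+ # (similar getters are assumed for tasks, dataset builders and lr schedulers)
+ model_cls = registry.get_model_class("blip_caption")
+ print(model_cls.__name__)
+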
78
+
79
+ Installation
80
+ ############
81
+ 1. (Optional) Creating conda environment
82
+
83
+ .. code-block:: bash
84
+
85
+ conda create -n lavis python=3.8
86
+ conda activate lavis
87
+
88
+ 2. Cloning and building from source
89
+
90
+ .. code-block:: bash
91
+
92
+ git clone https://github.com/salesforce/LAVIS.git
93
+ cd LAVIS
94
+ pip install .
95
+
96
+ If you would like to develop on LAVIS, you may find it easier to build with editable mode::
97
+
98
+ pip install -e .
99
+
docs/make.bat ADDED
@@ -0,0 +1,35 @@
 
 
 
 
1
+ @ECHO OFF
2
+
3
+ pushd %~dp0
4
+
5
+ REM Command file for Sphinx documentation
6
+
7
+ if "%SPHINXBUILD%" == "" (
8
+ set SPHINXBUILD=sphinx-build
9
+ )
10
+ set SOURCEDIR=source
11
+ set BUILDDIR=build
12
+
13
+ if "%1" == "" goto help
14
+
15
+ %SPHINXBUILD% >NUL 2>NUL
16
+ if errorlevel 9009 (
17
+ echo.
18
+ echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
19
+ echo.installed, then set the SPHINXBUILD environment variable to point
20
+ echo.to the full path of the 'sphinx-build' executable. Alternatively you
21
+ echo.may add the Sphinx directory to PATH.
22
+ echo.
23
+ echo.If you don't have Sphinx installed, grab it from
24
+ echo.http://sphinx-doc.org/
25
+ exit /b 1
26
+ )
27
+
28
+ %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29
+ goto end
30
+
31
+ :help
32
+ %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33
+
34
+ :end
35
+ popd
docs/requirements.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
1
+ GitPython
2
+ ipykernel
3
+ nbsphinx==0.8.7
4
+ pandoc
5
+ sphinx
6
+ sphinx_autodoc_typehints
7
+ sphinx_rtd_theme
docs/tutorial.configs.rst ADDED
@@ -0,0 +1,172 @@
 
 
 
 
1
+ .. _config:
2
+
3
+ Training Models on Task Datasets (Commands and Configurations)
4
+ #################################################################
5
+
6
+ LAVIS provides scripts to pre-train and finetune supported models on standard language-vision tasks, stored at ``lavis/run_scripts/``.
7
+ To replicate the experiments, just run these bash scripts. For example, to train BLIP model on the image-text retrieval task with MSCOCO dataset, we can run
8
+
9
+ .. code-block::
10
+
11
+ bash run_scripts/lavis/blip/train/train_retrieval_coco.sh
12
+
13
+ Inside the scripts, we can see
14
+
15
+ .. code-block:: bash
16
+
17
+ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/retrieval_coco_ft.yaml
18
+
19
+ where we start a pytorch distributed training on 8 GPUs (you may change according to your own hardware setup). The ``--cfg-path`` specifys a `runtime configuration file`, specifying
20
+ the task, model, dataset and training recipes.
21
+
22
+ Available options and their descriptions are as below.
23
+
24
+ .. LAVIS executes training and evaluation based on arguments specified in the configuration files. The default model and dataset configurations are defined in ``lavis/configs``. The task-specific configurations are defined in ``lavis/projects``. Task-specific configurations have higher priority over the default configurations.
25
+
26
+ .. The following tables provide explanations for the arguments in the configuration files.
27
+
28
+ .. list-table::
29
+ :widths: 30 40
30
+ :header-rows: 1
31
+
32
+ * - Model Configurations
33
+ - Functionalities
34
+ * - arch
35
+ - | name of the model from the model zoo
36
+ | default: task-dependent
37
+ * - model_type
38
+ - | the type of the model (e.g., base)
39
+ | default: task-dependent
40
+ * - load_pretrained
41
+ - | load pretrained weights
42
+ | default: True (for finetuning task) | False (for pretraining task)
43
+ * - load_finetuned
44
+ - | load task-specific finetuned weights
45
+ | default: False (for finetuning task) | True (for evaluation)
46
+ * - pretrained
47
+ - | URL or local path which stores the pretrained model, defined in the default model configuration file
48
+ | default: task-dependent
49
+ * - finetuned
50
+ - | URL or local path which stores the finetuned model, defined in the default model configuration file
51
+ | default: task-dependent
52
+
53
+ .. list-table::
54
+ :widths: 30 50
55
+ :header-rows: 1
56
+
57
+ * - Dataset Configurations
58
+ - Functionalities
59
+ * - vis_processor
60
+ - | pre-processing of visual input
61
+ | default: task-dependent
62
+ * - text_processor
63
+ - | pre-processing of text input
64
+ | default: task-dependent
65
+ * - build_info
66
+ - | dataset information including the storage location, defined in the default dataset configuration file
67
+ | default: task-dependent
68
+
69
+ .. list-table::
70
+ :widths: 30 50
71
+ :header-rows: 1
72
+
73
+ * - Runtime Configurations
74
+ - Functionalities
75
+ * - task
76
+ - | name of the task
77
+ | default: task-dependent
78
+ * - lr_sched
79
+ - | learning rate schedular
80
+ | default: linear_warmup_cosine_lr
81
+ * - init_lr
82
+ - | initial learning rate (after warmup)
83
+ | default: task-dependent
84
+ * - min_lr
85
+ - | final learning rate after decay
86
+ | default: task-dependent
87
+ * - warmup_lr
88
+ - | starting learning rate for warmup
89
+ | default: init_lr (no warmup)
90
+ * - lr_decay_rate
91
+ - | learning rate decay per epoch for step_lr_shedule
92
+ | default: 0.9
93
+ * - warmup_steps
94
+ - | number of steps for learning rate warmup
95
+ | default: 0
96
+ * - max_epoch
97
+ - | total number of training epochs
98
+ | default: task-dependent
99
+ * - weight_decay
100
+ - | weight decay coefficient for the optimizer
101
+ | default: 0.05
102
+ * - batch_size_train
103
+ - | batch size during training
104
+ | default: task-dependent
105
+ * - batch_size_eval
106
+ - | batch size during evaluation
107
+ | default: task-dependent
108
+ * - seed
109
+ - | pseudo random number generator seed
110
+ | default: 42
111
+ * - output_dir
112
+ - | directory to store logs, results and checkpoints
113
+ | default: task-dependent
114
+ * - resume_ckpt_path
115
+ - | path of the checkpoint to resume training from
116
+ | default: None
117
+ * - evaluate
118
+ - | only perform evaluation without training
119
+ | default: False
120
+ * - train_splits
121
+ - | dataset splits used for training
122
+ | default: ["train"]
123
+ * - valid_splits
124
+ - | dataset splits used for validation
125
+ | default: ["val"]
126
+ * - test
127
+ - | dataset splits used for test
128
+ | default: ["test"]
129
+ * - device
130
+ - | use cpu or gpu (cuda)
131
+ | default: cuda
132
+ * - world_size
133
+ - | number of processes participating in the job
134
+ | default: 1
135
+ * - dist_url
136
+ - | URL specifying how to initialize the process group
137
+ | default: "env://"
138
+ * - distributed
139
+ - | use distributed training
140
+ | default: True
141
+ * - amp
142
+ - | use automatic mixed precision training
143
+ | default: False
144
+
145
+ .. list-table::
146
+ :widths: 40 50
147
+ :header-rows: 1
148
+
149
+ * - Text Generation Configurations
150
+ - Functionalities
151
+ * - max_len
152
+ - | maximum number of text tokens to generate
153
+ | default: 20 (for image captioning)
154
+ * - min_len
155
+ - | minimum number of text tokens to generate
156
+ | default: 5 (for image captioning)
157
+ * - num_beams
158
+ - | number of beams to perform beam search
159
+ | default: 3
160
+
161
+ .. list-table::
162
+ :widths: 40 50
163
+ :header-rows: 1
164
+
165
+ * - Multimodal Retrieval Configurations
166
+ - Functionalities
167
+ * - negative_all_rank
168
+ - | collect negatives from all processes for the image-text matching loss
169
+ | default: True (for coco)
170
+ * - k_test
171
+ - | number of retrieval candidates ranked from contrastive similarity
172
+ | default: 256 (for coco)
docs/tutorial.datasets.rst ADDED
@@ -0,0 +1,424 @@
 
 
 
 
 
1
+ Adding Datasets
2
+ ################################################
3
+
4
+ This is a tutorial on adding a new dataset using ``lavis.datasets`` module.
5
+
6
+ The LAVIS library includes a standard dataset module, which allows customization to add new datasets.
7
+ The ``lavis.datasets`` module is designed such that any new dataset class can be easily added and adapted from our code base, including creating a dataset configuration, and defining and associating new dataset classes.
8
+
9
+ In this tutorial, we will replicate the steps to add a dataset class for the `Audio-Visual Scene-Aware Dialogue (AVSD) <https://arxiv.org/pdf/1901.09107.pdf>`_ benchmark for the video-grounded dialogue task.
10
+
11
+ Dataset Configuration ``lavis.configs.datasets``
12
+ **************************************************************
13
+
14
+ First, we define the basic configurations for this dataset, including a new dataset class ``avsd_dialogue``, dataset card, and data types.
15
+ We can define any new dataset configuration in ``lavis.configs.datasets``. For instance, under this module, we can set up a configuration file ``avsd/defaults_dial.yaml`` as follows:
16
+
17
+ .. code-block:: yaml
18
+
19
+ datasets:
20
+ avsd_dialogue: # name of the dataset builder
21
+ dataset_card: dataset_card/avsd_dialogue.md # path to the dataset card
22
+ data_type: features # [images|videos|features] we use features in this case for extracted video features
23
+
24
+ build_info:
25
+ # Be careful not to append minus sign (-) before split to avoid itemizing
26
+ annotations:
27
+ train:
28
+ url: /export/home/data/avsd/train_set4DSTC7-AVSD.json
29
+ storage: avsd/annotations/train.json
30
+ val:
31
+ url: /export/home/data/avsd/valid_set4DSTC7-AVSD.json
32
+ storage: avsd/annotations/val.json
33
+ test:
34
+ url: /export/home/data/avsd/test_set4DSTC7-AVSD.json
35
+ storage: avsd/annotations/test.json
36
+ features:
37
+ storage: /export/home/data/avsd/features/
38
+
39
+
40
+ Dataset Card
41
+ ===============
42
+ One optional step to set up dataset configuration is defining a dataset card, which contains more details about the dataset such as description, tasks, and metrics.
43
+ For instance, we can define a dataset card for the AVSD benchmark in ``dataset_card/avsd_dialogue.md``.
44
+ Depending on the dataset, its dataset card may include a command for auto-downloading the data (with Python code defined in ``lavis.datasets.download_scripts``) that automatically fetches the data and stores it in a specific folder.
45
+ Otherwise, the dataset card should describe how to download the data from the original source so that the dataset can be loaded properly.
46
+
47
+ One example of a dataset card for the AVSD benchmark is:
48
+
49
+ .. code-block:: md
50
+
51
+ ![Samples from the AVSD dataset (image credit: https://arxiv.org/pdf/1901.09107.pdf)](imgs/avsd_dialogue.png)
52
+
53
+ # Audio-Visual Scene-Aware Dialogues (AVSD)
54
+
55
+ ## Description
56
+ [Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each of which is grounded on a unique video. In the test split, for each test sample, 6 reference dialogue responses are provided.
57
+
58
+
59
+ ## Task
60
+
61
+ (https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
62
+
63
+ In a **video-grounded dialogue task**, the system must generate responses to user input in the context of a given dialog.
64
+ This context consists of a dialog history (previous utterances by both user and system) in addition to the video and audio information that comprises the scene. The quality of a system’s automatically generated sentences is evaluated using objective measures to determine whether or not the generated responses are natural and informative.
65
+
66
+ ## Metrics
67
+ Models are typically evaluated according to [BLEU](https://aclanthology.org/P02-1040/), [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.
68
+
69
+ ## Leaderboard
70
+
71
+ ....
72
+
73
+
74
+ ## Auto-Downloading
75
+
76
+ Please refer to the [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions to download the dataset.
77
+
78
+
79
+ ## References
80
+ "Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
81
+
82
+ Visual Data Type
83
+ ==============================
84
+ We currently limit the visual data types to one of three options: ``images``, ``videos``, and ``features``.
85
+ "Images" and "videos" refer to the raw visual data, which is appropriate for models processing visual data in their original forms (e.g. ViT models).
86
+ "Features" are visual representations extracted from pretrained models (e.g. CNN models).
87
+ In this tutorial, the AVSD benchmark consists of video features extracted from 3D-CNN models.
88
+
89
+ Build Info
90
+ ==============================
91
+ Build info refers to the specific locations where data is stored and cached.
92
+
93
+ For text annotations (e.g. captioning or dialogues), by default, we include three data splits, namely "train", "val", and "test", typically used in all machine learning projects.
94
+ For each split, we specify 2 parameters: ``url`` and ``storage``.
95
+ ``url`` can be either an online URL where the dataset can be loaded automatically (e.g. from *googleapis*), or a local directory where data is already downloaded beforehand.
96
+ ``storage`` is the directory where the data will be cached over time, avoiding downloading data repeatedly.
97
+
98
+ For visual data annotations, ensure the field name matches the data type defined earlier (i.e. one of "images", "videos", or "features").
99
+ As visual features are usually large and should be downloaded beforehand, we maintain only a ``storage`` parameter where visual data is cached.
100
+
101
+ Dataset ``lavis.datasets.datasets``
102
+ **************************************************************
103
+
104
+ Base Dataset ``lavis.datasets.datasets.base_dataset``
105
+ =======================================================
106
+ In this step, we want to define new dataset classes that inherit our base dataset class ``lavis.datasets.datasets.base_dataset``. This base dataset class already defines standard methods such as ``collater``, which uses the default collate function from PyTorch.
107
+
108
+ .. code-block:: python
109
+
110
+ import json
111
+ from typing import Iterable
112
+
113
+ from torch.utils.data import Dataset, ConcatDataset
114
+ from torch.utils.data.dataloader import default_collate
115
+
116
+ class BaseDataset(Dataset):
117
+ def __init__(
118
+ self, vis_processor=None, text_processor=None, vis_root=None, ann_paths=[]
119
+ ):
120
+ """
121
+ vis_root (string): Root directory of images (e.g. coco/images/)
122
+ ann_root (string): directory to store the annotation file
123
+ """
124
+ self.vis_root = vis_root
125
+
126
+ self.annotation = []
127
+ for ann_path in ann_paths:
128
+ self.annotation.extend(json.load(open(ann_path, "r")))
129
+
130
+ self.vis_processor = vis_processor
131
+ self.text_processor = text_processor
132
+
133
+ self._add_instance_ids()
134
+
135
+ def __len__(self):
136
+ return len(self.annotation)
137
+
138
+ def collater(self, samples):
139
+ return default_collate(samples)
140
+
141
+ def set_processors(self, vis_processor, text_processor):
142
+ self.vis_processor = vis_processor
143
+ self.text_processor = text_processor
144
+
145
+ def _add_instance_ids(self, key="instance_id"):
146
+ for idx, ann in enumerate(self.annotation):
147
+ ann[key] = str(idx)
148
+
149
+ Any dataset subclass inherits these methods, and you may optionally override them according to the specifications of your dataset.
150
+ We encourage users not to modify the base dataset class, as any modification will have cascading effects on all dataset classes that inherit from it.
151
+ Instead, users should create new dataset classes to cater to their specific requirements.
152
+
153
+ Dialogue Datasets ``lavis.datasets.datasets.dialogue_datasets``
154
+ ======================================================================
155
+
156
+ For example, for the AVSD dataset, we want to define a new dataset subclass ``DialogueDataset`` for dialogue tasks. We can define this dataset class in ``lavis.datasets.datasets.dialogue_datasets`` as follows:
157
+
158
+ .. code-block:: python
159
+
160
+ import os
161
+ from collections import OrderedDict
162
+
163
+ from lavis.datasets.datasets.base_dataset import BaseDataset
164
+
165
+ import json
166
+ import copy
167
+
168
+ class DialogueDataset(BaseDataset):
169
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
170
+ """
171
+ vis_processor (string): visual processor
172
+ text_processor (string): textual processor
173
+ vis_root (string): Root directory of images (e.g. coco/images/)
174
+ ann_paths (list): paths to the annotation files
175
+ """
176
+
177
+ self.vis_root = vis_root
178
+
179
+ self.annotation = []
180
+ for ann_path in ann_paths:
181
+ dialogs = json.load(open(ann_path, "r"))['dialogs']
182
+ for dialog in dialogs:
183
+ all_turns = dialog['dialog']
184
+ dialogue_context = []
185
+ for turn in all_turns:
186
+ dialog_instance = copy.deepcopy(dialog)
187
+ question = turn['question']
188
+ answer = turn['answer']
189
+
190
+ dialog_instance['dialog'] = copy.deepcopy(dialogue_context)
191
+ dialog_instance['question'] = question
192
+ dialog_instance['answer'] = answer
193
+ self.annotation.append(dialog_instance)
194
+ dialogue_context.append(turn)
195
+
196
+ self.vis_processor = vis_processor
197
+ self.text_processor = text_processor
198
+
199
+ self._add_instance_ids()
200
+
201
+ self.img_ids = {}
202
+ n = 0
203
+ for ann in self.annotation:
204
+ img_id = ann["image_id"]
205
+ if img_id not in self.img_ids.keys():
206
+ self.img_ids[img_id] = n
207
+ n += 1
208
+
209
+ Class inheritance allows us to define multiple subclasses. For instance, we may want another dialogue dataset class that is used only for the test split. We can define a ``DialogueEvalDataset`` class similar to the one above, except that the annotations are processed differently.
210
+ Typically, in dialogue tasks, only a single test sample is constructed per dialogue at test time (rather than decomposing every dialogue turn into a sample, as is done at training time).
211
+ The dataset class can then be defined as:
212
+
213
+ .. code-block:: python
214
+
215
+ class DialogueEvalDataset(BaseDataset):
216
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
217
+ # ...
218
+ # defined similarly as DialogueDataset above
219
+ # except for the loading of dialogue annotation data
220
+
221
+ self.annotation = []
222
+ for ann_path in ann_paths:
223
+ dialogs = json.load(open(ann_path, "r"))['dialogs']
224
+ for dialog in dialogs:
225
+ all_turns = dialog['dialog']
226
+ dialogue_context = all_turns[:-1]
227
+ last_turn = all_turns[-1]
228
+
229
+ question = last_turn['question']
230
+ answer = last_turn['answer']
231
+
232
+ dialog['dialog'] = dialogue_context
233
+ dialog['question'] = question
234
+ dialog['answer'] = answer
235
+
236
+ self.annotation.append(dialog)
237
+
238
+
239
+ Using class inheritance to define datasets also allows us to develop more fine-grained class implementations, each of which is designated for a specific benchmark.
240
+ For instance, under dialogue-based tasks, we can further define another dataset subclass specific to the AVSD dataset.
241
+ We can define a new class ``AVSDDialDataset`` that further specifies how to load individual samples and collate them according to specific requirements:
242
+
243
+ .. code-block:: python
244
+
245
+ import os
246
+ from lavis.datasets.datasets.base_dataset import BaseDataset
247
+ from lavis.datasets.datasets.dialogue_datasets import DialogueDataset, DialogueEvalDataset
248
+
249
+ import torch
250
+
251
+ class AVSDDialDataset(DialogueDataset):
252
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
253
+
254
+ super().__init__(vis_processor, text_processor, vis_root, ann_paths)
255
+
256
+ def __getitem__(self, index):
257
+
258
+ ann = self.annotation[index]
259
+
260
+ vname = ann["image_id"]
261
+
262
+ video = self.vis_processor(self.vis_root, vname)
263
+
264
+ dialogue = self.text_processor(ann)
265
+
266
+ return {
267
+ "video_fts": video['video_fts'],
268
+ "video_token_type_ids": video['token_type_ids'],
269
+ "input_ids": dialogue['input_ids'],
270
+ "token_type_ids": dialogue['token_type_ids'],
271
+ "labels": dialogue['labels'],
272
+ "image_id": ann["image_id"],
273
+ "instance_id": ann["instance_id"]
274
+ }
275
+
276
+ def collater(self, samples):
277
+
278
+ input_ids, token_type_ids, labels, video_fts, video_token_type_ids = [], [], [], [], []
279
+
280
+ for i in samples:
281
+ input_ids.append(i['input_ids'])
282
+ token_type_ids.append(i['token_type_ids'])
283
+ labels.append(i['labels'])
284
+ video_fts.append(i['video_fts'])
285
+ video_token_type_ids.append(i['video_token_type_ids'])
286
+
287
+ input_ids = self.text_processor.padding(input_ids)
288
+
289
+ labels = self.text_processor.padding(labels, -1)
290
+ video_fts = self.vis_processor.padding(video_fts)
291
+
292
+ token_type_ids = self.text_processor.padding(token_type_ids)
293
+ video_token_type_ids = self.text_processor.padding(video_token_type_ids)
294
+ token_type_ids = torch.cat([video_token_type_ids, token_type_ids], dim=1)
295
+
296
+ attn_mask = self.text_processor.get_attention_mask(input_ids)
297
+ video_mask = self.vis_processor.get_attention_mask(video_fts)
298
+ attn_mask = torch.cat([video_mask, attn_mask], dim=1)
299
+
300
+ video_labels = torch.ones((video_fts.size(0), video_fts.size(1))).long() * -1 # label index -1 is ignored by default
301
+
302
+ labels = torch.cat([video_labels, labels], dim=1)
303
+
304
+ samples = {}
305
+ samples['input_ids'] = input_ids
306
+ samples['token_type_ids'] = token_type_ids
307
+ samples['labels'] = labels
308
+ samples['video_fts'] = video_fts
309
+ samples['attn_mask'] = attn_mask
310
+
311
+ return samples
312
+
313
+ Note that in a dataset subclass, if methods such as ``__getitem__`` and ``collater`` are not defined, the same functions from the corresponding superclass will be used.
314
+ For instance, by default, we always use the collater from the ``BaseDataset`` class to collate data samples.
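+
+ As a minimal, hypothetical sketch (the class and annotation field names below are illustrative, not part of LAVIS), a subclass that only overrides ``__getitem__`` automatically reuses ``BaseDataset.collater``:
+
+ .. code-block:: python
+
+     class MyImageTextDataset(BaseDataset):
+         # No collater is defined here, so BaseDataset.collater
+         # (PyTorch's default_collate) is used when batching.
+         def __getitem__(self, index):
+             ann = self.annotation[index]
+             image = self.vis_processor(ann["image"])      # hypothetical field
+             text = self.text_processor(ann["caption"])    # hypothetical field
+             return {"image": image, "text_input": text, "instance_id": ann["instance_id"]}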
315
+
316
+ Dataset Builder ``lavis.datasets.builders``
317
+ **************************************************************
318
+ The dataset builder is the data processing module that manages the dataset classes (by training or evaluation split) and associates the specific dataset configurations with these dataset classes.
319
+
320
+ Base Dataset Builder ``lavis.datasets.builders.base_dataset_builder``
321
+ ======================================================================
322
+
323
+ Note that any new builder class definition should inherit the base dataset builder class ``lavis.datasets.builders.base_dataset_builder``:
324
+
325
+ .. code-block:: python
326
+
327
+ class BaseDatasetBuilder:
328
+ train_dataset_cls, eval_dataset_cls = None, None
329
+ ...
330
+
331
+ This allows us to standardize the operations of dataset builders across all builder classes. We advise users to carefully review the standard methods defined in the base builder class, including methods such as ``_download_data`` and ``build_datasets``, which download the data and create instances of the dataset classes:
332
+
333
+ .. code-block:: python
334
+
335
+ class BaseDatasetBuilder:
336
+ ...
337
+
338
+ def build_datasets(self):
339
+ # download, split, etc...
340
+ # only called on 1 GPU/TPU in distributed
341
+
342
+ if is_main_process():
343
+ self._download_data()
344
+
345
+ if is_dist_avail_and_initialized():
346
+ dist.barrier()
347
+
348
+ # at this point, all the annotations and image/videos should be all downloaded to the specified locations.
349
+ logging.info("Building datasets...")
350
+ datasets = self.build() # dataset['train'/'val'/'test']
351
+
352
+ return datasets
353
+
354
+ def _download_data(self):
355
+ self._download_ann()
356
+ self._download_vis()
357
+
358
+ We encourage users not to modify the implementation of the base dataset builder class as this will affect all existing dataset builder subclasses.
359
+
360
+ Dialogue Dataset Builder ``lavis.datasets.builders.dialogue_builder``
361
+ ======================================================================
362
+ We can define any new builder subclass and associate this builder with the corresponding dataset classes and dataset configurations.
363
+ For instance, for the AVSD dataset, we can define a builder ``lavis.datasets.builders.dialogue_builder`` for dialogue-based datasets as follows:
364
+
365
+ .. code-block:: python
366
+
367
+ from lavis.datasets.builders.base_dataset_builder import BaseDatasetBuilder
368
+ from lavis.datasets.datasets.avsd_dialogue_datasets import (
369
+ AVSDDialDataset,
370
+ AVSDDialEvalDataset
371
+ )
372
+
373
+ from lavis.common.registry import registry
374
+
375
+
376
+ @registry.register_builder("avsd_dialogue")
377
+ class AVSDDialBuilder(BaseDatasetBuilder):
378
+ train_dataset_cls = AVSDDialDataset
379
+ eval_dataset_cls = AVSDDialEvalDataset
380
+
381
+ DATASET_CONFIG_DICT = {
382
+ "default": "configs/datasets/avsd/defaults_dial.yaml"
383
+ }
384
+
385
+ Note that we chose to define ``train_dataset_cls`` and ``eval_dataset_cls`` separately to handle cases where data is processed differently at training and test time.
386
+ For instance, in captioning tasks, each test sample often includes multiple ground-truth captions rather than the single ground truth used at training time.
387
+ If the data is processed the same way at training and test time, the two parameters can point to the same dataset class, as in the sketch below.
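+
+ As a hedged sketch (``MyCaptionDataset``, the builder name, and the config path are hypothetical; imports as in the AVSD example above), such a builder simply points both attributes to the same class:
+
+ .. code-block:: python
+
+     @registry.register_builder("my_caption")
+     class MyCaptionBuilder(BaseDatasetBuilder):
+         # identical processing at training and test time
+         train_dataset_cls = MyCaptionDataset
+         eval_dataset_cls = MyCaptionDataset
+
+         DATASET_CONFIG_DICT = {"default": "configs/datasets/my_caption/defaults.yaml"}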
388
+
389
+ Finally, define ``DATASET_CONFIG_DICT`` to associate the dataset configurations with the assigned dataset classes.
390
+
391
+ Registering Builder ``lavis.datasets.builders.__init__``
392
+ ======================================================================
393
+
394
+ To add a new builder class, first include the class in ``__init__.py``. For instance, to expose the new builder for the AVSD dataset:
395
+
396
+ .. code-block:: python
397
+
398
+ from lavis.datasets.builders.dialogue_builder import (
399
+ AVSDDialBuilder
400
+ )
401
+
402
+ __all__ = [
403
+ ...,
404
+ "AVSDDialBuilder"
405
+ ]
406
+
407
+ Assigning Builder
408
+ ======================================================================
409
+ Note that during data loading and processing, the assigned builder must be referenced by its registered name so that it can be loaded properly.
410
+ For instance, the following should be specified in a configuration file, e.g. ``dialogue_avsd_ft.yaml``:
411
+
412
+ .. code-block:: yaml
413
+
414
+ datasets:
415
+ avsd_dialogue: # name of the dataset builder
416
+ ...
417
+ # processor configuration
418
+ ...
419
+
420
+ Subsequently, any process (e.g. training) should load this configuration file to assign the correct builder, which will then associate the correct dataset classes to construct data samples.
421
+
422
+ .. code-block:: sh
423
+
424
+ python train.py --cfg-path dialogue_avsd_ft.yaml
docs/tutorial.evaluation.rst ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Evaluating Pre-trained Models on Task Datasets
2
+ ###############################################
3
+ LAVIS provides pre-trained and finetuned models for off-the-shelf evaluation on task datasets.
4
+ Let's now walk through an example of evaluating the BLIP model on the captioning task, using the MSCOCO dataset.
5
+
6
+ .. _prep coco:
7
+
8
+ Preparing Datasets
9
+ ******************
10
+ First, let's download the dataset. LAVIS provides `automatic downloading scripts` to help prepare
11
+ most of the public datasets. To download the MSCOCO dataset, simply run
12
+
13
+ .. code-block:: bash
14
+
15
+ cd lavis/datasets/download_scripts && python download_coco.py
16
+
17
+ This will put the downloaded dataset at a default cache location ``cache`` used by LAVIS.
18
+
19
+ If you want to use a different cache location, you can specify it by updating ``cache_root`` in ``lavis/configs/default.yaml``.
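+
+ For reference, LAVIS resolves ``cache_root`` relative to the repository root when the package is imported (see ``lavis/__init__.py``). A quick, non-authoritative way to check the value currently in use is to read the default config directly:
+
+ .. code-block:: python
+
+     from omegaconf import OmegaConf
+
+     # path assumes you run this from the repository root
+     default_cfg = OmegaConf.load("lavis/configs/default.yaml")
+     print(default_cfg.env.cache_root)  # e.g. "cache"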
20
+
21
+ If you have a local copy of the dataset, it is recommended to create a symlink from the cache location to the local copy, e.g.
22
+
23
+ .. code-block:: bash
24
+
25
+ ln -s /path/to/local/coco cache/coco
26
+
27
+ Evaluating pre-trained models
28
+ ******************************
29
+
30
+ To evaluate a pre-trained model, simply run
31
+
32
+ .. code-block:: bash
33
+
34
+ bash run_scripts/lavis/blip/eval/eval_coco_cap.sh
35
+
36
+ Or to evaluate a large model:
37
+
38
+ .. code-block:: bash
39
+
40
+ bash run_scripts/lavis/blip/eval/eval_coco_cap_large.sh
docs/tutorial.models.rst ADDED
@@ -0,0 +1,245 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Adding Models
2
+ ####################################
3
+
4
+ This is a tutorial on adding new models using ``lavis.models`` module.
5
+
6
+ The LAVIS library includes a standard model module that builds the foundation for many major language-vision models such as `ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
7
+ `BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, and `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_.
8
+ The ``lavis.models`` module is designed such that any new models can be added and integrated into the LAVIS library, with minimal steps to develop training and testing procedures.
9
+ In this tutorial, we will replicate the steps to add a GPT-style model specifically for `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_.
10
+
11
+ Base Model ``lavis.models.base_model``
12
+ **************************************************************
13
+
14
+ Note that any new model definition should inherit the base model class ``BaseModel``:
15
+
16
+ .. code-block:: python
17
+
18
+ from omegaconf import OmegaConf
19
+
20
+ import numpy as np
21
+
22
+ import torch
23
+ import torch.nn as nn
24
+
25
+ from lavis.common.utils import get_abs_path
26
+
27
+ class BaseModel(nn.Module):
28
+ """Base class for models."""
29
+
30
+ def __init__(self):
31
+ super().__init__()
32
+
33
+ def forward_features(self, *args, **kwargs):
34
+ """Similar to *forward* but only return features."""
35
+ raise NotImplementedError
36
+
37
+ def load_from_pretrained(self, url_or_filename):
38
+ raise NotImplementedError
39
+
40
+ @classmethod
41
+ def _from_config(cls, cfg=None, model_type="base"):
42
+ if not cfg:
43
+ # useful when building model without a provided configuration file
44
+ cfg = OmegaConf.load(cls.default_config_path(model_type)).model
45
+
46
+ return cls.from_config(cfg)
47
+
48
+ @classmethod
49
+ def from_pretrained(cls, model_type="base"):
50
+ """
51
+ Build a pretrained model from the default configuration file, specified by model_type.
52
+ """
53
+ return cls._from_config(cfg=None, model_type=model_type)
54
+
55
+ @property
56
+ def device(self):
57
+ return list(self.parameters())[0].device
58
+
59
+ @classmethod
60
+ def default_config_path(cls, model_type="base"):
61
+ assert (
62
+ model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
63
+ ), "Unknown model type {}".format(model_type)
64
+ return get_abs_path(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])
65
+
66
+ def before_evaluation(self, **kwargs):
67
+ pass
68
+
69
+ def show_n_params(self, return_str=True):
70
+ tot = 0
71
+ for p in self.parameters():
72
+ w = 1
73
+ for x in p.shape:
74
+ w *= x
75
+ tot += w
76
+ if return_str:
77
+ if tot >= 1e6:
78
+ return "{:.1f}M".format(tot / 1e6)
79
+ else:
80
+ return "{:.1f}K".format(tot / 1e3)
81
+ else:
82
+ return tot
83
+
84
+
85
+ In this base model, we already declare and standardize many common methods such as ``_from_config`` and ``from_pretrained``.
86
+ Inheriting this base model class allows us to standardize operations of models across all model classes while still allowing customizations.
87
+ We advise users not to change the implementation of the base model class as this will affect all existing model subclasses.
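+
+ As a quick illustration (``SomeRegisteredModel`` is a placeholder for any subclass that defines ``PRETRAINED_MODEL_CONFIG_DICT``), these inherited methods can be used as follows:
+
+ .. code-block:: python
+
+     # build a model from its default config for the given model type
+     model = SomeRegisteredModel.from_pretrained(model_type="base")
+
+     print(model.show_n_params())  # human-readable parameter count
+     print(model.device)           # device of the first parameter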
88
+
89
+ GPT-style Video-grounded Dialogue Model ``lavis.models.gpt_models.gpt_dialogue``
90
+ ********************************************************************************
91
+
92
+ In this step, we can define a new model class, e.g. under ``lavis.models.gpt_models.gpt_dialogue``, for GPT-based dialogue models designed specifically for video-grounded dialogues.
93
+ Note that we assume the model class inherits from the standard model super class ``GPT2LMHeadModel`` from the ``transformers`` `library <https://huggingface.co/docs/transformers/index>`_.
94
+ We also enforce integration with the LAVIS framework by inheriting ``BaseModel`` from the LAVIS library as the secondary superclass.
95
+
96
+ .. code-block:: python
97
+
98
+ import torch
99
+ from lavis.common.registry import registry
100
+ from lavis.models.base_model import BaseModel
101
+
102
+ from transformers import GPT2Model, GPT2LMHeadModel
103
+ from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
104
+ import math
105
+ import torch
106
+ import torch.nn as nn
107
+ from torch.nn import CrossEntropyLoss, MSELoss
108
+
109
+ @registry.register_model("gpt_dialogue")
110
+ class GPTDialogue(GPT2LMHeadModel, BaseModel):
111
+ ...
112
+
113
+ Next, we can modify the architecture of the model during model initialization to fit the tasks of interest, i.e. video-grounded dialogues.
114
+ In this case, we want to add additional model parameters for a linear network to transform the video feature representations to the model dimension.
115
+
116
+ .. code-block:: python
117
+
118
+ class GPTDialogue(GPT2LMHeadModel, BaseModel):
119
+
120
+ def __init__(self, config, len_video_ft=4224):
121
+
122
+ super().__init__(config)
123
+
124
+ self.video_ff = nn.Linear(len_video_ft, config.n_embd)
125
+
126
+ # Model parallel
127
+ self.model_parallel = False
128
+ self.device_map = None
129
+
130
+ # Initialize weights and apply final processing
131
+ self.post_init()
132
+
133
+ Note that for each new model class, we advise redefining the ``from_config`` method which is inherited from the ``BaseModel`` class.
134
+ As each model usually has its own unique configurations, redefining the method will ensure the model instances are created properly.
135
+ For instance, ``GPTDialogue`` requires an additional parameter of video feature length (``len_video_ft``) which should be part of the model initialization procedure.
136
+ Another additional parameter is the number of tokens/words (as we include additional special tokens in the vocabulary for dialogue tasks).
137
+
138
+ .. code-block:: python
139
+
140
+ class GPTDialogue(GPT2LMHeadModel, BaseModel):
141
+ ...
142
+ @classmethod
143
+ def from_config(cls, cfg):
144
+ model = cls.from_pretrained('gpt2', len_video_ft=cfg['len_video_ft'])
145
+ model.resize_token_embeddings(cfg['len_tokenizer'])
146
+ return model
147
+
148
+ Other basic methods should also be defined explicitly in the new model class, including the ``forward`` function.
149
+ For instance, in GPT models for video-grounded dialogue tasks, we want the forward pass to also include the transformation and integration of video features before passing the representations to the Transformer layers.
150
+
151
+ .. code-block:: python
152
+
153
+ class GPTDialogue(GPT2LMHeadModel, BaseModel):
154
+ ...
155
+
156
+ def forward(self, samples,
157
+ past_key_values=None,
158
+ position_ids=None,
159
+ head_mask=None,
160
+ encoder_hidden_states=None,
161
+ encoder_attention_mask=None,
162
+ use_cache=None,
163
+ output_attentions=None,
164
+ output_hidden_states=None,
165
+ return_dict=None):
166
+
167
+ input_embs = self.transformer.wte(samples['input_ids'])
168
+ video_embs = self.video_ff(samples['video_fts'])
169
+ input_embs = torch.cat([video_embs, input_embs], dim=1)
170
+
171
+ transformer_outputs = self.transformer(
172
+ attention_mask=samples['attn_mask'],
173
+ token_type_ids=samples['token_type_ids'],
174
+ inputs_embeds=input_embs,
175
+ position_ids=position_ids,
176
+ head_mask=head_mask,
177
+ encoder_hidden_states=encoder_hidden_states,
178
+ encoder_attention_mask=encoder_attention_mask,
179
+ use_cache=use_cache,
180
+ output_attentions=output_attentions,
181
+ output_hidden_states=output_hidden_states,
182
+ return_dict=return_dict,
183
+ )
184
+ hidden_states = transformer_outputs[0]
185
+
186
+ lm_logits = self.lm_head(hidden_states)
187
+ ...
188
+
189
+ Registering New Model ``lavis.models.__init__``
190
+ ********************************************************************************
191
+
192
+ Any new model must be officially registered as part of the ``lavis.models`` module.
193
+ For instance, to add a model class for GPT-based dialogue models, we can modify the ``__init__.py`` as follows:
194
+
195
+ .. code-block:: python
196
+
197
+ from lavis.models.gpt_models.gpt_dialogue import GPTDialogue
198
+
199
+ __all__ = [
200
+ ...
201
+ "GPTDialogue"
202
+ ]
203
+
204
+ Assigning Model
205
+ ********************************************************************************
206
+
207
+ From the above example of a model class, note that we define a ``from_config`` method for the new model class.
208
+ This method will process a configuration file and pass specific parameters to initialize the model classes properly.
209
+ To do this, we assign/associate the correct registry name of the model class in a configuration file.
210
+ For instance, the following should be specified in a configuration file e.g. ``dialogue_avsd_ft.yaml``:
211
+
212
+ .. code-block:: yaml
213
+
214
+ model:
215
+ arch: gpt_dialogue # name of the model
216
+ model_type: base
217
+
218
+
219
+ Subsequently, any processes (e.g. training) should load this configuration file to assign the correct model.
220
+
221
+ .. code-block:: sh
222
+
223
+ python train.py --cfg-path dialogue_avsd_ft.yaml
224
+
225
+ Note that to simplify the model configuration, we only enable two main parameters here: ``arch`` and ``model_type``. ``arch`` refers to the registered name of the model class, and ``model_type`` is the corresponding model type within this model family.
226
+ For instance, with ``gpt_dialogue``, we have a model ``base`` which has its own configuration in a separate configuration file e.g. ``gpt_dialogue_base.yaml``:
227
+
228
+ .. code-block:: yaml
229
+
230
+ model:
231
+ arch: gpt_dialogue
232
+ len_tokenizer: 50264 # 50257 tokens from gpt2 default tokenizer + additional special tokens
233
+ len_video_ft: 4224 # i3d_rgb: 2048 i3d_flow: 2048 vggish: 128
234
+
235
+ We can load this configuration and pass the parameters to the above ``from_config`` method to initialize the model accordingly (a short sketch is given at the end of this section).
236
+ We advise users to maintain, in the model class definition, a dictionary that contains default paths to model configurations.
237
+ By default, the LAVIS framework will search for configurations from each model class defined as ``model.PRETRAINED_MODEL_CONFIG_DICT``.
238
+
239
+ .. code-block:: python
240
+
241
+ class GPTDialogue(GPT2LMHeadModel, BaseModel):
242
+ PRETRAINED_MODEL_CONFIG_DICT = {
243
+ "base": "configs/models/gpt_dialogue_base.yaml"
244
+ }
245
+ ...
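+
+ Putting this together, a minimal sketch of building the model from its default configuration (the config path comes from the dictionary above, and loading it manually like this is for illustration only) could look like:
+
+ .. code-block:: python
+
+     from omegaconf import OmegaConf
+     from lavis.models.gpt_models.gpt_dialogue import GPTDialogue
+
+     cfg = OmegaConf.load("lavis/configs/models/gpt_dialogue_base.yaml")
+     model = GPTDialogue.from_config(cfg.model)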
docs/tutorial.processors.rst ADDED
@@ -0,0 +1,233 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Adding Processors
2
+ ################################################
3
+
4
+ This is a tutorial on adding new processors using ``lavis.processors`` module.
5
+
6
+ The LAVIS library includes a standard processor module that preprocesses data, e.g. image transformations and sequence concatenation.
7
+ The ``lavis.processors`` module is designed such that new processors can be added and tailored to the requirements of the corresponding models of interest.
8
+ In this tutorial, we will replicate the steps to add visual and textual processors specifically for `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_.
9
+ In addition, we want the processors to include the processing steps that make data samples compatible with GPT-style models.
10
+
11
+ Base Processor ``lavis.processors.base_processors``
12
+ *****************************************************
13
+
14
+ Note that any new processor definition should inherit the base processor class ``BaseProcessor``:
15
+
16
+ .. code-block:: python
17
+
18
+ from omegaconf import OmegaConf
19
+
20
+ class BaseProcessor:
21
+ def __init__(self):
22
+ self.transform = lambda x: x
23
+ return
24
+
25
+ def __call__(self, item):
26
+ return self.transform(item)
27
+
28
+ @classmethod
29
+ def from_config(cls, cfg=None):
30
+ return cls()
31
+
32
+ def build(self, **kwargs):
33
+ cfg = OmegaConf.create(kwargs)
34
+
35
+ return self.from_config(cfg)
36
+
37
+ This allows us to standardize the operations of processors across all processor classes while still allowing customization for specific data and model types.
38
+ We encourage users not to modify the implementation of the base processor class as this will have an impact on all existing processor subclasses.
39
+
40
+ GPT-style Processors ``lavis.processors.gpt_processors``
41
+ **************************************************************
42
+ In this step, we can define new processor classes, e.g. under ``lavis.processors.gpt_processors``, for GPT models designed specifically for video-grounded dialogues.
43
+ First, we want to process video features by defining a ``GPTVideoFeatureProcessor`` class.
44
+ In this tutorial, we assume video features are extracted beforehand and this processor simply loads the features from ``npy`` files.
45
+ Other methods that are specifically defined are ``padding`` (which is used by dataset instances to pad multiple video samples) and ``get_attention_mask`` (which creates an attention mask for Transformer attention in GPT models).
46
+
47
+ .. code-block:: python
48
+
49
+ SPECIAL_TOKENS_DICT = {'bos_token': "<bos>", 'eos_token': "<eos>", 'additional_special_tokens': ["<speaker1>", "<speaker2>", "<video>", "<cap>"], 'pad_token': "<pad>"}
50
+ ...
51
+
52
+ @registry.register_processor("gpt_video_ft")
53
+ class GPTVideoFeatureProcessor(BaseProcessor):
54
+ def __init__(self, visual_ft, audio_ft):
55
+
56
+ self.visual_ft = visual_ft
57
+ self.audio_ft = audio_ft
58
+
59
+ self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
60
+ self.tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
61
+
62
+ def padding(self, seq):
63
+ padded_seq = torch.nn.utils.rnn.pad_sequence(seq, batch_first=True, padding_value=1.0)
64
+ return padded_seq
65
+
66
+ def get_attention_mask(self, seq):
67
+ return torch.sum(seq != 1, dim=2) != 0
68
+
69
+ def __call__(self, ft_root, vname):
70
+ all_ft = []
71
+
72
+ for ft_name in self.visual_ft:
73
+ ft_path = os.path.join(ft_root, ft_name, vname)
74
+ all_ft.append(np.load(ft_path + '.npy'))
75
+
76
+ for ft_name in self.audio_ft:
77
+ ft_path = os.path.join(ft_root, ft_name, vname)
78
+ all_ft.append(np.load(ft_path + '.npy'))
79
+
80
+ min_len = min([len(ft) for ft in all_ft])
81
+
82
+ sampled_ft = [ft[:min_len] for ft in all_ft]
83
+ sampled_ft = np.concatenate(sampled_ft, axis=1)
84
+ item = {}
85
+ item['video_fts'] = torch.Tensor(sampled_ft)
86
+
87
+ video_type_token = self.tokenizer.convert_tokens_to_ids('<video>')
88
+ item['token_type_ids'] = torch.Tensor([video_type_token] * len(sampled_ft)).long()
89
+
90
+ return item
91
+
92
+ @classmethod
93
+ def from_config(cls, cfg=None):
94
+ if cfg is None:
95
+ cfg = OmegaConf.create()
96
+
97
+ visual_ft = cfg.get("visual_ft", ["i3d_rgb"])
98
+ audio_ft = cfg.get("audio_ft", ["vggish"])
99
+
100
+ return cls(
101
+ visual_ft=visual_ft,
102
+ audio_ft=audio_ft
103
+ )
104
+
105
+ Another useful processor class is one that processes dialogue data; here we define a ``GPTDialogueProcessor`` class.
106
+ This processor class receives raw annotations and constructs inputs as a concatenation of input sequences (questions, dialogue contexts, and responses) to facilitate application in GPT models.
107
+ Other methods that are specifically defined are ``padding`` (which is used by dataset instances to pad multiple sequence samples) and ``get_attention_mask`` (which creates an attention mask for Transformer attention in GPT models).
108
+
109
+ .. code-block:: python
110
+
111
+ SPECIAL_TOKENS_DICT = {'bos_token': "<bos>", 'eos_token': "<eos>", 'additional_special_tokens': ["<speaker1>", "<speaker2>", "<video>", "<cap>"], 'pad_token': "<pad>"}
112
+ ...
113
+
114
+ @registry.register_processor("gpt_dialogue")
115
+ class GPTDialogueProcessor(BaseProcessor):
116
+ def __init__(self, max_turns=3, use_caption=True):
117
+ self.max_turns = max_turns
118
+ self.use_caption = use_caption
119
+ self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
120
+ self.tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
121
+
122
+ def sample_sequence(self, caption, history, answer):
123
+ bos, eos, speaker1, speaker2, cap = self.tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[:-2])
124
+ instance = {}
125
+ sequence = [caption] + history + [answer]
126
+ sequence = [s + [eos] for s in sequence]
127
+
128
+ instance["input_ids"] = list(chain(*sequence))
129
+ instance["token_type_ids"] = [cap] * len(sequence[0]) + [speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence[1:]) for _ in s]
130
+ instance["labels"] = ([-1]*sum(len(s) for s in sequence[:-1])) + sequence[-1]
131
+
132
+ assert len(instance["input_ids"])==len(instance["token_type_ids"])
133
+ assert len(instance["token_type_ids"])==len(instance["labels"])
134
+
135
+ for k,v in instance.items():
136
+ instance[k] = torch.Tensor(v).long()
137
+
138
+ return instance
139
+
140
+ def padding(self, seq, pad_token=-1):
141
+ if pad_token==-1: pad_token = self.tokenizer.pad_token_id
142
+ padded_seq = torch.nn.utils.rnn.pad_sequence(seq, batch_first=True, padding_value=pad_token)
143
+ return padded_seq
144
+
145
+ def get_attention_mask(self, seq, pad_token=-1):
146
+ if pad_token==-1: pad_token = self.tokenizer.pad_token_id
147
+ return seq != pad_token
148
+
149
+ def __call__(self, ann):
150
+ if self.use_caption:
151
+ caption = ' '.join([ann['caption'], ann['summary']])
152
+ caption = self.tokenizer.encode(caption)
153
+ else:
154
+ caption = []
155
+
156
+ dial_history = []
157
+ for turn in ann['dialog'][-self.max_turns:]:
158
+ dial_history.append(turn['question'])
159
+ dial_history.append(turn['answer'])
160
+ dial_history.append(ann['question'])
161
+ dial_history = [self.tokenizer.encode(t) for t in dial_history]
162
+
163
+ answer = self.tokenizer.encode(ann['answer'])
164
+
165
+ item = self.sample_sequence(caption, dial_history, answer)
166
+
167
+ return item
168
+
169
+ @classmethod
170
+ def from_config(cls, cfg=None):
171
+ if cfg is None:
172
+ cfg = OmegaConf.create()
173
+
174
+ use_caption = cfg.get("use_caption", True)
175
+ max_turns = cfg.get("max_turns", 3)
176
+
177
+ return cls(max_turns=max_turns, use_caption=use_caption)
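+
+ A brief usage sketch (the feature directory, video id, and annotation dict below are hypothetical placeholders) showing how the two processors are typically called by a dataset instance:
+
+ .. code-block:: python
+
+     vis_processor = GPTVideoFeatureProcessor.from_config()   # defaults: i3d_rgb + vggish
+     text_processor = GPTDialogueProcessor.from_config()      # defaults: max_turns=3, use_caption=True
+
+     video_item = vis_processor("avsd/features", "some_video_id")  # loads <ft_name>/<video_id>.npy files
+     dialogue_item = text_processor(annotation)                    # one AVSD annotation dict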
178
+
179
+ Registering New Processors ``lavis.processors.__init__``
180
+ **************************************************************
181
+
182
+ Finally, any new processor must be officially registered as part of the ``lavis.processors`` module.
183
+ For instance, to add processor classes for GPT-based dialogue models, including one for dialogue data ``GPTDialogueProcessor`` and one for video features ``GPTVideoFeatureProcessor``, we can modify the ``__init__.py`` as follows:
184
+
185
+ .. code-block:: python
186
+
187
+ from lavis.processors.gpt_processors import (
188
+ GPTVideoFeatureProcessor,
189
+ GPTDialogueProcessor,
190
+ )
191
+
192
+ __all__ = [
193
+ ...
194
+ # GPT
195
+ "GPTVideoFeatureProcessor",
196
+ "GPTDialogueProcessor"
197
+ ]
198
+
199
+ Assigning Processors
200
+ **************************************************************
201
+ From the above example of processor classes, note that we define a ``from_config`` method for each class.
202
+ This method will process a configuration file and pass specific parameters, e.g. ``max_turns`` and ``visual_ft``, to initialize the processor classes properly.
203
+ To do this, we assign/associate the correct registry names of the processor classes in a configuration file.
204
+ For instance, the following should be specified in a configuration file e.g. ``dialogue_avsd_ft.yaml``:
205
+
206
+ .. code-block:: yaml
207
+
208
+ datasets:
209
+ avsd_dialogue: # name of the dataset builder
210
+ vis_processor:
211
+ train:
212
+ name: "gpt_video_ft" # name of the visual processor for training data
213
+ visual_ft: ["i3d_flow", "i3d_rgb"]
214
+ audio_ft: ["vggish"]
215
+ eval:
216
+ name: "gpt_video_ft" # name of the visual processor for evaluation data
217
+ visual_ft: ["i3d_flow", "i3d_rgb"]
218
+ audio_ft: ["vggish"]
219
+ text_processor:
220
+ train:
221
+ name: "gpt_dialogue" # name of the textual processor for training data
222
+ max_turns: 3
223
+ use_caption: True
224
+ eval:
225
+ name: "gpt_dialogue" # name of the textual processor for evaluation data
226
+ max_turns: 3
227
+ use_caption: True
228
+
229
+ Subsequently, any processes (e.g. training) should load this configuration file to assign the correct processors.
230
+
231
+ .. code-block:: sh
232
+
233
+ python train.py --cfg-path dialogue_avsd_ft.yaml
docs/tutorial.rst ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Tutorials
2
+ ==============================
3
+
4
+ .. toctree::
5
+ :maxdepth: 1
6
+
7
+ tutorial.evaluation
8
+ tutorial.training-example
9
+ tutorial.configs
10
+ tutorial.datasets
11
+ tutorial.processors
12
+ tutorial.models
13
+ tutorial.tasks
docs/tutorial.tasks.rst ADDED
@@ -0,0 +1,184 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Adding Tasks
2
+ ####################################
3
+
4
+ This is a tutorial on adding new machine learning tasks using ``lavis.tasks`` module.
5
+
6
+ The LAVIS library includes a standard task module that centralizes the model training and evaluation procedure of machine learning tasks.
7
+ The ``lavis.tasks`` module is designed such that any new tasks can be added and integrated, catering to any customization in the training and testing procedures.
8
+ In this tutorial, we will replicate the steps to add a new task into LAVIS for the `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_.
9
+
10
+ Base Task ``lavis.tasks.base_task``
11
+ ********************************************************************************
12
+
13
+ Note that any new task definition should inherit the base task class ``BaseTask``:
14
+
15
+ .. code-block:: python
16
+
17
+ import logging
18
+ import os
19
+
20
+ import torch.distributed as dist
21
+ from lavis.common.dist_utils import get_rank, get_world_size, is_main_process
22
+ from lavis.common.logger import MetricLogger, SmoothedValue
23
+ from lavis.common.registry import registry
24
+ from lavis.datasets.data_utils import prepare_sample
25
+
26
+ class BaseTask:
27
+ def __init__(self, **kwargs):
28
+ super().__init__()
29
+
30
+ self.inst_id_key = "instance_id"
31
+
32
+ @classmethod
33
+ def setup_task(cls, **kwargs):
34
+ return cls()
35
+
36
+ def build_model(self, cfg):
37
+ model_config = cfg.model_cfg
38
+
39
+ model_cls = registry.get_model_class(model_config.arch)
40
+ return model_cls.from_config(model_config)
41
+
42
+ def build_datasets(self, cfg):
43
+ """
44
+ Build a dictionary of datasets, keyed by split 'train', 'valid', 'test'.
45
+ Download dataset and annotations automatically if not exist.
46
+
47
+ Args:
48
+ cfg (common.config.Config): configuration object used to build the datasets.
49
+
50
+ Returns:
51
+ dict: Dictionary of torch.utils.data.Dataset objects by split.
52
+ """
53
+
54
+ datasets = dict()
55
+
56
+ datasets_config = cfg.datasets_cfg
57
+
58
+ assert len(datasets_config) > 0, "At least one dataset has to be specified."
59
+
60
+ for name in datasets_config:
61
+ dataset_config = datasets_config[name]
62
+
63
+ builder = registry.get_builder_class(name)(dataset_config)
64
+ dataset = builder.build_datasets()
65
+
66
+ datasets[name] = dataset
67
+
68
+ return datasets
69
+
70
+ def train_step(self, model, samples):
71
+ loss = model(samples)["loss"]
72
+ return loss
73
+
74
+ ...
75
+
76
+ In this base task, we already declare and standardize many common methods such as ``train_step``, ``build_model``, and ``build_datasets``.
77
+ Inheriting this base task class allows us to standardize operations of tasks across all task classes.
78
+ We recommend that users not change the implementation of the base task class, as this will have an impact on all existing task subclasses.
79
+
80
+ Dialogue Task ``lavis.tasks.dialogue``
81
+ ********************************************************************************
82
+
83
+ In this step, we can define a new task class, e.g. under ``lavis.tasks.dialogue``, for video-grounded dialogues.
84
+ For instance, we define a new task class ``DialogueTask`` that inherits the super task class ``BaseTask``.
85
+
86
+ .. code-block:: python
87
+
88
+ import json
89
+ import os
90
+
91
+ from lavis.common.dist_utils import main_process
92
+ from lavis.common.logger import MetricLogger
93
+ from lavis.common.registry import registry
94
+ from lavis.tasks.base_task import BaseTask
95
+ from lavis.datasets.data_utils import prepare_sample
96
+
97
+ import numpy as np
98
+
99
+ @registry.register_task("dialogue")
100
+ class DialogueTask(BaseTask):
101
+ def __init__(self, num_beams, max_len, min_len, evaluate, report_metric=True):
102
+ super().__init__()
103
+
104
+ self.num_beams = num_beams
105
+ self.max_len = max_len
106
+ self.min_len = min_len
107
+ self.evaluate = evaluate
108
+
109
+ self.report_metric = report_metric
110
+
111
+ @classmethod
112
+ def setup_task(cls, cfg):
113
+ run_cfg = cfg.run_cfg
114
+
115
+ num_beams = run_cfg.num_beams
116
+ max_len = run_cfg.max_len
117
+ min_len = run_cfg.min_len
118
+ evaluate = run_cfg.evaluate
119
+
120
+ report_metric = run_cfg.get("report_metric", True)
121
+
122
+ return cls(
123
+ num_beams=num_beams,
124
+ max_len=max_len,
125
+ min_len=min_len,
126
+ evaluate=evaluate,
127
+ report_metric=report_metric,
128
+ )
129
+
130
+ def valid_step(self, model, samples):
131
+ results = []
132
+ loss = model(samples)["loss"].item()
133
+
134
+ return [loss]
135
+ ...
136
+
137
+ Note that for any new task, we advise users to carefully review the functions implemented within ``BaseTask`` and consider which methods should be modified.
138
+ For instance, the base task class already contains a standard implementation of the model training steps that are common across machine learning tasks.
139
+ Two methods we want to emphasize, which should be customized by each task, are ``valid_step`` and ``evaluation`` (see the sketch after this paragraph).
140
+ These operations were not fully implemented in the base task class due to the differences in evaluation procedures among many machine learning tasks.
141
+ Another method that should be considered is the ``setup_task`` method.
142
+ This method will receive configurations that set task-specific parameters to initialize any task instance.
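+
+ As a rough sketch (the aggregation below is an illustrative assumption, not the exact LAVIS implementation), a task-specific evaluation loop built on top of ``valid_step`` could look like:
+
+ .. code-block:: python
+
+     class DialogueTask(BaseTask):
+         ...
+
+         def evaluation(self, model, data_loader, cuda_enabled=True):
+             results = []
+             # iterate over the evaluation split and collect per-sample outputs
+             for samples in data_loader:
+                 samples = prepare_sample(samples, cuda_enabled=cuda_enabled)
+                 results.extend(self.valid_step(model=model, samples=samples))
+             return results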
143
+
144
+ Registering New Task ``lavis.tasks.__init__``
145
+ ********************************************************************************
146
+
147
+ Any new task must be officially registered as part of the ``lavis.tasks`` module. For instance, to add a new task for video-grounded dialogues, we can modify the ``__init__.py`` as follows:
148
+
149
+ .. code-block:: python
150
+
151
+ from lavis.tasks.dialogue import DialogueTask
152
+
153
+ ...
154
+ __all__ = [
155
+ ...
156
+ "DialogueTask"
157
+ ]
158
+
159
+ Assigning Task
160
+ ***************
161
+
162
+ From the above example of task class, note that we define a ``setup_task`` method for each task class.
163
+ This method will process a configuration file and pass specific parameters, e.g. ``num_beams`` (for beam search in generative tasks during inference), to initialize the task classes properly.
164
+ To assign and associate any task, we need to specify the correct registry of task classes in a configuration file.
165
+ For instance, the following should be specified in a configuration file e.g. ``dialogue_avsd_ft.yaml``:
166
+
167
+ .. code-block:: yaml
168
+
169
+ run:
170
+ task: dialogue # name of the task
171
+
172
+ # optimizer
173
+ ...
174
+
175
+ max_len: 20
176
+ min_len: 5
177
+ num_beams: 3
178
+ ...
179
+
180
+ Subsequently, any processes (e.g. training) should load this configuration file to assign the correct task.
181
+
182
+ .. code-block:: sh
183
+
184
+ python train.py --cfg-path dialogue_avsd_ft.yaml
docs/tutorial.training-example.rst ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Example on Finetuning BLIP on COCO-Captioning
2
+ ################################################
3
+
4
+ To finetune the BLIP model on the COCO captioning dataset, first refer to :ref:`prep coco` to prepare the dataset if you have not done so.
5
+
6
+ To finetune the model, we have prepared a run script for you, which can be run as follows:
7
+
8
+ .. code-block:: bash
9
+
10
+ bash run_scripts/lavis/blip/train/train_caption_coco_large.sh
11
+
12
+ This will finetune the pre-trained BLIP large model into a new model that can be used for captioning.
13
+
14
+ Deep Dive
15
+ **********
16
+ Now let's take a closer look at the script and see what it does.
17
+
18
+ .. code-block:: bash
19
+
20
+ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml
21
+
22
+ As can be seen, the script simply calls the :code:`train.py` with PyTorch distributed training enabled.
23
+ The :code:`--cfg-path` argument specifies the **runtime config** file to use. The config file is a YAML file that specifies the training parameters, shown as follows:
24
+
25
+ .. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
26
+ :language: yaml
27
+ :linenos:
28
+
29
+ The runtime config file is divided into 3 sections:
30
+ - :code:`model`: specifies the model architecture and type to use.
31
+ - :code:`data`: specifies the dataset to use.
32
+ - :code:`run`: specifies the runner arguments, such as tasks, optimizer, learning rate scheduler, etc.
33
+
34
+ We describe each section in detail below.
35
+
36
+ Model configurations
37
+ =====================
38
+
39
+ .. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
40
+ :language: yaml
41
+ :linenos:
42
+ :lines: 6-10
43
+
44
+ The :code:`arch` argument specifies the model architecture to use. In this case, we use the :code:`blip_caption` architecture.
45
+ You can find available architectures by inspecting the :code:`model_zoo`.
46
+ Once the architecture is specified, the runner will look for the model class registered with the name and try to instantiate a model instance.
47
+ In this case :code:`BlipCaption` is the model registered with the name :code:`blip_caption`.
48
+
49
+ The registry maintains a mapping from the name string to the model class.
50
+ This allows the runner to find the model class dynamically based on the name string from the config file.
51
+ The following segment in :code:`lavis/models/blip_models/blip_caption.py` shows how :code:`BlipCaption` is registered with the name string :code:`blip_caption`:
52
+
53
+ .. literalinclude:: ../lavis/models/blip_models/blip_caption.py
54
+ :language: python
55
+ :linenos:
56
+ :lines: 20-38
57
+
58
+ One same model architecture may be pre-trained or finetuned on different datasets or have different model configurations.
59
+ For example, :code:`BlipCaption` have:
60
+
61
+ - :code:`base_coco`: pre-trained base BLIP model adapted for COCO captioning finetuning.
62
+
63
+ - :code:`large_coco`: pre-trained large BLIP model adapted for COCO captioning finetuning.
64
+
65
+ Therefore, we also need to specify :code:`model_type`. Here we use :code:`large_coco`.
66
+ And we set :code:`load_finetuned` to :code:`False` to indicate that we are finetuning the model from the pre-trained weights.
67
+ If :code:`load_finetuned` set to :code:`True` as by default, the model will load finetuned weights on coco captioning.
68
+
69
+ Given the model architecture and type, the library will then look for the default model config for :code:`large_coco` in :code:`lavis/models/blip_models/blip_caption.py`.
70
+ As can be seen in the above code snippet, the corresponding config path is stored in :code:`BlipCaption.PRETRAINED_MODEL_CONFIG_DICT`.
71
+ Then the library will load :code:`lavis/configs/models/blip_caption_large_coco.yaml` as the configuration to build the model.
72
+
73
+ *Priority of Configs*: Note that the priority of the run config is higher than the default model config, meaning that arguments in the run config will override the default model config.
74
+ For example, in the default model config, :code:`load_finetuned` is set to :code:`True` by default, while in the run config, we set it to :code:`False` and finetuning from the pre-trained weights only.
75
+
76
+
77
+ Dataset configurations
78
+ =========================
79
+
80
+ The second section of the config file specifies the dataset(s) to use.
81
+
82
+ .. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
83
+ :language: yaml
84
+ :linenos:
85
+ :lines: 12-24
86
+
87
+ We associate each dataset with a :code:`vis_processor` and a :code:`text_processor`, responsible for processing the visual and textual input respectively.
88
+ Here we again use the registry mechanism to dynamically load the processor class based on the name string.
89
+ For example, :code:`blip_image_train` is the name string for the :code:`BlipImageTrainProcessor` class, which is registered in :code:`lavis/processors/blip_processors.py`.
90
+
91
+ Similarly, the dataset name string is also registered in the registry, pointing to a dataset builder :code:`COCOCapBuilder` class.
92
+ By default, the builder will load the default dataset configuration as in :code:`DATASET_CONFIG_DICT`. You may also add new dataset types by adding new entries to the dictionary.
93
+
94
+ The dataset configuration used here is:
95
+
96
+ .. literalinclude:: ../lavis/configs/datasets/coco/defaults_cap.yaml
97
+ :language: yaml
98
+ :linenos:
99
+ :lines: 6-28
100
+
101
+ In this configuration file, we specify the dataset name and mainly its building information.
102
+ The build information is divided into two parts: :code:`annotation` and :code:`images`. The annotation files will be automatically downloaded upon loading the dataset for the first time.
103
+ The :code:`images` part specifies the image root directory. This is a relative path to the cache directory, which is :code:`cache` by default. If you have a local copy of the dataset, you can specify the path to the local copy by
104
+ overwriting the :code:`images` part in the runtime config file. For example, you may alter the run config as below to use your local dataset copy:
105
+
106
+ .. code:: yaml
107
+
108
+ datasets:
109
+ coco_caption: # name of the dataset builder
110
+ vis_processor:
111
+ train:
112
+ name: "blip_image_train"
113
+ eval:
114
+ name: "blip_image_eval"
115
+ text_processor:
116
+ train:
117
+ name: "blip_caption"
118
+ prompt: "a picture of "
119
+ eval:
120
+ name: "blip_caption"
121
+ images:
122
+ YOUR_LOCAL_IMAGE_ROOT_DIR
123
+
124
+ LAVIS supports using multiple datasets for training. See an example in :code:`lavis/projects/blip/train/pretrain_14m.yaml`.
125
+
126
+
127
+ Runner configurations
128
+ =========================
129
+ The last section of the config file specifies the arguments for the runner, shown below:
130
+
131
+ .. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
132
+ :language: yaml
133
+ :linenos:
134
+ :lines: 26-56
135
+
136
+ Here we specify runner-related arguments, including
137
+ - task-specific arguments, such as :code:`task`, :code:`max_len`, :code:`min_len`, etc.
138
+ - learning rate schedulers, optimizer;
139
+ - distributed training settings;
140
+ - logging and checkpointing settings.
141
+
142
+ Available Configurations
143
+ #########################
144
+
145
+ See :ref:`config` for the full list of available configurations and their descriptions.
evaluate.py ADDED
@@ -0,0 +1,92 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Copyright (c) 2022, salesforce.com, inc.
3
+ All rights reserved.
4
+ SPDX-License-Identifier: BSD-3-Clause
5
+ For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import argparse
9
+ import random
10
+
11
+ import numpy as np
12
+ import torch
13
+ import torch.backends.cudnn as cudnn
14
+
15
+ import lavis.tasks as tasks
16
+ from lavis.common.config import Config
17
+ from lavis.common.dist_utils import get_rank, init_distributed_mode
18
+ from lavis.common.logger import setup_logger
19
+ from lavis.common.optims import (
20
+ LinearWarmupCosineLRScheduler,
21
+ LinearWarmupStepLRScheduler,
22
+ )
23
+ from lavis.common.utils import now
24
+
25
+ # imports modules for registration
26
+ from lavis.datasets.builders import *
27
+ from lavis.models import *
28
+ from lavis.processors import *
29
+ from lavis.runners.runner_base import RunnerBase
30
+ from lavis.tasks import *
31
+
32
+
33
+ def parse_args():
34
+ parser = argparse.ArgumentParser(description="Training")
35
+
36
+ parser.add_argument("--cfg-path", required=True, help="path to configuration file.")
37
+ parser.add_argument(
38
+ "--options",
39
+ nargs="+",
40
+ help="override some settings in the used config, the key-value pair "
41
+ "in xxx=yyy format will be merged into config file (deprecate), "
42
+ "change to --cfg-options instead.",
43
+ )
44
+
45
+ args = parser.parse_args()
46
+ # if 'LOCAL_RANK' not in os.environ:
47
+ # os.environ['LOCAL_RANK'] = str(args.local_rank)
48
+
49
+ return args
50
+
51
+
52
+ def setup_seeds(config):
53
+ seed = config.run_cfg.seed + get_rank()
54
+
55
+ random.seed(seed)
56
+ np.random.seed(seed)
57
+ torch.manual_seed(seed)
58
+
59
+ cudnn.benchmark = False
60
+ cudnn.deterministic = True
61
+
62
+
63
+ def main():
64
+ # allow auto-dl to complete on the main process without timeout when using the NCCL backend.
65
+ # os.environ["NCCL_BLOCKING_WAIT"] = "1"
66
+
67
+ # set before init_distributed_mode() to ensure the same job_id shared across all ranks.
68
+ job_id = now()
69
+
70
+ cfg = Config(parse_args())
71
+
72
+ init_distributed_mode(cfg.run_cfg)
73
+
74
+ setup_seeds(cfg)
75
+
76
+ # set after init_distributed_mode() to only log on master.
77
+ setup_logger()
78
+
79
+ cfg.pretty_print()
80
+
81
+ task = tasks.setup_task(cfg)
82
+ datasets = task.build_datasets(cfg)
83
+ model = task.build_model(cfg)
84
+
85
+ runner = RunnerBase(
86
+ cfg=cfg, job_id=job_id, task=task, model=model, datasets=datasets
87
+ )
88
+ runner.evaluate(skip_reload=True)
89
+
90
+
91
+ if __name__ == "__main__":
92
+ main()
lavis/.DS_Store ADDED
Binary file (10.2 kB). View file
 
lavis/__init__.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Copyright (c) 2022, salesforce.com, inc.
3
+ All rights reserved.
4
+ SPDX-License-Identifier: BSD-3-Clause
5
+ For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import os
9
+ import sys
10
+
11
+ from omegaconf import OmegaConf
12
+
13
+ from lavis.common.registry import registry
14
+
15
+ from lavis.datasets.builders import *
16
+ from lavis.models import *
17
+ from lavis.processors import *
18
+ from lavis.tasks import *
19
+
20
+
21
+ root_dir = os.path.dirname(os.path.abspath(__file__))
22
+ default_cfg = OmegaConf.load(os.path.join(root_dir, "configs/default.yaml"))
23
+
24
+ registry.register_path("library_root", root_dir)
25
+ repo_root = os.path.join(root_dir, "..")
26
+ registry.register_path("repo_root", repo_root)
27
+ cache_root = os.path.join(repo_root, default_cfg.env.cache_root)
28
+ registry.register_path("cache_root", cache_root)
29
+
30
+ registry.register("MAX_INT", sys.maxsize)
31
+ registry.register("SPLIT_NAMES", ["train", "val", "test"])
lavis/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (988 Bytes). View file
 
lavis/common/.DS_Store ADDED
Binary file (6.15 kB). View file
 
lavis/common/__pycache__/config.cpython-38.pyc ADDED
Binary file (12.1 kB). View file
 
lavis/common/__pycache__/dist_utils.cpython-38.pyc ADDED
Binary file (3.76 kB). View file