import streamlit as st
import streamlit.components.v1 as components


def run_model_arch() -> None:
    """
    Displays the model architecture and accompanying abstract and design details for the Knowledge-Based Visual Question
    Answering (KB-VQA) model.
    This function reads an HTML file containing the model architecture and renders it in a Streamlit application.
    It also provides detailed descriptions of the research, abstract, and design of the KB-VQA model.
    
    Returns:
        None
    """

    # Read the model architecture HTML file
    with open("Files/Model Arch.html", 'r', encoding='utf-8') as f:
        model_arch_html = f.read()

    col1, col2 = st.columns(2)
    with col1:
        st.markdown("#### Model Architecture")
        components.html(model_arch_html, height=1600)
    with col2:
        st.markdown("#### Abstract")
        st.markdown("""
        <div style="text-align: justify;">
        
        Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge 
        the gap between visual perception and linguistic interpretation, a foundational challenge in artificial 
        intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the 
        pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
        This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence 
        of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have 
        transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle 
        complex tasks, thereby enhancing KB-VQA systems.
        
        An examination of existing KB-VQA methodologies led to a refined approach that converts visual content into the 
        linguistic domain, creating detailed captions and object 
        enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The 
        research refines the fine-tuning of PT-LLMs by integrating specialized tokens, enhancing the models’ ability 
        to interpret visual contexts. The research also reviews current image representation techniques and knowledge 
        sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not 
        require specialized expertise.
        
        Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model 
        performance, with a particular focus on the importance of image descriptions generated during the captioning 
        phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus, 
        and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment 
        with practical application needs.
        
        The evaluation results underscore the developed model’s competent and competitive performance. It achieves a 
        VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further, 
        semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%, 
        respectively. These results demonstrate that the model effectively applies reasoning over the visual context 
        and successfully retrieves the necessary knowledge to answer visual questions.
        </div>
        """, unsafe_allow_html=True)

        st.markdown("<br>" * 2, unsafe_allow_html=True)
        st.markdown("#### Design")
        st.markdown("""
        <div style="text-align: justify;">
        
        As illustrated in the architecture diagram, the model operates through a sequential pipeline, beginning with the 
        Image to Language Transformation Module. In this module, the image is processed simultaneously by frozen image 
        captioning and object detection models, aiming to comprehensively capture the visual context and cues. These models, 
        selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more 
        advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological 
        advancement.
        
        Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects, 
        along with their bounding boxes and confidence levels, merging these elements with the question at hand using a 
        meticulously crafted prompting template. The pipeline concludes with a fine-tuned Pre-Trained Large Language Model 
        (PT-LLM), which is responsible for performing reasoning and deriving the knowledge required to formulate an 
        informed response to the question.
        </div>
        """, unsafe_allow_html=True)