import streamlit as st
import streamlit.components.v1 as components


def run_model_arch() -> None:
    """
    Displays the model architecture along with the accompanying abstract and design details for the
    Knowledge-Based Visual Question Answering (KB-VQA) model.

    This function reads an HTML file containing the model architecture and renders it in a Streamlit application.
    It also provides detailed descriptions of the research abstract and the design of the KB-VQA model.

    Returns:
        None
    """
    with open("Files/Model Arch.html", 'r', encoding='utf-8') as f:
        model_arch_html = f.read()

    col1, col2 = st.columns(2)
    with col1:
        st.markdown("#### Model Architecture")
        components.html(model_arch_html, height=1600)
    with col2:
        st.markdown("#### Abstract")
        st.markdown("""
<div style="text-align: justify;">

Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge
the gap between visual perception and linguistic interpretation, a foundational challenge in artificial
intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the
pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
The research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence
of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
complex tasks, thereby enhancing KB-VQA systems.

An examination of existing KB-VQA methodologies led to a refined approach that converts visual content into
the linguistic domain, creating detailed captions and object enumerations. This process leverages the implicit
knowledge and inferential capabilities of PT-LLMs. The research refines the fine-tuning of PT-LLMs by
integrating specialized tokens, enhancing the models’ ability to interpret visual contexts. It also reviews
current image representation techniques and knowledge sources, advocating for the use of implicit knowledge
in PT-LLMs, especially for tasks that do not require specialized expertise.

Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
performance, with a particular focus on the importance of image descriptions generated during the captioning
phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
with practical application needs.

The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
respectively. These results demonstrate that the model effectively applies reasoning over the visual context
and successfully retrieves the necessary knowledge to answer visual questions.
</div>
        """, unsafe_allow_html=True)

        st.markdown("<br>" * 2, unsafe_allow_html=True)
        st.markdown("#### Design")
        st.markdown("""
<div style="text-align: justify;">

As illustrated in the architecture diagram, the model operates through a sequential pipeline, beginning with the
Image to Language Transformation Module. In this module, the image is processed simultaneously by frozen image
captioning and object detection models, aiming to comprehensively capture the visual context and cues. These
models, selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement
with more advanced models as new technologies develop, thus ensuring the module remains at the forefront of
technological advancement.

Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects,
along with their bounding boxes and confidence levels, merging these elements with the question at hand using a
meticulously crafted prompting template. The pipeline ends with a fine-tuned Pre-Trained Large Language Model
(PT-LLM), which is responsible for performing the reasoning and deriving the knowledge required to formulate an
informed response to the question.
</div>
        """, unsafe_allow_html=True)
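

# Illustrative sketch (not executed by the app) of the sequential pipeline described in the Design section above:
# the image is first translated into language (a caption plus detected objects), the Prompt Engineering Module
# merges that visual context with the question through a prompting template, and a fine-tuned PT-LLM produces the
# answer. The callables `caption_image`, `detect_objects`, `build_prompt`, and `generate_answer` are hypothetical
# placeholders, not the project's actual components.
def kb_vqa_pipeline_sketch(image, question, caption_image, detect_objects, build_prompt, generate_answer):
    """Hypothetical end-to-end KB-VQA flow mirroring the three stages of the architecture."""
    caption = caption_image(image)                      # frozen captioning model: describes the scene in text
    objects = detect_objects(image)                     # frozen detector: labels, bounding boxes, confidences
    prompt = build_prompt(caption, objects, question)   # Prompt Engineering Module: crafted prompting template
    return generate_answer(prompt)                      # fine-tuned PT-LLM: reasons and derives the knowledge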
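

# The Abstract above reports VQA and Exact Match (EM) scores. For reference, the sketch below shows the standard
# soft VQA accuracy commonly used with OK-VQA-style annotations (min(#matching annotators / 3, 1)) and a plain
# exact-match check. Whether the thesis computes its syntactic scores in exactly this way is an assumption, and
# the lowercasing/whitespace normalization here is a simplified stand-in for full answer normalization.
def vqa_soft_accuracy(prediction, reference_answers):
    """Soft VQA accuracy: credit grows with how many annotators gave the predicted answer, capped at 1.0."""
    normalized = prediction.strip().lower()
    matches = sum(1 for ref in reference_answers if ref.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


def exact_match(prediction, reference_answers):
    """Exact Match (EM): 1.0 if the prediction matches any reference answer after simple normalization."""
    normalized = prediction.strip().lower()
    return 1.0 if any(normalized == ref.strip().lower() for ref in reference_answers) else 0.0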