import streamlit as st
paragraph_1 = """
The Autograder is built on a RAG (Retrieval-Augmented Generation) pipeline.
The context items are 3,717 2D molecule images sourced from PubChem and saved in a
local directory (storing them in a relational database is something to explore). PubChem provides a downloadable
CSV file that can easily be converted to a dataframe, which gives each molecule's cid and compound name.
The embeddings are generated with BERT and stored in the vector database Milvus, where each molecule's
cid serves as the index.

The retrieval system returns the cid of the molecule with the highest semantic similarity score. The intuition behind this
design choice is that molecules can have multiple names, so a simple keyword search would not be as versatile. The system
has not yet been tested extensively, but it sometimes returns accurate results when given a compound
synonym.

The underlying VLM is Google/PaliGemma-3b and is currently not fine-tuned (hence the inaccurate results). The datasets
we have access to are too noisy, and we simply do not have enough images to fine-tune even a "lightweight" VLM such as
PaliGemma.
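
As a rough illustration of the retrieval step described above, the sketch below returns the cid whose stored vector has the highest cosine similarity to a query. It is not the Autograder's code: a toy character-trigram embedding stands in for BERT (so it captures spelling overlap rather than meaning), a plain dict stands in for Milvus, and the names `embed`, `cosine`, `top_cid`, and the three-entry index are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a BERT embedding: sparse character-trigram counts.
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_cid(query, index):
    # index maps cid -> embedding; return the cid with the highest score.
    q = embed(query)
    return max(index, key=lambda cid: cosine(q, index[cid]))


# Hypothetical in-memory index keyed by cid (illustrative entries only).
names = {
    2244: "aspirin acetylsalicylic acid",
    702: "ethanol ethyl alcohol",
    962: "water",
}
index = {cid: embed(name) for cid, name in names.items()}
```

With this index, `top_cid("acetylsalicylic acid", index)` selects the aspirin entry. A real deployment would swap `embed` for the BERT encoder and the dict lookup for a Milvus similarity search.
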
"""
paragraph_2 = """
The Custom Image Upload Autograder uses a straightforward prompt engineering approach.
The core model is PaliGemma. This feature was developed as a contingency in case the retrieval system
proves unreliable in the future.
"""
paragraph_3 = """
The next step would be to fine-tune our underlying VLM. Before this is feasible, however, we would need to gather
a large corpus of clean, well-organized images.

To improve the retrieval system, many more molecules would need to be uploaded and stored in Milvus. Mapping
these to a relational database would also be necessary, as storing them in a local directory at that scale
is infeasible.

The Streamlit frontend could also be improved to make it more user-friendly and functional. This could include a
better layout, more interactive features, and smoother integration with the backend for efficient data retrieval
and processing.
"""
st.title("About")
st.markdown("""DISCLAIMER!
The underlying VLM is not fine-tuned, so the outputs are unreliable.
""")
st.header("About the Autograder")
st.markdown(paragraph_1)
st.header("About Manual Upload")
st.markdown(paragraph_2)
st.header("Moving Onwards")
st.markdown(paragraph_3)