Spaces:
Sleeping
Sleeping
import streamlit as st | |
paragraph_1 = """ | |
The Autograder's underlying software design utilizes a RAG (Retrieval-Augmented Generation) pipeline. | |
The context items consist of 3,717 2-d molecule images sourced from PubChem and saved into a | |
local directory(storing on a relational database is something to explore). Pubchem comes with a downloadable | |
csv file that can easily be converted to a dataframe, which gives information on the molecule's cid and compound name. | |
The embeddings are decoded through BERT and these embeddings are stored the vector database Milvus, where the molecule's | |
cid served as the index. | |
The retrieval system returns the cid of the molecule with the highest semantic score. The intuition behind this decision | |
choice was that molecules can have multiple names, so a simple keyword search would not be as versatile. Currently, it | |
has not gone through much testing but is able to produce accurate results sometimes when fed in a molecule compound | |
synonym. | |
The underlying VLM is Google/PaliGemma-3b and is currently not fine-tuned(hence the inaccurate results). The datasets that | |
we have access to are too noisy and we simply do not have enough images to fine-tune a "lightweight" VLM such as | |
PaliGemma. | |
""" | |
paragraph_2 = """ | |
The Custom Image Upload Autograder's underlying software design employs a straightforward prompt engineering approach. | |
The core model used is PaliGemma. This model was developed as a contingency measure to address potential reliability issues | |
that may arise with the retrieval system in the future. | |
""" | |
paragraph_3 = """ | |
The next step would be to fine-tune our underlying VLM. However, before this is feasible, we would need to gather | |
a large corpus of images that is clean and organized well. | |
To improve the retrieval system, there would need to be much more molecules uploaded and stored on Milvus. Mapping | |
these to a relational database would also be necessary is storing these on a local directory at this large of a scale | |
is unfeasible. | |
The Streamlit frontend could also be improved to make it more user-friendly and functional. This could include a | |
better layout, more interactive features, and smooth integration with the backend for efficient data retrieval | |
and processing. | |
""" | |
st.title("About") | |
st.markdown("""DISCLAIMER! | |
The underlying VLM is not finetuned, so the quality of the outputs are unreliable. | |
""") | |
st.header("About the Autograder") | |
st.markdown(paragraph_1) | |
st.header("About Manual Upload") | |
st.markdown(paragraph_2) | |
st.header("Moving Onwards") | |
st.markdown(paragraph_3) | |