Chem-210-Autograder / pages /3_About.py

	import streamlit as st

	paragraph_1 = """
	The Autograder's underlying software design utilizes a RAG (Retrieval-Augmented Generation) pipeline.

	The context items consist of 3,717 2D molecule images sourced from PubChem and saved into a
	local directory (storing them in a relational database is something to explore). PubChem provides a downloadable
	CSV file that can easily be converted to a dataframe, which gives each molecule's cid and compound name.
	The embeddings are generated with BERT and stored in the vector database Milvus, where the molecule's
	cid serves as the index.

	The retrieval system returns the cid of the molecule with the highest semantic similarity score. The intuition behind this
	design choice is that molecules can have multiple names, so a simple keyword search would not be as versatile. The
	system has not yet been tested extensively, but it sometimes produces accurate results when given a compound
	synonym.

	The underlying VLM is Google/PaliGemma-3b and is currently not fine-tuned (hence the inaccurate results). The datasets that
	we have access to are too noisy, and we simply do not have enough images to fine-tune a "lightweight" VLM such as
	PaliGemma.

	"""
	paragraph_2 = """
	The Custom Image Upload Autograder's underlying software design employs a straightforward prompt engineering approach.

	The core model used is PaliGemma. This mode was built as a contingency measure to address potential reliability issues
	that may arise with the retrieval system in the future.
	"""
	paragraph_3 = """
	The next step would be to fine-tune our underlying VLM. However, before this is feasible, we would need to gather
	a large corpus of images that is clean and well organized.

	To improve the retrieval system, many more molecules would need to be uploaded and stored in Milvus. Mapping
	these to a relational database would also be necessary, as storing them in a local directory at this scale
	is infeasible.

	The Streamlit frontend could also be improved to make it more user-friendly and functional. This could include a
	better layout, more interactive features, and smooth integration with the backend for efficient data retrieval
	and processing.
	"""

	st.title("About")
	st.markdown("""DISCLAIMER!
	The underlying VLM is not fine-tuned, so the quality of the outputs is unreliable.
	""")
	st.header("About the Autograder")
	st.markdown(paragraph_1)
	st.header("About Manual Upload")
	st.markdown(paragraph_2)
	st.header("Moving Onwards")
	st.markdown(paragraph_3)
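
The retrieval step described in the About text (return the cid whose stored embedding scores highest against the query) can be sketched in plain Python. This is only an illustration, not the Space's actual code: the real pipeline uses BERT embeddings and Milvus, whereas here the vectors are tiny hand-made stand-ins, cosine similarity is an assumed metric, and the names `cosine_similarity`, `retrieve_cid`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_cid(query_vector, index):
    """Return (cid, score) for the stored vector most similar to the query."""
    if not index:
        raise ValueError("index is empty")
    scores = {cid: cosine_similarity(query_vector, vec) for cid, vec in index.items()}
    best_cid = max(scores, key=scores.get)
    return best_cid, scores[best_cid]


# Hypothetical index keyed by PubChem cid. The 3-dimensional vectors are
# invented for illustration; real BERT embeddings have hundreds of dimensions.
toy_index = {
    2244: [0.9, 0.1, 0.0],  # aspirin
    2519: [0.1, 0.9, 0.1],  # caffeine
    702: [0.0, 0.2, 0.9],   # ethanol
}

# Stand-in for the embedding of a query such as a compound synonym.
query = [0.8, 0.2, 0.1]
cid, score = retrieve_cid(query, toy_index)
print(cid, round(score, 3))
```

Because matching happens in embedding space rather than on exact strings, a synonym whose embedding lands near the stored vector can still resolve to the right cid, which is the motivation the About text gives for preferring this over keyword search.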