{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Project log" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Monday 6.26.2023\n", "\n", "- Created log to record important progress\n", "\n", "Restructuring project\n", "- Remove extraneous files from \n", "the data folder. The relevant 'data' for this project consists of \n", " - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n", " - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Thursday 6.29.2023\n", "\n", "- Created data_storage.py\n", "- This houses all functions and classes related to data storage\n", "- Created a class ArXivData to store arxiv metadata\n", " - Designed to be passed to embedding class to vectorize\n", " - The embedding class should call the cleaning methods under the hood.\n", "- Can load raw metadata from a query. \n", " - Only stores id, title, abstract, categories\n", " - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n", "\n", "#### Todo: Write ArXivData methods\n", " 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n", " 1. `load_from_file`: load arxiv data from a parquet file.\n", " 1. `Save_to_file`: to store data as a parquet\n", " 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n", " - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n", " actually load the webpages.\n", " \n", "\n", "#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n", " - Make one column for arxiv tags, one column for msc tags\n", " - store msc tags as their english names in a list\n", " - store arxiv cats one-hot-encoded as a separate dataframe attribute\n", "\n", "#### Idea for the pipeline\n", " 1. Load data in the ArXivData class whether from file or from query\n", " 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n", " - What exactly is needed?\n", " 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n", "\n", "#### EDA tools needed\n", " 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n", " 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n", " - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 07/02/2023\n", "\n", "-Read medium article about using config files to set up highly modular data analysis pipelines.\n", "-Interested in setting this up here\n", "\n", "#### Outline of pipeline architecture\n", "\n", "1. Load dataset \n", " - option to load from file or from querying arxiv directly\n", " - stores raw title and abstract, id #s, msc_tags as english, and categories (OHE) as a separate dataframe\n", "2. Load embeddings\n", " - option to load from file or generate using sentence transformers directly.\n", " - any data cleaning procedures will occur in the pipeline here\n", "3. Plug into topic model(s)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }