{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Project log" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Monday 6.26.2023\n", "\n", "- Created log to record important progress\n", "\n", "Restructuring project\n", "- Remove extraneous files from \n", "the data folder. The relevant 'data' for this project consists of \n", " - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n", " - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Thursday 6.29.2023\n", "\n", "- Created data_storage.py\n", "- This houses all functions and classes related to data storage\n", "- Created a class ArXivData to store arxiv metadata\n", " - Designed to be passed to embedding class to vectorize\n", " - The embedding class should call the cleaning methods under the hood.\n", "- Can load raw metadata from a query. \n", " - Only stores id, title, abstract, categories\n", " - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n", "\n", "#### Todo: Write ArXivData methods\n", " 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n", " 1. `load_from_file`: load arxiv data from a parquet file.\n", " 1. `Save_to_file`: to store data as a parquet\n", " 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n", " - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n", " actually load the webpages.\n", " \n", "\n", "#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n", " - Make one column for arxiv tags, one column for msc tags\n", " - store msc tags as their english names in a list\n", " - store arxiv cats one-hot-encoded as a separate dataframe attribute\n", "\n", "#### Idea for the pipeline\n", " 1. Load data in the ArXivData class whether from file or from query\n", " 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n", " - What exactly is needed?\n", " 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n", "\n", "#### EDA tools needed\n", " 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n", " 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n", " - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 07/02/2023\n", "\n", "-Read medium article about using config files to set up highly modular data analysis pipelines.\n", "-Interested in setting this up here\n", "\n", "#### Outline of pipeline architecture\n", "\n", "1. Load dataset \n", " - option to load from file or from querying arxiv directly\n", " - stores raw title and abstract, id #s, msc_tags as english, and categories (OHE) as a separate dataframe\n", "2. Load embeddings\n", " - option to load from file or generate using sentence transformers directly.\n", " - any data cleaning procedures will occur in the pipeline here\n", "3. Plug into topic model(s)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 07/03/2023\n", "\n", "#### Modified data_storage.py\n", "\n", "Done:\n", "1. Wrote `load_from_feather` and `save_to_feather`\n", "1. Pulled and stored metadata for 40k papers in pde and spectral theory called 'APSP_40.feather'\n", "\n", "To Do:\n", "1. Write comments for the methods in the arXivData class.\n", "1. Make sure the class functionality works correctly when a query returns no results.\n", "\n", "\n", "#### Miscellaneous\n", "1. Install `tabbed out` extension for exiting delimiter environments with tab.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 07/04/2023\n", "\n", "#### Create embedding module, `embedding.py`\n", "\n", "Functions\n", "1. Take in an arXivData class object\n", "1. generate embeddings for the clean text\n", "1. compute the most semantically similar msc tags\n", "1. output the np array containing the embeddings\n", "1. output the np array in which row i is \n", " - the embedding vector of the most similar msc tag, if there are msc tags\n", " - NAN if there are no msc tags.\n", "\n", "\n", "Stopping in the middle of step 3, which is the function `rank_msc_tags` in embedding.py\n", "\n", "need to add the dataclass decorator from the data storage module to my arXivData class.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 07/14/2023\n", "\n", "#### Modified embedding module\n", "1. Added functions to generate and load the embeddings of msc and arxiv subject tags.\n", "2. Saved these embedding in the data directory as parquet files, with the index of a row equal to the word that row vector encodes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }