{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Project log" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Monday 6.26.2023\n", "\n", "- Created log to record important progress\n", "\n", "Restructuring project\n", "- Remove extraneous files from \n", "the data folder. The relevant 'data' for this project consists of \n", " - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n", " - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Thursday 6.29.2023\n", "\n", "- Created data_storage.py\n", "- This houses all functions and classes related to data storage\n", "- Created a class ArXivData to store arxiv metadata\n", " - Designed to be passed to embedding class to vectorize\n", " - The embedding class should call the cleaning methods under the hood.\n", "- Can load raw metadata from a query. \n", " - Only stores id, title, abstract, categories\n", " - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n", "\n", "#### Todo: Write ArXivData methods\n", " 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n", " 1. `load_from_file`: load arxiv data from a parquet file.\n", " 1. `Save_to_file`: to store data as a parquet\n", " 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n", " - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n", " actually load the webpages.\n", " \n", "\n", "#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n", " - Make one column for arxiv tags, one column for msc tags\n", " - store msc tags as their english names in a list\n", " - store arxiv cats one-hot-encoded as a separate dataframe attribute\n", "\n", "#### Idea for the pipeline\n", " 1. Load data in the ArXivData class whether from file or from query\n", " 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n", " - What exactly is needed?\n", " 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n", "\n", "#### EDA tools needed\n", " 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n", " 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n", " - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 07/02/2023\n", "\n", "-Read medium article about using config files to set up highly modular data analysis pipelines.\n", "-Interested in setting this up here\n", "\n", "#### Outline of pipeline architecture\n", "\n", "1. Load dataset \n", " - option to load from file or from querying arxiv directly\n", " - stores raw title and abstract, id #s, msc_tags as english, and categories (OHE) as a separate dataframe\n", "2. Load embeddings\n", " - option to load from file or generate using sentence transformers directly.\n", " - any data cleaning procedures will occur in the pipeline here\n", "3. 
 { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
   "## Sunday 7.2.2023\n",
   "\n",
   "- Read a Medium article about using config files to set up highly modular data analysis pipelines\n",
   "- Interested in setting this up here (see the config sketch below)\n",
   "\n",
   "#### Outline of pipeline architecture\n",
   "\n",
   "1. Load the dataset\n",
   "    - option to load from file or from querying arXiv directly\n",
   "    - stores raw title and abstract, id numbers, MSC tags in English, and categories (one-hot encoded) as a separate dataframe\n",
   "2. Load embeddings\n",
   "    - option to load from file or generate using sentence transformers directly\n",
   "    - any data cleaning procedures will occur here in the pipeline\n",
   "3. Plug into topic model(s)" ] },
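 { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
   "A minimal sketch of the config-driven pipeline idea, assuming a YAML config. The config keys and the three stage functions (`load_data`, `load_embeddings`, `fit_topic_model`) are hypothetical placeholders." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
   "# Sketch of a config-driven pipeline. The config keys and the three stage\n",
   "# functions are hypothetical placeholders, not an existing API.\n",
   "import yaml\n",
   "\n",
   "CONFIG = yaml.safe_load('''\n",
   "data:\n",
   "  source: file            # 'file' or 'query'\n",
   "  path: data/APSP.parquet\n",
   "embeddings:\n",
   "  source: generate        # 'file' or 'generate'\n",
   "  model: all-MiniLM-L6-v2\n",
   "topic_model:\n",
   "  kind: bertopic          # 'bertopic', 'lsa', 'lda', or 'pca'\n",
   "''')\n",
   "\n",
   "def run_pipeline(config):\n",
   "    data = load_data(config['data'])                        # step 1\n",
   "    vectors = load_embeddings(data, config['embeddings'])   # step 2\n",
   "    return fit_topic_model(vectors, config['topic_model'])  # step 3\n" ] }
 ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }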