{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project log"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Monday 6.26.2023\n",
    "\n",
    "- Created log to record important progress\n",
    "\n",
    "Restructuring the project:\n",
    "- Remove extraneous files from the data folder. The relevant data for this project consists of\n",
    "    - The arXiv metadata the model is trained on. For the prototype we use 20k PDE/spectral theory articles, stored as 'APSP.parquet'.\n",
    "    - The MSC tag database: a JSON dictionary mapping the five-character codes, e.g. 38A17, to their corresponding English names.\n",
    "\n"
   ]
  },
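  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the MSC tag lookup described above; the dictionary here is a made-up miniature of the database, not the actual file:\n",
    "\n",
    "```python\n",
    "# Hypothetical miniature of the MSC tag database: a dictionary mapping\n",
    "# codes to their English names (sample entries only).\n",
    "msc_names = {\n",
    "    \"35B40\": \"Asymptotic behavior of solutions to PDEs\",\n",
    "    \"47A10\": \"Spectrum, resolvent\",\n",
    "}\n",
    "\n",
    "# Look up a code, falling back to None for unknown tags.\n",
    "name = msc_names.get(\"35B40\")\n",
    "```\n"
   ]
  },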
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thursday 6.29.2023\n",
    "\n",
    "- Created data_storage.py\n",
    "- This houses all functions and classes related to data storage\n",
    "- Created a class ArXivData to store arxiv metadata\n",
    "    - Designed to be passed to embedding class to vectorize\n",
    "    - The embedding class should call the cleaning methods under the hood.\n",
    "- Can load raw metadata from a query.\n",
    "    - Only stores id, title, abstract, and categories\n",
    "    - Faster than the previous version: can now retrieve 1k articles in ~10 seconds\n",
    "\n",
    "#### Todo: Write ArXivData methods\n",
    "    1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
    "    2. `load_from_file`: load arXiv data from a parquet file.\n",
    "    3. `save_to_file`: store data as a parquet file.\n",
    "    4. Improve the query functionality so that we can make larger queries, say all math articles from the last year.\n",
    "        - Need a way of breaking up an arXiv API call into pieces. How exactly does the code work? Creating the generator\n",
    "        object doesn't actually load the web pages.\n",
    "\n",
    "#### Todo: In the `load_from_query` function, fix the problem that the category tags are not returned properly\n",
    "    - Make one column for arXiv tags and one column for MSC tags\n",
    "    - Store MSC tags as their English names in a list\n",
    "    - Store arXiv categories one-hot encoded as a separate dataframe attribute\n",
    "\n",
    "#### Idea for the pipeline\n",
    "    1. Load data into the ArXivData class, whether from file or from query\n",
    "    2. Pass to the embedding class to either create or load the necessary embeddings and prepare them to be fed into a topic model\n",
    "        - What exactly is needed?\n",
    "    3. Pass to a topic model (BERTopic, LSA, LDA, PCA); experiment with multiple.\n",
    "\n",
    "#### EDA tools needed\n",
    "    1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
    "    2. Are there better ideas that don't just amount to labeling based on semantic similarity with the tag?\n",
    "        - An EDA question: are the tagged MSC tags the top most semantically similar to the title/abstract?\n",
    "\n",
    "\n"
   ]
  }
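  ,
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ArXivData plans above could be sketched roughly as follows; the class shape, method names, and the toy row are assumptions about the eventual design, not the actual implementation:\n",
    "\n",
    "```python\n",
    "from dataclasses import dataclass, field\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class ArXivData:\n",
    "    # One row per article: id, title, abstract, and a list of arXiv categories.\n",
    "    metadata: pd.DataFrame = field(default_factory=pd.DataFrame)\n",
    "\n",
    "    def save_to_file(self, path):\n",
    "        # Store the metadata as a parquet file.\n",
    "        self.metadata.to_parquet(path)\n",
    "\n",
    "    @classmethod\n",
    "    def load_from_file(cls, path):\n",
    "        # Load arXiv metadata from a parquet file.\n",
    "        return cls(metadata=pd.read_parquet(path))\n",
    "\n",
    "    def one_hot_categories(self):\n",
    "        # One-hot encode the category lists as a separate dataframe.\n",
    "        return self.metadata[\"categories\"].str.join(\"|\").str.get_dummies()\n",
    "\n",
    "\n",
    "# Toy example with a single made-up row.\n",
    "data = ArXivData(\n",
    "    metadata=pd.DataFrame(\n",
    "        {\n",
    "            \"id\": [\"2301.00001\"],\n",
    "            \"title\": [\"A toy article\"],\n",
    "            \"abstract\": [\"...\"],\n",
    "            \"categories\": [[\"math.AP\", \"math.SP\"]],\n",
    "        }\n",
    "    )\n",
    ")\n",
    "one_hot = data.one_hot_categories()  # columns: math.AP, math.SP\n",
    "```\n"
   ]
  }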
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.11"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}