{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project log"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Monday 6.26.2023\n",
    "\n",
    "- Created log to record important progress\n",
    "\n",
    "Restructuring project\n",
    "- Remove extraneous files from the data folder. The relevant 'data' for this project consists of:\n",
    "    - The arXiv metadata the model is trained on. For the prototype we use 20k PDE/spectral theory articles, stored in 'APSP.parquet'.\n",
    "    - The MSC tag database: a JSON dictionary mapping the 5-character codes (e.g. 38A17) to their corresponding English names.\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thursday 6.29.2023\n",
    "\n",
    "- Created data_storage.py\n",
    "- This houses all functions and classes related to data storage\n",
    "- Created a class ArXivData to store arXiv metadata\n",
    "    - Designed to be passed to the embedding class to vectorize\n",
    "    - The embedding class should call the cleaning methods under the hood.\n",
    "- Can load raw metadata from a query.\n",
    "    - Only stores id, title, abstract, categories\n",
    "    - Faster than the previous version: can now retrieve 1k articles in ~10 seconds\n",
    "\n",
    "#### Todo: Write ArXivData methods\n",
    "1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
    "1. `load_from_file`: load arXiv data from a parquet file.\n",
    "1. `Save_to_file`: store data as a parquet file.\n",
    "2. Work out how to improve the query functionality so that we can make larger queries, say all math articles from the last year.\n",
    "    - Need a way of breaking up an arXiv API call into pieces. How exactly does the code work? Creating the generator object doesn't actually load the webpages.\n",
    "\n",
    "\n",
    "#### Todo: In the `load_from_query` function, fix the problem that the category tags are not returned properly\n",
    "- Make one column for arXiv tags and one column for MSC tags\n",
    "- Store MSC tags as their English names in a list\n",
    "- Store arXiv categories one-hot encoded as a separate dataframe attribute\n",
    "\n",
    "#### Idea for the pipeline\n",
    "1. Load data into the ArXivData class, whether from file or from query\n",
    "2. Pass to the embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n",
    "    - What exactly is needed?\n",
    "3. Pass to a topic model (BERTopic, LSA, LDA, PCA); experiment with several.\n",
    "\n",
    "#### EDA tools needed\n",
    "1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
    "2. Are there better ideas that don't just amount to labeling based on semantic similarity with the tag?\n",
    "    - An EDA question: are the assigned MSC tags the ones most semantically similar to the title/abstract?\n",
    "\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 07/02/2023\n",
    "\n",
    "- Read a Medium article about using config files to set up highly modular data analysis pipelines.\n",
    "- Interested in setting this up here.\n",
    "\n",
    "#### Outline of pipeline architecture\n",
    "\n",
    "1. Load dataset\n",
    "    - option to load from file or from querying arXiv directly\n",
    "    - stores raw title and abstract, ID numbers, msc_tags as English names, and categories (one-hot encoded) as a separate dataframe\n",
    "2. Load embeddings\n",
    "    - option to load from file or generate using sentence transformers directly.\n",
    "    - any data cleaning procedures will occur in the pipeline here\n",
    "3. Plug into topic model(s)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 07/03/2023\n",
    "\n",
    "#### Modified data_storage.py\n",
    "\n",
    "Done:\n",
    "1. Wrote `load_from_feather` and `save_to_feather`\n",
    "1. Pulled metadata for 40k papers in PDE and spectral theory and stored it as 'APSP_40.feather'\n",
    "\n",
    "To Do:\n",
    "1. Write comments for the methods in the arXivData class.\n",
    "1. Make sure the class functionality works correctly when a query returns no results.\n",
    "\n",
    "\n",
    "#### Miscellaneous\n",
    "1. Install `tabbed out` extension for exiting delimiter environments with tab.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 07/04/2023\n",
    "\n",
    "#### Create embedding module, `embedding.py`\n",
    "\n",
    "Functions\n",
    "1. Take in an arXivData class object\n",
    "1. Generate embeddings for the clean text\n",
    "1. Compute the most semantically similar MSC tags\n",
    "1. Output the NumPy array containing the embeddings\n",
    "1. Output the NumPy array in which row i is\n",
    "    - the embedding vector of the most similar MSC tag, if there are MSC tags\n",
    "    - NaN if there are no MSC tags.\n",
    "\n",
    "\n",
    "Stopping in the middle of step 3, which is the function `rank_msc_tags` in embedding.py\n",
    "\n",
    "Need to add the dataclass decorator from the data storage module to my arXivData class.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 07/14/2023\n",
    "\n",
    "#### Modified embedding module\n",
    "1. Added functions to generate and load the embeddings of MSC and arXiv subject tags.\n",
    "2. Saved these embeddings in the data directory as parquet files, with the index of each row equal to the word that the row vector encodes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.11"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}