Michael-Geis committed
Commit b0e8ca7
1 Parent(s): 9b818c8

see 6.29 log notes for these changes
collection.ipynb CHANGED
@@ -35953,6 +35953,231 @@
     "pd.set_option('display.max_colwidth', 0)\n",
     "clean_data.head()"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 168,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>title</th>\n",
+       "      <th>summary</th>\n",
+       "      <th>categories</th>\n",
+       "      <th>id</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Boundedness, Ultracontractive Bounds and Optim...</td>\n",
+       "      <td>We investigate some regularity properties of a...</td>\n",
+       "      <td>[math.AP, 35B65, 35D30, 35K10, 35B45]</td>\n",
+       "      <td>2306.17152v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>The compressible Navier-Stokes equations with ...</td>\n",
+       "      <td>We show the global existence of a weak solutio...</td>\n",
+       "      <td>[math.AP, 35Q30, 76D03, 35K85]</td>\n",
+       "      <td>2305.00822v2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>A simple reaction-diffusion system as a possib...</td>\n",
+       "      <td>Chemotaxis is a directed cell movement in resp...</td>\n",
+       "      <td>[math.AP, 92-10]</td>\n",
+       "      <td>2211.06933v3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Optimal blowup stability for three-dimensional...</td>\n",
+       "      <td>We study corotational wave maps from $(1+3)$-d...</td>\n",
+       "      <td>[math.AP, math-ph, math.MP]</td>\n",
+       "      <td>2212.08374v3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>A Note on $L^1-$contractive property of the so...</td>\n",
+       "      <td>In this note, we study the $L^1-$contractive p...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>2306.17064v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19995</th>\n",
+       "      <td>Exact controllability of the linear Zakharov-K...</td>\n",
+       "      <td>We consider the linear Zakharov-Kuznetsov equa...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.03066v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19996</th>\n",
+       "      <td>Blow-up for the pointwise NLS in dimension two...</td>\n",
+       "      <td>We consider the Schr\\\"odinger equation in dime...</td>\n",
+       "      <td>[math.AP, math-ph, math.FA, math.MP]</td>\n",
+       "      <td>1808.10343v4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19997</th>\n",
+       "      <td>Inverse problems with partial data for ellipti...</td>\n",
+       "      <td>For a second order formally symmetric elliptic...</td>\n",
+       "      <td>[math.AP, math-ph, math.MP]</td>\n",
+       "      <td>1912.03047v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19998</th>\n",
+       "      <td>Inverse problems for the nonlinear modified tr...</td>\n",
+       "      <td>This article is devoted to inverse problems fo...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.02996v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19999</th>\n",
+       "      <td>A non-autonomous bifurcation problem for a non...</td>\n",
+       "      <td>In this paper we study the asymptotic behavior...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.02995v1</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>20000 rows × 4 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       " title ... id\n",
+       "0 Boundedness, Ultracontractive Bounds and Optim... ... 2306.17152v1\n",
+       "1 The compressible Navier-Stokes equations with ... ... 2305.00822v2\n",
+       "2 A simple reaction-diffusion system as a possib... ... 2211.06933v3\n",
+       "3 Optimal blowup stability for three-dimensional... ... 2212.08374v3\n",
+       "4 A Note on $L^1-$contractive property of the so... ... 2306.17064v1\n",
+       "... ... ... ...\n",
+       "19995 Exact controllability of the linear Zakharov-K... ... 1912.03066v1\n",
+       "19996 Blow-up for the pointwise NLS in dimension two... ... 1808.10343v4\n",
+       "19997 Inverse problems with partial data for ellipti... ... 1912.03047v1\n",
+       "19998 Inverse problems for the nonlinear modified tr... ... 1912.02996v1\n",
+       "19999 A non-autonomous bifurcation problem for a non... ... 1912.02995v1\n",
+       "\n",
+       "[20000 rows x 4 columns]"
+      ]
+     },
+     "execution_count": 168,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import data_storage\n",
+    "import importlib\n",
+    "importlib.reload(data_storage)\n",
+    "\n",
+    "\n",
+    "data = data_storage.ArXivData()\n",
+    "\n",
+    "max_results = 20000\n",
+    "offset = 0\n",
+    "data.load_from_query(query_string='cat:math.AP',\n",
+    "                     max_results=max_results,\n",
+    "                     offset=offset,\n",
+    "                     )\n",
+    "data.data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "On the stability of critical points of the Hardy-Littlewood-Sobolev inequality 2023-06-28 01:31:15+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "import arxiv\n",
+    "from datetime import datetime , timedelta , timezone\n",
+    "\n",
+    "\n",
+    "search = arxiv.Search(query='cat:math.AP', max_results=1e3,sort_by=arxiv.SortCriterion.LastUpdatedDate, sort_order=arxiv.SortOrder.Descending)\n",
+    "\n",
+    "for result in search.results():\n",
+    "    if result.updated < datetime.now(timezone.utc) - timedelta(days=2):\n",
+    "        print(result.title,result.updated)\n",
+    "        break\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2023-05-16 20:01:32+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "##\n",
+    "oldest = list(search.results())[-1]\n",
+    "print(oldest.updated)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2023-05-16 20:01:32+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "*_, last = search.results()\n",
+    "print(last.updated)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
arxiv_query_retrieval.py → data_storage.py RENAMED
@@ -23,14 +23,15 @@ class ArXivData():
 
         self.data = None
         self.query = None
-        self.raw = None
         self.categories = None
 
-    def get_from_query(self,query_string,max_results):
-        self.data = query_to_df(query=query_string,max_results=max_results)
+    def load_from_file():
+        pass
+
+    def load_from_query(self,query_string,max_results,offset):
+        self.data = query_to_df(query=query_string,max_results=max_results,offset=offset)
         self.query = (query_string,max_results)
-        self.raw = self.data
-        self.categories = self.get_OHE_cats()
+        #self.categories = self.get_OHE_cats()
 
 
     def clean(self,dataset):
@@ -78,7 +79,7 @@ def format_query(author='',title='',cat='',abstract=''):
 
 
 
-def query_to_df(query,max_results):
+def query_to_df(query,max_results,offset):
     """Returns the results of an arxiv API query in a pandas dataframe.
 
     Args:
@@ -87,6 +88,8 @@ def query_to_df(query,max_results):
 
         max_results: positive integer specifying the maximum number of results returned.
 
+        chunksize:
+
     Returns:
         pandas dataframe with one column for indivial piece of metadata of a returned result.
         To see a list of these columns and their descriptions, see the documentation for the Results class of the arxiv package here:
@@ -95,22 +98,31 @@ def query_to_df(query,max_results):
        The 'links' column is dropped and the authors column is a list of each author's name as a string.
        The categories column is also a list of all tags appearing.
    """
-    client = arxiv.Client(page_size=100,num_retries=3)
+    client = arxiv.Client(page_size=2000,num_retries=3)
     search = arxiv.Search(
        query = query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
-    results = client.results(search)
+
+    columns = ['title','summary','categories','id']
+    index = range(offset,max_results)
+
+
+    results = client.results(search,offset=offset)
+
+    metadata_generator = ((result.title,result.summary,
+                           result.categories,
+                           result.entry_id.split('/')[-1]) for result in results)
+
+    metadata_dataframe = pd.DataFrame(metadata_generator, columns=columns, index=index)
+
+
+    return metadata_dataframe
+
+
+
+
 
-    drop_cols = ['authors','links','_raw']
-    df = pd.DataFrame()
 
-    for result in results:
-        row_dict = {k : v for (k,v) in vars(result).items() if k not in drop_cols}
-        row_dict['authors'] = [author.name for author in result.authors]
-        row_dict['links'] = [link.href for link in result.links]
-        row = pd.Series(row_dict)
-        df = pd.concat([df , row.to_frame().transpose()], axis = 0)
 
-    return df.reset_index(drop=True,inplace=False)
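The refactored `query_to_df` above builds its DataFrame from a generator of tuples instead of concatenating one-row frames in a loop, which plausibly accounts for the speedup noted in the log. A minimal runnable sketch of that pattern, with hypothetical stub dictionaries standing in for `arxiv.Result` objects so no network call is made:

```python
import pandas as pd

# Stand-ins for arxiv.Result objects (hypothetical sample data).
stub_results = [
    {"title": "Paper A", "summary": "...", "categories": ["math.AP"],
     "entry_id": "http://arxiv.org/abs/2306.17152v1"},
    {"title": "Paper B", "summary": "...", "categories": ["math.AP", "math.SP"],
     "entry_id": "http://arxiv.org/abs/2305.00822v2"},
]

columns = ["title", "summary", "categories", "id"]
offset, max_results = 0, 2

# One tuple per result; pandas consumes the iterable directly, avoiding the
# quadratic row-wise pd.concat of the old implementation.
metadata_generator = (
    (r["title"], r["summary"], r["categories"], r["entry_id"].split("/")[-1])
    for r in stub_results
)

df = pd.DataFrame(metadata_generator, columns=columns, index=range(offset, max_results))
print(df["id"].tolist())  # → ['2306.17152v1', '2305.00822v2']
```

Note that `pd.DataFrame` requires the index length to match the row count, so `index=range(offset, max_results)` only works when the query returns exactly `max_results - offset` results.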
 
project_log.ipynb CHANGED
@@ -18,11 +18,56 @@
     "- Created log to record important progress\n",
     "\n",
     "Restructuring project\n",
-    "- Remove extraneous files from the data folder. The relevant 'data' for this project consists of \n",
+    "- Remove extraneous files from \n",
+    "the data folder. The relevant 'data' for this project consists of \n",
     " - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n",
     " - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n",
     "\n"
    ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Thursday 6.29.2023\n",
+    "\n",
+    "- Created data_storage.py\n",
+    "- This houses all functions and classes related to data storage\n",
+    "- Created a class ArXivData to store arxiv metadata\n",
+    " - Designed to be passed to embedding class to vectorize\n",
+    " - The embedding class should call the cleaning methods under the hood.\n",
+    "- Can load raw metadata from a query. \n",
+    " - Only stores id, title, abstract, categories\n",
+    " - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n",
+    "\n",
+    "#### Todo: Write ArXivData methods\n",
+    " 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
+    " 1. `load_from_file`: load arxiv data from a parquet file.\n",
+    " 1. `Save_to_file`: to store data as a parquet\n",
+    " 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n",
+    " - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n",
+    " actually load the webpages.\n",
+    " \n",
+    "\n",
+    "#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n",
+    " - Make one column for arxiv tags, one column for msc tags\n",
+    " - store msc tags as their english names in a list\n",
+    " - store arxiv cats one-hot-encoded as a separate dataframe attribute\n",
+    "\n",
+    "#### Idea for the pipeline\n",
+    " 1. Load data in the ArXivData class whether from file or from query\n",
+    " 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n",
+    " - What exactly is needed?\n",
+    " 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n",
+    "\n",
+    "#### EDA tools needed\n",
+    " 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
+    " 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n",
+    " - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n",
+    "\n",
+    "\n"
+   ]
   }
  ],
  "metadata": {
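The category todo recorded in the log above (one column for arXiv tags, one for MSC tags, with arXiv categories one-hot encoded as a separate dataframe) could be prototyped roughly as follows. The regex, column names, and sample data are assumptions, not the project's API, and hyphenated MSC-style codes such as 92-10 would need a looser pattern:

```python
import re
import pandas as pd

# Plain MSC codes look like "35B65": two digits, a letter, two digits.
# Codes like "92-10" (seen in the retrieved data) would need extra handling.
MSC_RE = re.compile(r"^\d{2}[A-Z]\d{2}$")

# Hypothetical sample of the mixed "categories" column.
df = pd.DataFrame({"categories": [["math.AP", "35B65", "35D30"],
                                  ["math.AP", "math-ph"]]})

# Split each list into MSC codes and everything else (arXiv tags).
df["msc_tags"] = df["categories"].apply(lambda tags: [t for t in tags if MSC_RE.match(t)])
df["arxiv_tags"] = df["categories"].apply(lambda tags: [t for t in tags if not MSC_RE.match(t)])

# One-hot encode the arXiv tags into a separate dataframe, one row per article.
ohe_cats = pd.get_dummies(df["arxiv_tags"].explode()).groupby(level=0).max()
```

`explode` turns each list into one row per tag while keeping the original row index, so grouping the dummies by that index collapses them back to one indicator row per article.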