Michael-Geis committed
Commit b0e8ca7
1 Parent(s): 9b818c8

see 6.29 log notes for these changes
collection.ipynb CHANGED
@@ -35953,6 +35953,231 @@
     "pd.set_option('display.max_colwidth', 0)\n",
     "clean_data.head()"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 168,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>title</th>\n",
+       "      <th>summary</th>\n",
+       "      <th>categories</th>\n",
+       "      <th>id</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Boundedness, Ultracontractive Bounds and Optim...</td>\n",
+       "      <td>We investigate some regularity properties of a...</td>\n",
+       "      <td>[math.AP, 35B65, 35D30, 35K10, 35B45]</td>\n",
+       "      <td>2306.17152v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>The compressible Navier-Stokes equations with ...</td>\n",
+       "      <td>We show the global existence of a weak solutio...</td>\n",
+       "      <td>[math.AP, 35Q30, 76D03, 35K85]</td>\n",
+       "      <td>2305.00822v2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>A simple reaction-diffusion system as a possib...</td>\n",
+       "      <td>Chemotaxis is a directed cell movement in resp...</td>\n",
+       "      <td>[math.AP, 92-10]</td>\n",
+       "      <td>2211.06933v3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Optimal blowup stability for three-dimensional...</td>\n",
+       "      <td>We study corotational wave maps from $(1+3)$-d...</td>\n",
+       "      <td>[math.AP, math-ph, math.MP]</td>\n",
+       "      <td>2212.08374v3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>A Note on $L^1-$contractive property of the so...</td>\n",
+       "      <td>In this note, we study the $L^1-$contractive p...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>2306.17064v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19995</th>\n",
+       "      <td>Exact controllability of the linear Zakharov-K...</td>\n",
+       "      <td>We consider the linear Zakharov-Kuznetsov equa...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.03066v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19996</th>\n",
+       "      <td>Blow-up for the pointwise NLS in dimension two...</td>\n",
+       "      <td>We consider the Schr\\\"odinger equation in dime...</td>\n",
+       "      <td>[math.AP, math-ph, math.FA, math.MP]</td>\n",
+       "      <td>1808.10343v4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19997</th>\n",
+       "      <td>Inverse problems with partial data for ellipti...</td>\n",
+       "      <td>For a second order formally symmetric elliptic...</td>\n",
+       "      <td>[math.AP, math-ph, math.MP]</td>\n",
+       "      <td>1912.03047v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19998</th>\n",
+       "      <td>Inverse problems for the nonlinear modified tr...</td>\n",
+       "      <td>This article is devoted to inverse problems fo...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.02996v1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19999</th>\n",
+       "      <td>A non-autonomous bifurcation problem for a non...</td>\n",
+       "      <td>In this paper we study the asymptotic behavior...</td>\n",
+       "      <td>[math.AP]</td>\n",
+       "      <td>1912.02995v1</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>20000 rows × 4 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       " title ... id\n",
+       "0 Boundedness, Ultracontractive Bounds and Optim... ... 2306.17152v1\n",
+       "1 The compressible Navier-Stokes equations with ... ... 2305.00822v2\n",
+       "2 A simple reaction-diffusion system as a possib... ... 2211.06933v3\n",
+       "3 Optimal blowup stability for three-dimensional... ... 2212.08374v3\n",
+       "4 A Note on $L^1-$contractive property of the so... ... 2306.17064v1\n",
+       "... ... ... ...\n",
+       "19995 Exact controllability of the linear Zakharov-K... ... 1912.03066v1\n",
+       "19996 Blow-up for the pointwise NLS in dimension two... ... 1808.10343v4\n",
+       "19997 Inverse problems with partial data for ellipti... ... 1912.03047v1\n",
+       "19998 Inverse problems for the nonlinear modified tr... ... 1912.02996v1\n",
+       "19999 A non-autonomous bifurcation problem for a non... ... 1912.02995v1\n",
+       "\n",
+       "[20000 rows x 4 columns]"
+      ]
+     },
+     "execution_count": 168,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import data_storage\n",
+    "import importlib\n",
+    "importlib.reload(data_storage)\n",
+    "\n",
+    "\n",
+    "data = data_storage.ArXivData()\n",
+    "\n",
+    "max_results = 20000\n",
+    "offset = 0\n",
+    "data.load_from_query(query_string='cat:math.AP',\n",
+    "                     max_results=max_results,\n",
+    "                     offset=offset,\n",
+    "                     )\n",
+    "data.data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "On the stability of critical points of the Hardy-Littlewood-Sobolev inequality 2023-06-28 01:31:15+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "import arxiv\n",
+    "from datetime import datetime , timedelta , timezone\n",
+    "\n",
+    "\n",
+    "search = arxiv.Search(query='cat:math.AP', max_results=1e3,sort_by=arxiv.SortCriterion.LastUpdatedDate, sort_order=arxiv.SortOrder.Descending)\n",
+    "\n",
+    "for result in search.results():\n",
+    "    if result.updated < datetime.now(timezone.utc) - timedelta(days=2):\n",
+    "        print(result.title,result.updated)\n",
+    "        break\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2023-05-16 20:01:32+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "##\n",
+    "oldest = list(search.results())[-1]\n",
+    "print(oldest.updated)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2023-05-16 20:01:32+00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "*_, last = search.results()\n",
+    "print(last.updated)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
arxiv_query_retrieval.py → data_storage.py RENAMED
@@ -23,14 +23,15 @@ class ArXivData():
 
         self.data = None
         self.query = None
-        self.raw = None
         self.categories = None
 
-    def get_from_query(self,query_string,max_results):
-        self.data = query_to_df(query=query_string,max_results=max_results)
+    def load_from_file():
+        pass
+
+    def load_from_query(self,query_string,max_results,offset):
+        self.data = query_to_df(query=query_string,max_results=max_results,offset=offset)
         self.query = (query_string,max_results)
-        self.raw = self.data
-        self.categories = self.get_OHE_cats()
+        #self.categories = self.get_OHE_cats()
 
 
     def clean(self,dataset):
@@ -78,7 +79,7 @@ def format_query(author='',title='',cat='',abstract=''):
 
 
 
-def query_to_df(query,max_results):
+def query_to_df(query,max_results,offset):
     """Returns the results of an arxiv API query in a pandas dataframe.
 
     Args:
@@ -87,6 +88,8 @@ def query_to_df(query,max_results):
 
         max_results: positive integer specifying the maximum number of results returned.
 
+        chunksize:
+
     Returns:
         pandas dataframe with one column for indivial piece of metadata of a returned result.
         To see a list of these columns and their descriptions, see the documentation for the Results class of the arxiv package here:
@@ -95,22 +98,31 @@ def query_to_df(query,max_results):
        The 'links' column is dropped and the authors column is a list of each author's name as a string.
        The categories column is also a list of all tags appearing.
    """
-    client = arxiv.Client(page_size=100,num_retries=3)
+    client = arxiv.Client(page_size=2000,num_retries=3)
     search = arxiv.Search(
        query = query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
-    results = client.results(search)
+
+    columns = ['title','summary','categories','id']
+    index = range(offset,max_results)
+
+
+    results = client.results(search,offset=offset)
+
+    metadata_generator = ((result.title,result.summary,
+                           result.categories,
+                           result.entry_id.split('/')[-1]) for result in results)
+
+    metadata_dataframe = pd.DataFrame(metadata_generator, columns=columns, index=index)
+
+
+    return metadata_dataframe
+
+
+
+
 
-    drop_cols = ['authors','links','_raw']
-    df = pd.DataFrame()
 
-    for result in results:
-        row_dict = {k : v for (k,v) in vars(result).items() if k not in drop_cols}
-        row_dict['authors'] = [author.name for author in result.authors]
-        row_dict['links'] = [link.href for link in result.links]
-        row = pd.Series(row_dict)
-        df = pd.concat([df , row.to_frame().transpose()], axis = 0)
 
-    return df.reset_index(drop=True,inplace=False)
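The refactored `query_to_df` above builds its DataFrame from a generator of tuples instead of concatenating one-row frames in a loop, which plausibly accounts for the speedup noted in the log. A minimal runnable sketch of that pattern, with hypothetical stub dictionaries standing in for `arxiv.Result` objects so no network call is made:

```python
import pandas as pd

# Stand-ins for arxiv.Result objects (hypothetical sample data).
stub_results = [
    {"title": "Paper A", "summary": "...", "categories": ["math.AP"],
     "entry_id": "http://arxiv.org/abs/2306.17152v1"},
    {"title": "Paper B", "summary": "...", "categories": ["math.AP", "math.SP"],
     "entry_id": "http://arxiv.org/abs/2305.00822v2"},
]

columns = ["title", "summary", "categories", "id"]
offset, max_results = 0, 2

# One tuple per result; pandas consumes the iterable directly, avoiding the
# quadratic row-wise pd.concat of the old implementation.
metadata_generator = (
    (r["title"], r["summary"], r["categories"], r["entry_id"].split("/")[-1])
    for r in stub_results
)

df = pd.DataFrame(metadata_generator, columns=columns, index=range(offset, max_results))
print(df["id"].tolist())  # → ['2306.17152v1', '2305.00822v2']
```

Note that `pd.DataFrame` requires the index length to match the row count, so `index=range(offset, max_results)` only works when the query returns exactly `max_results - offset` results.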
 
project_log.ipynb CHANGED
@@ -18,11 +18,56 @@
     "- Created log to record important progress\n",
     "\n",
     "Restructuring project\n",
-    "- Remove extraneous files from the data folder. The relevant 'data' for this project consists of \n",
+    "- Remove extraneous files from \n",
+    "the data folder. The relevant 'data' for this project consists of \n",
     " - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n",
     " - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n",
     "\n"
    ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Thursday 6.29.2023\n",
+    "\n",
+    "- Created data_storage.py\n",
+    "- This houses all functions and classes related to data storage\n",
+    "- Created a class ArXivData to store arxiv metadata\n",
+    " - Designed to be passed to embedding class to vectorize\n",
+    " - The embedding class should call the cleaning methods under the hood.\n",
+    "- Can load raw metadata from a query. \n",
+    " - Only stores id, title, abstract, categories\n",
+    " - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n",
+    "\n",
+    "#### Todo: Write ArXivData methods\n",
+    " 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
+    " 1. `load_from_file`: load arxiv data from a parquet file.\n",
+    " 1. `Save_to_file`: to store data as a parquet\n",
+    " 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n",
+    " - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n",
+    " actually load the webpages.\n",
+    " \n",
+    "\n",
+    "#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n",
+    " - Make one column for arxiv tags, one column for msc tags\n",
+    " - store msc tags as their english names in a list\n",
+    " - store arxiv cats one-hot-encoded as a separate dataframe attribute\n",
+    "\n",
+    "#### Idea for the pipeline\n",
+    " 1. Load data in the ArXivData class whether from file or from query\n",
+    " 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n",
+    " - What exactly is needed?\n",
+    " 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n",
+    "\n",
+    "#### EDA tools needed\n",
+    " 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
+    " 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n",
+    " - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n",
+    "\n",
+    "\n"
+   ]
   }
  ],
  "metadata": {
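The category todo recorded in the log above (one column for arXiv tags, one for MSC tags, with arXiv categories one-hot encoded as a separate dataframe) could be prototyped roughly as follows. The regex, column names, and sample data are assumptions, not the project's API, and hyphenated MSC-style codes such as 92-10 would need a looser pattern:

```python
import re
import pandas as pd

# Plain MSC codes look like "35B65": two digits, a letter, two digits.
# Codes like "92-10" (seen in the retrieved data) would need extra handling.
MSC_RE = re.compile(r"^\d{2}[A-Z]\d{2}$")

# Hypothetical sample of the mixed "categories" column.
df = pd.DataFrame({"categories": [["math.AP", "35B65", "35D30"],
                                  ["math.AP", "math-ph"]]})

# Split each list into MSC codes and everything else (arXiv tags).
df["msc_tags"] = df["categories"].apply(lambda tags: [t for t in tags if MSC_RE.match(t)])
df["arxiv_tags"] = df["categories"].apply(lambda tags: [t for t in tags if not MSC_RE.match(t)])

# One-hot encode the arXiv tags into a separate dataframe, one row per article.
ohe_cats = pd.get_dummies(df["arxiv_tags"].explode()).groupby(level=0).max()
```

`explode` turns each list into one row per tag while keeping the original row index, so grouping the dummies by that index collapses them back to one indicator row per article.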