{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project log"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Monday 6.26.2023\n",
"\n",
"- Created log to record important progress\n",
"\n",
"Restructuring project\n",
"- Remove extraneous files from \n",
"the data folder. The relevant 'data' for this project consists of \n",
" - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n",
" - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Thursday 6.29.2023\n",
"\n",
"- Created data_storage.py\n",
"- This houses all functions and classes related to data storage\n",
"- Created a class ArXivData to store arxiv metadata\n",
" - Designed to be passed to embedding class to vectorize\n",
" - The embedding class should call the cleaning methods under the hood.\n",
"- Can load raw metadata from a query. \n",
" - Only stores id, title, abstract, categories\n",
" - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n",
"\n",
"#### Todo: Write ArXivData methods\n",
" 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
" 1. `load_from_file`: load arxiv data from a parquet file.\n",
" 1. `Save_to_file`: to store data as a parquet\n",
" 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n",
" - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n",
" actually load the webpages.\n",
" \n",
"\n",
"#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n",
" - Make one column for arxiv tags, one column for msc tags\n",
" - store msc tags as their english names in a list\n",
" - store arxiv cats one-hot-encoded as a separate dataframe attribute\n",
"\n",
"#### Idea for the pipeline\n",
" 1. Load data in the ArXivData class whether from file or from query\n",
" 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n",
" - What exactly is needed?\n",
" 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n",
"\n",
"#### EDA tools needed\n",
" 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
" 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n",
" - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 07/02/2023\n",
"\n",
"-Read medium article about using config files to set up highly modular data analysis pipelines.\n",
"-Interested in setting this up here\n",
"\n",
"#### Outline of pipeline architecture\n",
"\n",
"1. Load dataset \n",
" - option to load from file or from querying arxiv directly\n",
" - stores raw title and abstract, id #s, msc_tags as english, and categories (OHE) as a separate dataframe\n",
"2. Load embeddings\n",
" - option to load from file or generate using sentence transformers directly.\n",
" - any data cleaning procedures will occur in the pipeline here\n",
"3. Plug into topic model(s)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 07/03/2023\n",
"\n",
"#### Modified data_storage.py\n",
"\n",
"Done:\n",
"1. Wrote `load_from_feather` and `save_to_feather`\n",
"1. Pulled and stored metadata for 40k papers in pde and spectral theory called 'APSP_40.feather'\n",
"\n",
"To Do:\n",
"1. Write comments for the methods in the arXivData class.\n",
"1. Make sure the class functionality works correctly when a query returns no results.\n",
"\n",
"\n",
"#### Miscellaneous\n",
"1. Install `tabbed out` extension for exiting delimiter environments with tab.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}