{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project log"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Monday 6.26.2023\n",
"\n",
"- Created log to record important progress\n",
"\n",
"Restructuring project\n",
"- Remove extraneous files from \n",
"the data folder. The relevant 'data' for this project consists of \n",
" - The arxiv metadata the model is trained on. For the prototype we use 20k PDE/Spectral theory articles titled 'APSP.parquet'\n",
" - The MSC tag database. A json dictionary mapping the 5 digit codes e.g. 38A17 to their corresponding english names.\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Thursday 6.29.2023\n",
"\n",
"- Created data_storage.py\n",
"- This houses all functions and classes related to data storage\n",
"- Created a class ArXivData to store arxiv metadata\n",
" - Designed to be passed to embedding class to vectorize\n",
" - The embedding class should call the cleaning methods under the hood.\n",
"- Can load raw metadata from a query. \n",
" - Only stores id, title, abstract, categories\n",
" - Faster than previous version, now can retrieve 1k articles in ~10 seconds\n",
"\n",
"#### Todo: Write ArXivData methods\n",
" 1. `get_full_metadata`: take a list of ids and retrieve all of the available metadata as a generator.\n",
" 1. `load_from_file`: load arxiv data from a parquet file.\n",
" 1. `Save_to_file`: to store data as a parquet\n",
" 2. How to improve the query functionality so that we can make larger queries, say all math articles in the last year.\n",
" - need a way of breaking up an arxiv api call into pieces. How exactly does the code work? Creating the generator object doesn't \n",
" actually load the webpages.\n",
" \n",
"\n",
"#### Todo: In `load_from_query` function, fix the problem that the categories tags are not returned properly\n",
" - Make one column for arxiv tags, one column for msc tags\n",
" - store msc tags as their english names in a list\n",
" - store arxiv cats one-hot-encoded as a separate dataframe attribute\n",
"\n",
"#### Idea for the pipeline\n",
" 1. Load data in the ArXivData class whether from file or from query\n",
" 2. Pass to embedding class to either create or load the necessary embeddings and prepare it to be easily fed into a topic model\n",
" - What exactly is needed?\n",
" 3. Pass to topic model (BERTopic, LSA, LDA, PCA) experiment with multiple.\n",
"\n",
"#### EDA tools needed\n",
" 1. Semantic analysis of MSC tags to choose the best one out of the labels for the 'category'\n",
" 2. Are there better ideas that don't just ammount to labeling based on semnatic similarity with the tag?\n",
" - an EDA question: Are the tagged MSC tags the top most semantically similar to the title/abstract?\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 07/02/2023\n",
"\n",
"-Read medium article about using config files to set up highly modular data analysis pipelines.\n",
"-Interested in setting this up here\n",
"\n",
"#### Outline of pipeline architecture\n",
"\n",
"1. Load dataset \n",
" - option to load from file or from querying arxiv directly\n",
" - stores raw title and abstract, id #s, msc_tags as english, and categories (OHE) as a separate dataframe\n",
"2. Load embeddings\n",
" - option to load from file or generate using sentence transformers directly.\n",
" - any data cleaning procedures will occur in the pipeline here\n",
"3. Plug into topic model(s)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}