{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# EDA for cleaned arXiv dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Which subject tag occurs most frequently in our 175k dataset?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Accelerator Physics | Adaptation and Self-Organizing Systems | Algebraic Geometry | Algebraic Topology | Analysis of PDEs | Applications | Applied Physics | Artificial Intelligence | Astrophysics | Astrophysics of Galaxies | ... | Strongly Correlated Electrons | Subcellular Processes | Superconductivity | Symbolic Computation | Symplectic Geometry | Systems and Control | Theoretical Economics | Tissues and Organs | Trading and Market Microstructure | UNK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

5 rows × 150 columns
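
The heading question can be answered straight from a one-hot table like the one above: summing each boolean column counts how many papers carry that tag. The following is a minimal sketch on toy data; the frame name `tags` and its values are hypothetical stand-ins, not the notebook's actual variable or the real 175k rows.

```python
import pandas as pd

# Toy stand-in for the one-hot tag table (hypothetical values, not the real data).
tags = pd.DataFrame(
    {
        "Analysis of PDEs": [True, True, False, True],
        "Artificial Intelligence": [False, True, False, False],
        "UNK": [False, False, True, False],
    }
)

# Summing a boolean column counts its True entries, i.e. papers with that tag.
tag_counts = tags.sum().sort_values(ascending=False)

# Label of the column with the highest count.
most_common_tag = tag_counts.idxmax()

print(tag_counts)
print("Most frequent tag:", most_common_tag)
```

On the real table, `tag_counts.head(10)` would list the ten most frequent tags, assuming every column is boolean as in the preview.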

| | raw_title | clean_title | hyph_in_title | raw_abstract | clean_abstract | hyph_in_abstract | authors_parsed | cat | update_date | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | The Prolongation Problem for the Heavenly Equa... | The Prolongation Problem for the Heavenly Equa... | None | We provide an exact regular solution of an o... | We provide an exact regular solution of an o... | None | [['Palese', 'M.', '', 'Dept. Math. Univ. of To... | [math.AP, math-ph, math.MP] | 2022-09-21 | math/0311218 |
| 55 | Null Controllability for a Degenerate Structur... | Null Controllability for a Degenerate Structur... | None | In this paper, we consider the infinite dime... | In this paper, we consider the infinite dime... | [final-state] | [['Simporé', 'Yacouba', ''], ['gantouh', 'Yass... | [math.OC, math.AP] | 2022-09-09 | 2209.03645 |
| 59 | Voting models and semilinear parabolic equations | Voting models and semilinear parabolic equations | None | We present probabilistic interpretations of ... | We present probabilistic interpretations of ... | [semi-linear, Fisher-KPP, group-based, pushmi-... | [['An', 'Jing', ''], ['Henderson', 'Christophe... | [math.AP, math.PR] | 2022-09-09 | 2209.03435 |
| 72 | Flows of $G_2$-Structures associated to Calabi... | Flows of LATEX associated to Calabi-Yau Manif... | [Calabi-Yau] | We establish a correspondence between a para... | We establish a correspondence between a para... | [Monge-Ampere, Monge-Ampere, torsion-free, Ric... | [['Picard', 'Sébastien', ''], ['Suan', 'Caleb'... | [math.DG, math.AP] | 2022-09-09 | 2209.03411 |
| 78 | On the dynamics of vortices in viscous 2D flows | On the dynamics of vortices in viscous 2D flows | None | We study the 2D Navier--Stokes solution star... | We study the 2D Navier--Stokes solution star... | None | [['Ceci', 'Stefano', ''], ['Seis', 'Christian'... | [math.AP] | 2022-09-09 | 2203.07185 |
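
Every row in the preview above lists `math.AP` in its `cat` column, which suggests the table came from filtering papers on that tag. The code that produced it is not shown, so the following is only a sketch of one way such a filter could be written. It assumes `cat` holds Python lists of category codes; the frame name `papers` and the `id` values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical ids and categories).
papers = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3"],
        "cat": [["math.AP", "math-ph"], ["cs.AI"], ["math.DG", "math.AP"]],
    }
)

# Keep a paper if "math.AP" appears anywhere in its list of categories.
# Assumption: each cell is a list; if the column was loaded from CSV as
# strings, it would need to be parsed back into lists first.
mask = papers["cat"].apply(lambda cats: "math.AP" in cats)
pde_papers = papers[mask]

print(pde_papers)
```

Because boolean indexing keeps the original index labels, a filter like this would also explain the non-contiguous row labels (42, 55, 59, ...) seen in the preview.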