The Coachella of Computer Vision, CVPR, is right around the corner. In anticipation of the conference, I curated a dataset of the papers.
I'll have a technical blog post out tomorrow doing some analysis on the dataset, but I'm so hyped that I wanted to get it out to the community ASAP.
The dataset consists of the following fields:
- An image of the first page of the paper
- `title`: The title of the paper
- `authors_list`: The list of authors
- `abstract`: The abstract of the paper
- `arxiv_link`: Link to the paper on arXiv
- `other_link`: Link to the project page, if found
- `category_name`: The primary category of the paper, according to the [arXiv taxonomy](https://arxiv.org/category_taxonomy)
- `all_categories`: All categories the paper falls into, according to the arXiv taxonomy
- `keywords`: Extracted using GPT-4o

Here's how I created the dataset 👇🏼
Generic code for building this dataset can be found [here](https://github.com/harpreetsahota204/CVPR-2024-Papers).
This dataset was built using the following steps:
- Scrape the CVPR 2024 website for accepted papers
- Use DuckDuckGo to search for a link to the paper's abstract on arXiv
- Use arxiv.py (a Python wrapper for the arXiv API) to extract the abstract and categories, and download the PDF for each paper
- Use pdf2image to save an image of each paper's first page
- Use GPT-4o to extract keywords from the abstract
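
To make the middle steps concrete, here's a rough sketch of the arXiv and pdf2image part for a single paper. It is not the code from the repo linked above. The helper names (`extract_arxiv_id`, `fetch_paper`), the `papers` output folder, and the regex are my own assumptions for illustration, and the scraping, DuckDuckGo search, and GPT-4o keyword steps are left out. It assumes the `arxiv` and `pdf2image` packages are installed (pdf2image also needs poppler).

```python
import re
from pathlib import Path
from typing import Optional

# Matches new-style arXiv identifiers (e.g. 2404.01234) in abs/ or pdf/ URLs.
ARXIV_ID_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)


def extract_arxiv_id(url: str) -> Optional[str]:
    """Return the arXiv ID found in a URL (version suffix dropped), or None."""
    match = ARXIV_ID_RE.search(url)
    return match.group(1) if match else None


def fetch_paper(arxiv_url: str, out_dir: str = "papers") -> Optional[dict]:
    """Hypothetical helper: fetch metadata, PDF, and first-page image for one paper."""
    # Third-party imports are local so extract_arxiv_id works without them.
    import arxiv  # pip install arxiv
    from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

    arxiv_id = extract_arxiv_id(arxiv_url)
    if arxiv_id is None:
        return None

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Look the paper up by ID through the arXiv API.
    search = arxiv.Search(id_list=[arxiv_id])
    result = next(arxiv.Client().results(search), None)
    if result is None:
        return None

    # Download the PDF next to the image we are about to create.
    pdf_path = result.download_pdf(dirpath=str(out), filename=f"{arxiv_id}.pdf")

    # Render only page 1 and save it as a PNG.
    first_page = convert_from_path(pdf_path, first_page=1, last_page=1)[0]
    image_path = out / f"{arxiv_id}.png"
    first_page.save(image_path)

    # Keys follow the dataset's field names; keywords and other_link
    # come from steps that this sketch does not cover.
    return {
        "title": result.title,
        "authors_list": [author.name for author in result.authors],
        "abstract": result.summary,
        "arxiv_link": result.entry_id,
        "category_name": result.primary_category,
        "all_categories": list(result.categories),
        "image_path": str(image_path),
    }
```

Calling `fetch_paper` with the arXiv link found in the search step would return one record per paper, or `None` when the link doesn't contain a recognizable arXiv ID. Rendering only the first page keeps the pdf2image step cheap, since the dataset only needs that one image.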
Voxel51/CVPR_2024_Papers