Stefano Fiorucci

Twin Peaks crawler

This crawler downloads texts and metadata from the Twin Peaks Fandom Wiki. The output format is JSON. The crawler is based on a combination of Scrapy and fandom-py.

Several wiki pages are discarded because they are unrelated to the Twin Peaks plot and would add noise to the Question Answering index.
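As a rough illustration of this kind of filtering, here is a minimal sketch of a title-based filter. The prefixes, keywords, and the should_discard helper are hypothetical and chosen only for the example; the crawler's actual exclusion rules may differ.

```python
# Hypothetical title prefixes for non-article pages (assumption, not the crawler's real list).
EXCLUDED_PREFIXES = ("Category:", "File:", "Template:", "User:")
# Hypothetical keywords for pages unrelated to the plot (assumption).
EXCLUDED_KEYWORDS = ("soundtrack", "dvd")


def should_discard(title: str) -> bool:
    """Return True if a wiki page title looks unrelated to the plot."""
    normalized = title.strip().lower()
    # Discard namespace-style pages such as "Category:..." or "File:...".
    if any(normalized.startswith(prefix.lower()) for prefix in EXCLUDED_PREFIXES):
        return True
    # Discard pages whose title contains one of the excluded keywords.
    return any(keyword in normalized for keyword in EXCLUDED_KEYWORDS)


titles = ["Dale Cooper", "Category:Characters", "Twin Peaks Soundtrack"]
kept = [title for title in titles if not should_discard(title)]
print(kept)
```

Filtering on titles before downloading keeps the check cheap, but it is only one possible design; a filter on page categories or content would work as well.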

Installation

  • copy this folder (if needed, see Stack Overflow)
  • pip install -r requirements.txt

Usage

  • (if needed, activate the virtual environment)
  • cd tpcrawler
  • scrapy crawl tpcrawler
  • you can find the downloaded pages in the data subfolder
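
Once the crawl has finished, the downloaded pages can be read back with the standard library. The sketch below assumes that the data subfolder contains one JSON document per file with a .json extension; the file layout and the load_pages helper are assumptions for illustration, so adjust them to what the crawler actually writes.

```python
import json
from pathlib import Path


def load_pages(data_dir):
    """Load every *.json file in data_dir and return the parsed documents."""
    pages = []
    # Sort the paths so the result order is stable across runs.
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages


if __name__ == "__main__":
    # Assumes the script is run from the folder that contains "data".
    pages = load_pages("data")
    print(f"Loaded {len(pages)} pages")
```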