How to build an incremental web crawler with Apify

Community Article · Published August 23, 2024

When monitoring websites for updates, it's often inefficient to crawl the entire site repeatedly. This is where the Incremental Link Crawler comes in -- a tool designed to identify and return only the newly added or updated pages within a specified time frame. By focusing on recent changes, you can streamline your web monitoring tasks and reduce unnecessary overhead. In this guide, we'll walk you through how to use the Incremental Link Crawler and explore its features and configuration options.

How it Works

The Incremental Link Crawler takes two main inputs:

  • url: the website you want to monitor
  • daysAgo: the number of days back you want to search for updates (default is 1 day)

The crawler will then identify pages that have been added or updated within the specified time frame and return a list of URLs pointing to those pages.

Example
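As a minimal sketch, here is how you might run the crawler programmatically with the official apify-client package and read back the list of updated URLs. The actor ID, token, and dataset item shape shown here are placeholders and assumptions, not values from the actor's documentation:

```typescript
import { ApifyClient } from 'apify-client';

// Placeholder token (assumption); replace with your own Apify API token.
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

// Placeholder actor ID (assumption); run the crawler for pages added or
// updated within the last 2 days.
const run = await client.actor('your-username/incremental-link-crawler').call({
    url: 'https://www.example-news-site.com',
    daysAgo: 2,
});

// The resulting URLs are stored in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```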

Integration with other actors, e.g. apify/website-content-crawler

You create a task for the follow-up actor and pass its task ID to the incremental crawler, which calls that task once it has found new or updated links (see the sketch after the parameter list below). Two optional input fields control this:

  • nextRunId: the ID of the task to call after the crawl finishes (optional)
  • nextRunAttribute: the input attribute of that task that receives the discovered URLs (default is 'startUrls', optional)
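A rough sketch of what this integration amounts to, assuming the follow-up task accepts a standard startUrls array of { url } objects (as apify/website-content-crawler does); the token, task ID, and example URLs are placeholders:

```typescript
import { ApifyClient } from 'apify-client';

// Placeholder token (assumption); replace with your own Apify API token.
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

// URLs the incremental crawler found (illustrative values only).
const newUrls = [
    'https://www.example-news-site.com/articles/new-post-1',
    'https://www.example-news-site.com/articles/new-post-2',
];

// The crawler calls the task given by nextRunId and writes the new URLs into
// the attribute named by nextRunAttribute ('startUrls' by default).
await client.task('your-task-id').call({
    startUrls: newUrls.map((url) => ({ url })),
});
```

In practice the crawler triggers this call itself once it has finished; you only need to create the follow-up task and pass its ID as nextRunId.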

Example Configurations

Basic Setup

```json
{
  "url": "https://www.example-news-site.com",
  "daysAgo": 2,
  "maxResults": 100
}
```

Advanced Setup with Follow-Up Task

```json
{
  "url": "https://www.example-news-site.com",
  "daysAgo": 2,
  "maxResults": 100,
  "nextRunId": "your-task-id",
  "nextRunAttribute": "startUrls"
}
```

Tips for Optimal Results

  • Use a moving window so that no updates are missed between runs. For example, set daysAgo to 2 or 3 days for daily crawls, or 8 or 9 days for weekly crawls (see the sketch below).
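As a rough illustration of the moving-window idea, you could derive daysAgo from your crawl interval plus a small overlap; the interval, overlap, and URL values here are assumptions for the example:

```typescript
// Derive daysAgo from the crawl interval plus a small overlap so that pages
// published just before the previous run are not missed.
const crawlIntervalDays = 7; // e.g. a weekly schedule (assumption)
const overlapDays = 1;       // safety margin; use 2 for a wider window

const input = {
    url: 'https://www.example-news-site.com',
    daysAgo: crawlIntervalDays + overlapDays, // 8 days for a weekly crawl
};
console.log(input);
```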

For more information on using the Incremental Link Crawler, please refer to our documentation.