### Session Management

Session management in Crawl4AI lets you maintain state across multiple requests, making it well suited to complex multi-step crawling tasks. It reuses the same browser tab (or page object) across sequential actions and crawls, which is useful for:

- **Performing JavaScript actions before and after crawling.**
- **Executing multiple sequential crawls faster**, without reopening tabs or reallocating memory for each request.

**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.

---

#### Basic Session Usage

Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        session_id = "my_session"

        # Define configurations
        config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
        config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

        # First request
        result1 = await crawler.arun(config=config1)

        # Subsequent request reuses the same tab and session state
        result2 = await crawler.arun(config=config2)

        # Clean up when done
        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(main())
```

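`BrowserConfig` applies to the browser instance as a whole, while each `CrawlerRunConfig` governs a single crawl; every call that passes the same `session_id` reuses the same tab until `kill_session` is called.
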
---

#### Dynamic Content with Sessions

Here's an example of crawling GitHub commits across multiple pages while preserving session state:

```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,  # after the first page, run JS in the open tab without re-navigating
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
```

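On every iteration after the first, `js_only=True` executes the JavaScript inside the tab that is already open instead of navigating to the URL again, so the pagination click acts on the live page state the session has preserved.
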
---

#### Session Best Practices

1. **Descriptive Session IDs**:
   Use meaningful names for session IDs to organize workflows:
   ```python
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```

2. **Resource Management**:
   Always ensure sessions are cleaned up to free resources:
   ```python
   try:
       # Your crawling code here
       pass
   finally:
       await crawler.crawler_strategy.kill_session(session_id)
   ```

3. **State Maintenance**:
   Reuse the session for subsequent actions within the same workflow:
   ```python
   # Step 1: Login
   login_config = CrawlerRunConfig(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )
   await crawler.arun(config=login_config)

   # Step 2: Verify login success
   dashboard_config = CrawlerRunConfig(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   result = await crawler.arun(config=dashboard_config)
   ```

---

#### Common Use Cases for Sessions

1. **Authentication Flows**: Log in and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results (see the sketch below).
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.

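To make use case 3 concrete, here is a minimal sketch of a session-based form submission. The URL and the `#query` / `.results li` selectors are hypothetical placeholders; substitute the ones from your target page:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def submit_search_form():
    async with AsyncWebCrawler() as crawler:
        session_id = "form_submission_session"
        try:
            # Fill in the (hypothetical) search form, submit it,
            # then wait for the results list to render.
            form_config = CrawlerRunConfig(
                url="https://example.com/search",  # placeholder URL
                session_id=session_id,
                js_code="""
                    document.querySelector('#query').value = 'crawl4ai';
                    document.querySelector('form').submit();
                """,
                wait_for="css:.results li",  # placeholder selector
            )
            result = await crawler.arun(config=form_config)
            return result.markdown if result.success else None
        finally:
            # Release the tab even if the crawl fails
            await crawler.crawler_strategy.kill_session(session_id)
```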
|