### Using `storage_state` to Pre-Load Cookies and LocalStorage

Crawl4ai’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data—no need to repeat the login flow every time.

#### What is `storage_state`?

`storage_state` can be:

- A dictionary containing cookies and localStorage data.
- A path to a JSON file that holds this information.

When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.

#### Example Structure

Here’s an example storage state:

```json
{
  "cookies": [
    {
      "name": "session",
      "value": "abcd1234",
      "domain": "example.com",
      "path": "/",
      "expires": 1675363572.037711,
      "httpOnly": false,
      "secure": false,
      "sameSite": "None"
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        { "name": "token", "value": "my_auth_token" },
        { "name": "refreshToken", "value": "my_refresh_token" }
      ]
    }
  ]
}
```

This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.

---

### Passing `storage_state` as a Dictionary

You can directly provide the data as a dictionary:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Passing `storage_state` as a File

If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)

A common scenario is needing to log in to a site (entering a username and password, etc.) to access protected pages. Doing so on every crawl is cumbersome. Instead, you can:

1. Perform the login once in a hook.
2. After login completes, export the resulting `storage_state` to a file.
3. On subsequent runs, provide that `storage_state` to skip the login step.

**Step-by-Step Example:**

**First Run (Perform Login and Save State):**

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()

    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")

    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")

    # Now the site has set its tokens in localStorage and cookies.
    # Export this state to a file so we can reuse it.
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:
        # After on_browser_created_hook runs, the storage state is saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Second Run (Reuse Saved State, No Login Needed):**

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:
        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

**What’s Happening Here?**

- During the first run, the `on_browser_created_hook` logs into the site.
- After logging in, the crawler exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
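
If you want to confirm that the first run actually captured a session before reusing it, you can inspect the exported file. Here’s a minimal sketch, assuming the state was saved to `my_storage_state.json` as above; it only reads the `cookies`, `origins`, and `localStorage` keys shown in the example structure earlier:

```python
import json

# Load the storage state exported during the first run (path assumed from the example above)
with open("my_storage_state.json") as f:
    state = json.load(f)

# List the cookie names that were captured
print("Cookies:", [cookie["name"] for cookie in state.get("cookies", [])])

# List the localStorage keys captured per origin
for origin in state.get("origins", []):
    keys = [entry["name"] for entry in origin.get("localStorage", [])]
    print(f"localStorage for {origin['origin']}:", keys)
```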

**Sign Out Scenario:**

If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That gives you a baseline “logged out” state to start from next time, as sketched below.
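
For example, a sign-out run can mirror the first-run hook: open the site’s sign-out URL (or click its sign-out control), then export the now-cleared state. This is only a sketch under the same assumptions as the examples above; the `/logout` URL and the `logged_out_state.json` path are placeholders you would adapt to your target site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def sign_out_hook(browser):
    # Reuse the already-authenticated default context
    context = browser.contexts[0]
    page = await context.new_page()

    # Hypothetical sign-out URL; many sites clear session cookies/localStorage here
    await page.goto("https://example.com/logout", wait_until="networkidle")

    # Export the now "logged out" state as a clean baseline for future runs
    await context.storage_state(path="logged_out_state.json")
    await page.close()

async def main():
    async with AsyncWebCrawler(
        headless=True,
        hooks={"on_browser_created": sign_out_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Start from the logged-in state
    ) as crawler:
        result = await crawler.arun(url="https://example.com/")
        print("Sign-out run success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```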

---

### Conclusion

By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.