
Writing your retrieval system

Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system to store element properties for the adaptive feature.

You might want to use Firebase, for example, and share the database between multiple spiders running on different machines. An online database like that is a great fit because the spiders can share adaptive data with each other.

First, for your storage class to work, it must meet three requirements:

  1. Inherit from the abstract class scrapling.core.storage.StorageSystemMixin and accept a string argument, which will be the url argument, to maintain the library's logic.
  2. Use the functools.lru_cache decorator on top of the class so it follows the Singleton design pattern, like the other storage classes.
  3. Implement the methods save and retrieve, as described by their type hints:
    • The save method returns nothing and receives two arguments from the library:
      • The first is of type lxml.html.HtmlElement: the element itself. It must be converted to a dictionary with the _StorageTools.element_to_dict function from scrapling.core.utils to keep the format consistent, then saved to your database however you like.
      • The second is a string: the identifier used for retrieval. The combination of this identifier and the url argument from initialization must be unique for each row, or the adaptive data will get mixed up.
    • The retrieve method takes a string, the identifier; combined with the url passed at initialization, it looks up the element's dictionary in the database and returns it if it exists; otherwise, it returns None.
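The contract above can be sketched with a minimal in-memory store. This is an illustrative, standalone sketch: it does not import scrapling (a real class must inherit StorageSystemMixin and will receive an lxml HtmlElement to convert with _StorageTools.element_to_dict), so here save takes an already-converted dictionary directly:

```python
from functools import lru_cache


@lru_cache(None)  # Singleton pattern: one shared instance per unique url
class InMemoryStorage:
    """Standalone stand-in for a real storage class (does not inherit the mixin)."""

    def __init__(self, url: str = None):
        self.url = url
        self._data = {}

    def save(self, element_dict: dict, identifier: str) -> None:
        # The (url, identifier) pair keys each row uniquely
        self._data[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        # Return the stored dictionary if it exists, otherwise None
        return self._data.get((self.url, identifier))
```

Note how lru_cache on the class means constructing it twice with the same url returns the same instance, which is exactly the Singleton behavior requirement 2 asks for.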

If these instructions aren't clear enough, you can check my SQLite3 implementation in the storage_adaptors file.

If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.
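One simple way to make a custom store thread-safe is to guard all reads and writes with a lock. The sketch below uses hypothetical names and an in-memory dict (a real class would also inherit the mixin and talk to an actual database):

```python
import threading


class ThreadSafeStore:
    """Hypothetical store: a lock serializes access so concurrent
    threads can call save/retrieve without corrupting the data."""

    def __init__(self, url: str = None):
        self.url = url
        self._lock = threading.Lock()
        self._data = {}

    def save(self, element_dict: dict, identifier: str) -> None:
        with self._lock:
            self._data[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        with self._lock:
            return self._data.get((self.url, identifier))
```

For a real database backend, the same idea applies: either hold a lock around each query or use a connection mechanism that the database driver documents as thread-safe.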

Some helper functions are available on the abstract class if you want to use them. It's easier to see for yourself in the code; it's heavily commented :)

Real-World Example: Redis Storage

Here's a more practical example, generated with AI, that uses Redis:

```python
import redis
import orjson
from functools import lru_cache
from scrapling.core.storage import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)  # Singleton: one instance per unique set of arguments
class RedisStorage(StorageSystemMixin):
    def __init__(self, url: str = None, host: str = "localhost", port: int = 6379, db: int = 0):
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False,  # keep raw bytes; orjson handles them directly
        )

    def save(self, element, identifier: str) -> None:
        # Convert the element to a dictionary in the library's standard format
        element_dict = _StorageTools.element_to_dict(element)

        # The (url, identifier) combination keys each row uniquely
        key = f"scrapling:{self._get_base_url()}:{identifier}"

        # Store as JSON bytes
        self.redis.set(key, orjson.dumps(element_dict))

    def retrieve(self, identifier: str) -> dict | None:
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)

        # Parse the JSON if the key exists; otherwise return None
        if data:
            return orjson.loads(data)
        return None
```
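You can sanity-check the key scheme and the serialization round trip without a live Redis server. This standalone sketch substitutes a plain dict for Redis and the standard-library json module for orjson; make_key and the example values are illustrative, not part of Scrapling's API:

```python
import json


def make_key(base_url: str, identifier: str) -> str:
    # Same scheme as the Redis example: one row per (url, identifier) pair
    return f"scrapling:{base_url}:{identifier}"


# A plain dict stands in for the Redis server
fake_redis: dict[str, bytes] = {}
element_dict = {"tag": "div", "attributes": {"class": "product"}}

# Simulate save: serialize to JSON bytes under the composite key
key = make_key("https://example.com", "product-card")
fake_redis[key] = json.dumps(element_dict).encode()

# Simulate retrieve: parse the JSON back into a dictionary
restored = json.loads(fake_redis[key])
```

If restored equals the original dictionary, the round trip preserves the element data, which is all the adaptive feature needs from a storage backend.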