Writing your storage system
Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system to store element properties for the adaptive feature.
You might want to use Firebase, for example, and share the database between multiple spiders on different machines; an online database like that lets spiders share adaptive data with each other.
So first, to make your storage class work, it must do the big 3:

- Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library logic.
- Use the decorator `functools.lru_cache` on top of the class to follow the Singleton design pattern, as the other classes do.
- Implement the methods `save` and `retrieve`, as you see from the type hints:
    - The method `save` returns nothing and will get two arguments from the library:
        - The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the `element_to_dict` function in the submodule `scrapling.core.utils._StorageTools` to maintain the same format, and then saved to your database as you wish.
        - The second one is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will be messed up.
    - The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`.
If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the `storage_adaptors` file.
If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.
The abstract class also provides some helper functions if you want to use them. It's easier to see for yourself in the code; it's heavily commented :)
Real-World Example: Redis Storage
Here's a more practical example, generated by AI, that uses Redis:
```python
import redis
import orjson
from functools import lru_cache

from scrapling.core.storage import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)
class RedisStorage(StorageSystemMixin):
    # `url` comes first because the library passes it as the string argument
    def __init__(self, url: str = None, host: str = 'localhost', port: int = 6379, db: int = 0):
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False,  # keep raw bytes; orjson handles them
        )

    def save(self, element, identifier: str) -> None:
        # Convert the element to a dictionary in the library's standard format
        element_dict = _StorageTools.element_to_dict(element)
        # The key combines the base URL and the identifier, keeping each pair unique
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        # Store as JSON bytes
        self.redis.set(key, orjson.dumps(element_dict))

    def retrieve(self, identifier: str) -> dict | None:
        # Rebuild the same key used in `save`
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)
        # Parse the JSON if the key exists
        if data:
            return orjson.loads(data)
        return None
```
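To see what the Redis example actually stores, here is its key scheme and JSON round-trip in isolation; the stdlib `json` module stands in for `orjson`, and the base URL and element dictionary are hypothetical values, so nothing below needs a running Redis server.

```python
import json


def make_key(base_url: str, identifier: str) -> str:
    # Same key scheme as the Redis example above
    return f"scrapling:{base_url}:{identifier}"


# Hypothetical element dictionary, standing in for element_to_dict() output
element_dict = {'tag': 'div', 'attributes': {'class': 'product'}}

key = make_key('https://example.com', 'products')
payload = json.dumps(element_dict)  # what `save` would SET in Redis
restored = json.loads(payload)      # what `retrieve` would hand back

print(key)                       # scrapling:https://example.com:products
print(restored == element_dict)  # True
```

Because the identifier is embedded in the key alongside the base URL, two spiders saving different identifiers for the same site never overwrite each other, which is exactly the uniqueness requirement described earlier.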