Web Scraping 101

Community Article Published August 19, 2024

Introduction

In previous articles we covered the basics of data formats and processing parquets. This time we're covering the basics of web scraping. Actually, you're going to be thrown in at the deep end and learn from a real-world example. Try to follow along and don't worry; we'll cover more of the basics in later articles.

What will we be scraping?

We are going to be scraping CivitAI. However, we're not going to work with the documented API; we are going to work with the real API that the site itself uses.

If you're sitting comfortably, let's begin.

Stage 1: Recon

The first stage of any good mission is reconnaissance. To get started, direct your browser to CivitAI. This article is best followed with a Chromium-based browser such as Chrome or Edge; other browsers may vary in the exact locations and naming of the features we're discussing.

We're on the homepage and we can see some examples of all that good data we want, but where does it come from? Let's find out.

Open your browser's Developer Tools; in most browsers the keyboard shortcut is F12. Select the Network tab, then select Fetch/XHR. As the name suggests, this filters requests to those that fetch data, which is what we're looking for.

Go ahead and refresh the page and you'll see some requests appear. Select one and review the contents under the tabs titled Headers, Payload, Preview and Response. These are what we'll be looking at to determine the requests we'll be sending later.

You should see several requests with names beginning with homeBlock.getHomeBlock. Select the first one of these, then select the Headers tab.

Looking at the General section we can see the Request URL is something like https://civitai.com/api/trpc/homeBlock.getHomeBlock?input=%7B%22json%22%3A%7B%22id%22%3A2%7D%7D and the Request Method is GET. This basic information, the URL and the request method, forms the basis of all the requests we'll make.

Let's look under Payload. As this is a GET request we're seeing Query String Parameters, with options for view source and view decoded. You may notice this is the same data we're seeing in the Request URL; that's the nature of GET requests, the query is a part of the URL. Click view decoded and we'll see something like input: {"json":{"id":2}}, which is much easier to read and reuse later.
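The decoding that view decoded performs can also be reproduced in Python with the standard library alone. This is a small sketch rather than a required step; the variable names are just for illustration, and the URL is the one from the Headers tab above.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The Request URL copied from the Headers tab
url = "https://civitai.com/api/trpc/homeBlock.getHomeBlock?input=%7B%22json%22%3A%7B%22id%22%3A2%7D%7D"

# Split off the query string and percent-decode it.
# parse_qs returns a dict of lists because a key may repeat.
params = parse_qs(urlsplit(url).query)
decoded_input = params["input"][0]

# The decoded value is itself a JSON document
payload = json.loads(decoded_input)

print(decoded_input)          # {"json":{"id":2}}
print(payload["json"]["id"])  # 2
```

Seeing the query as a Python dict makes it obvious which parts we can change later, in this case the id.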

Now check out Preview. This is the data returned by the request, in a nice foldable view. Go ahead and unfold everything and we'll see that this particular request is returning metadata about announcements, cool!

{
    "result": {
        "data": {
            "json": {
                "id": 2,
                "metadata": {
                    "title": "Announcements",
                    "announcements": {
                        "limit": 4
                    }
                },
                "type": "Announcement",
                "userId": -1,
                "sourceId": null
            }
        }
    }
}

Let's check out the next request. Follow the same steps and we'll find that it is using a very similar query: input: {"json":{"id":3}}. This time the id is 3, and the data being returned...

{
    "result": {
        "data": {
            "json": {
                "id": 94874,
                "metadata": {
                    "link": "/images",
                    "title": "Featured Images",
                    "linkText": "Explore all images",
                    "withIcon": true,
                    "collection": {
                        "id": 107,
                        "rows": 2,
                        "limit": 8
                    },
                    "description": "All sorts of cool pictures created by our community, from simple shapes to detailed landscapes or human faces. A virtual canvas where you can unleash your creativity or get inspired."
                },
                "type": "Collection",
                "userId": 3942244,
                "sourceId": 3,
                "collection": {
                    "id": 107,
                    "name": "Featured Images",
                    "description": "",
                    "read": "Public",
                    "write": "Private",
                    "type": "Image",
                    "nsfw": false,
                    "nsfwLevel": 31,
                    "image": null,
                    "mode": null,
                    "metadata": {},
                    "availability": "Public",
                    "userId": -1,
                    "tags": [],
                    "user": {
                        "id": -1,
                        "username": "civitai",
                        "deletedAt": null,
                        "image": null,
                        "profilePicture": null,
                        "cosmetics": []
                    },
                    "items": [
                        {
                            "id": 50828946,
                            "status": "ACCEPTED",
                            "createdAt": "2024-08-19T08:24:33.498Z",
                            "randomId": 9311071,
                            "type": "image",
                            "data": {
                                "id": 25066010,
                                "name": "ComfyUI_05170_.png",
                                "url": "0dbda714-09a2-4960-98e1-153f430af450",
                                "nsfwLevel": 1,
                                "width": 1792,
                                "height": 2304,
                                "hash": "UACYmt.70jRP~ocEIpv#5?-o}-IWEO=^=_0*",
                                "hideMeta": false,
                                "hasMeta": false,
                                "onSite": false,
                                "generationProcess": null,
                                "createdAt": "2024-08-19T07:47:36.175Z",
                                "sortAt": "2024-08-19T07:47:56.935Z",
                                "mimeType": "image/png",
                                "type": "image",
                                "metadata": {
                                    "hash": "UACYmt.70jRP~ocEIpv#5?-o}-IWEO=^=_0*",
                                    "size": 7394231,
                                    "width": 1792,
                                    "height": 2304
                                },
                                "ingestion": "Scanned",
                                "scannedAt": "2024-08-19T07:47:44.831Z",
                                "needsReview": null,
                                "postId": 5601158,
                                "postTitle": null,
                                "index": 1,
                                "publishedAt": "2024-08-19T07:47:56.935Z",
                                "modelVersionId": 691639,
                                "availability": "Public",
                                "baseModel": "Flux.1 D",
                                "user": {
                                    "id": 1508786,
                                    "username": "Harri79",
                                    "image": null,
                                    "deletedAt": null,
                                    "cosmetics": [],
                                    "profilePicture": null
                                },
                                "stats": {
                                    "cryCountAllTime": 12,
                                    "laughCountAllTime": 47,
                                    "likeCountAllTime": 401,
                                    "dislikeCountAllTime": 0,
                                    "heartCountAllTime": 143,
                                    "commentCountAllTime": 1,
                                    "collectedCountAllTime": 1,
                                    "tippedAmountCountAllTime": 79,
                                    "viewCountAllTime": 0
                                },
                                "reactions": [],
                                "tagIds": [
                                    2309,
                                    5448,
                                    3350,
                                    111931,
                                    120041,
                                    112807,
                                    8447,
                                    140292,
                                    112289,
                                    116352
                                ],
                                "cosmetic": null
                            }
                        }
                    ]
                }
            }
        }
    }
}

Awesome! This is more like what we want; images!
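To get a feel for the nesting, here is a rough sketch of how the image entries could be pulled out of a response shaped like the one above once it has been parsed into a Python dict. The sample dict is a heavily trimmed copy of that response, and iter_images is a hypothetical helper name, not something the site or any library provides.

```python
# Trimmed copy of the response above; only a few fields are kept.
sample = {
    "result": {"data": {"json": {
        "type": "Collection",
        "collection": {
            "name": "Featured Images",
            "items": [
                {"type": "image", "data": {
                    "id": 25066010,
                    "name": "ComfyUI_05170_.png",
                    "width": 1792,
                    "height": 2304,
                    "baseModel": "Flux.1 D",
                    "user": {"username": "Harri79"},
                }},
            ],
        },
    }}}
}


def iter_images(response):
    """Yield the 'data' dict of every image item in a homeBlock response."""
    block = response["result"]["data"]["json"]
    # Blocks such as "Announcement" carry no collection, so fall back to empty
    collection = block.get("collection") or {}
    for item in collection.get("items") or []:
        if item.get("type") == "image":
            yield item["data"]


for image in iter_images(sample):
    print(image["id"], image["name"], f'{image["width"]}x{image["height"]}')
```

The fallbacks mean the same helper returns nothing, rather than failing, for the announcements block we saw first.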

Stage 1a: Research

The second part of Recon is Research. Before we go any further, we want to try making this request ourselves to identify any potential problems.

We are going to be using curl-cffi, a Python binding for curl-impersonate. Traffic sent with Python's requests library can sometimes be blocked due to TLS (JA3) fingerprinting, and curl-cffi is generally faster as well. As the name curl-impersonate suggests, we will be able to specify a browser to impersonate.

If you haven't already, install curl-cffi now using pip.

pip install curl-cffi

Open your favourite code editor, a Python interactive shell, or Jupyter notebook and let's begin.

If you've used the requests library before, the interface for curl-cffi is very similar. The main difference is the impersonate parameter, which we will demonstrate now.

from curl_cffi import requests

# copy the url from earlier
url = "https://civitai.com/api/trpc/homeBlock.getHomeBlock?input=%7B%22json%22%3A%7B%22id%22%3A3%7D%7D"

r = requests.get(url, impersonate="chrome")

j = r.json()

print(j)

Great! No problems yet. We can get the same data we're seeing on the homepage. Let's just clean this up a little. We know the input of the query is actually a JSON dict, and we can change the id in that dict to get different homeBlocks, so...

import json

input_payload = {"json":{"id":3}}
input_query = {"input": json.dumps(input_payload, separators=(",", ":"))}
url = "https://civitai.com/api/trpc/homeBlock.getHomeBlock"
r = requests.get(url, params=input_query, impersonate="chrome")

j = r.json()

print(j)

That's better! We can more easily change the id in the input_payload. For input_query we are using json.dumps with separators=(",", ":") to produce compact JSON without spaces, matching the query the browser sends. curl-cffi then URL-encodes the parameters internally to produce the final query string.
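To see what the separators argument changes, compare the default json.dumps output with the compact form. This is only an illustration using the standard library; whether the server would tolerate the extra spaces is not something we have tested here.

```python
import json
from urllib.parse import quote

input_payload = {"json": {"id": 3}}

# Default separators are (", ", ": "), which add a space after each one
default_json = json.dumps(input_payload)
# Compact separators drop those spaces
compact_json = json.dumps(input_payload, separators=(",", ":"))

print(default_json)         # {"json": {"id": 3}}
print(compact_json)         # {"json":{"id":3}}
print(quote(compact_json))  # %7B%22json%22%3A%7B%22id%22%3A3%7D%7D
```

The percent-encoded compact form is exactly the input value from the URL we copied out of the browser.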

Stage 1b: More Recon!

We don't just want the images from the homepage, we want all the images, let's continue recon on the Explore all images page.

We see a new request with a name starting with image.getInfinite, that sounds more like it!

Following the same steps we get something like...

url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {"json":{"period":"Week","sort":"Most Reactions","types":["image"],"browsingLevel":1,"include":["cosmetics"],"cursor":None},"meta":{"values":{"cursor":["undefined"]}}}
input_query = {"input": json.dumps(payload, separators=(",", ":"))}

r = requests.get(url, params=input_query, impersonate="chrome")

j = r.json()

print(j)
{'error': {'json': {'message': 'Please use the public API instead: https://github.com/civitai/civitai/wiki/REST-API-Reference', 'code': -32001, 'data': {'code': 'UNAUTHORIZED', 'httpStatus': 401, 'path': 'image.getInfinite'}}}}

Oh no! It seems something's wrong. Let's check the url and compare it to the original.

>>> r.url
'https://civitai.com/api/trpc/image.getInfinite?input=%7B%22json%22%3A%7B%22period%22%3A%22Week%22%2C%22sort%22%3A%22Most+Reactions%22%2C%22types%22%3A%5B%22image%22%5D%2C%22browsingLevel%22%3A1%2C%22include%22%3A%5B%22cosmetics%22%5D%2C%22cursor%22%3Anull%7D%2C%22meta%22%3A%7B%22values%22%3A%7B%22cursor%22%3A%5B%22undefined%22%5D%7D%7D%7D'

and the original...

https://civitai.com/api/trpc/image.getInfinite?input=%7B%22json%22%3A%7B%22period%22%3A%22Week%22%2C%22sort%22%3A%22Most%20Reactions%22%2C%22types%22%3A%5B%22image%22%5D%2C%22browsingLevel%22%3A1%2C%22include%22%3A%5B%22cosmetics%22%5D%2C%22cursor%22%3Anull%7D%2C%22meta%22%3A%7B%22values%22%3A%7B%22cursor%22%3A%5B%22undefined%22%5D%7D%7D%7D

We are getting Most+Reactions and the original is Most%20Reactions. This is due to how curl-cffi encodes the params internally: the space becomes + (form-style encoding) rather than %20, and the server appears to reject the difference. Let's build the full url ourselves.

from urllib.parse import quote

query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'

r = requests.get(f"{url}?{query}", impersonate="chrome")

j = r.json()

print(j)
{'result': {'data': {'json': {'nextCursor': '798|170|525|24907786', ...
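The two encodings of the space can be reproduced with the standard library alone, which makes the difference easy to see. This is just an illustration of the encoding behaviour; the "sort" key is used here on its own purely as an example.

```python
from urllib.parse import quote, quote_plus, urlencode

value = "Most Reactions"

# quote percent-encodes the space, as the browser did
print(quote(value))                                 # Most%20Reactions
# quote_plus uses form-style encoding, where a space becomes "+"
print(quote_plus(value))                            # Most+Reactions
# urlencode uses quote_plus unless told otherwise
print(urlencode({"sort": value}))                   # sort=Most+Reactions
print(urlencode({"sort": value}, quote_via=quote))  # sort=Most%20Reactions
```

Passing quote_via=quote to urlencode is an alternative to formatting the query string by hand, and is handy once there is more than one parameter.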

Awesome! Great work so far. We've learned the importance of matching what we send to the original requests.
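Notice that the failed request did not raise anything in Python; the problem only showed up inside the JSON body. A small helper can make that explicit. The sketch below assumes responses keep the two shapes we have seen so far (an "error" object, or "result" → "data" → "json"); TrpcError and unwrap are hypothetical names chosen for this example.

```python
class TrpcError(Exception):
    """Raised when a response body carries an 'error' object (hypothetical name)."""


def unwrap(body):
    """Return the inner payload of a response, or raise if it is an error."""
    if "error" in body:
        err = body["error"]["json"]
        code = (err.get("data") or {}).get("code")
        raise TrpcError(f'{code}: {err.get("message")}')
    return body["result"]["data"]["json"]


# The error body we received earlier
error_body = {'error': {'json': {
    'message': 'Please use the public API instead: https://github.com/civitai/civitai/wiki/REST-API-Reference',
    'code': -32001,
    'data': {'code': 'UNAUTHORIZED', 'httpStatus': 401, 'path': 'image.getInfinite'},
}}}

try:
    unwrap(error_body)
except TrpcError as exc:
    print("request failed:", exc)
```

With this in place, data = unwrap(r.json()) either gives us the payload directly or fails loudly at the point where the request went wrong.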

We'll take a short break at this point while we prepare for stage 2, where we'll learn about pagination and efficiently saving the data. We'll be using JSONL (also known as jsonlines) to store the data, so take a moment to check out Data Formats 101, which covers the basics of the format if it is unfamiliar to you.