Web Scraping 102

Community Article · Published August 20, 2024

Introduction

Welcome to Web Scraping 102! In Web Scraping 101 we conducted our initial Recon and Research for scraping CivitAI. This article covers retrieving and saving the data from the images endpoint we discovered earlier.

If you're sitting comfortably, let's begin.

Code Recap

Currently our code looks something like this:

from curl_cffi import requests
import json
from urllib.parse import quote

url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}

query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'

r = requests.get(f"{url}?{query}", impersonate="chrome")

j = r.json()

print(j)

We have defined the endpoint and payload, processed the payload to match the original request, then made the request and printed the retrieved data.
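If you're curious what that processed payload actually looks like on the wire, printing query shows the URL-encoded JSON (output truncated here for readability):

print(query)
# input=%7B%22json%22%3A%7B%22period%22%3A%22Week%22... (truncated)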

Our goal now is to retrieve more data and save it for later processing.

Stage 2: Retrieval

We are going to use the JSON Lines format, also known as JSONL, to efficiently save the data as it's retrieved.
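In JSONL every line of the file is a standalone JSON object, which makes it cheap to append records one at a time and to stream them back later. A couple of made-up records just to illustrate what the file will look like:

{"id": 1, "name": "example-1.png"}
{"id": 2, "name": "example-2.png"}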

If you haven't already, install jsonlines now using pip.

pip install jsonlines

The jsonlines library provides an easy-to-use interface that automatically handles encoding; let's set that up:

import jsonlines
import pathlib

base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"

writer = jsonlines.open(IMAGES, mode="a")

We're using pathlib for its convenient path handling, which will come in handy in later stages when we download the images.
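As a small illustration (these exact calls aren't part of our script yet, and the downloads folder is just a hypothetical example), pathlib lets us join paths with the / operator, create directories and check for files:

downloads = BASE / "downloads"                 # join paths with the / operator
downloads.mkdir(parents=True, exist_ok=True)   # create the folder if it doesn't exist
print(IMAGES.exists())                         # has the JSONL file been created yet?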

We use mode="a" so that results are appended the next time we run the script.

Let's begin by saving the data from that first request.

{'result': {'data': {'json': {'nextCursor': '798|170|525|24907786',
    'items': [{'id': 24294279,
    ...

The data we are looking for is under result.data.json.items, which is a list of the images.

>>> j['result']['data']['json']['items'][0]
{'id': 24294279,
 'name': '00018-2785547559.png',
 'url': 'bc6700f4-7fd3-4fc8-84ce-dcb154161850',
 'nsfwLevel': 1,
 'width': 1080,
 'height': 1680,
 'hash': 'U7BMuu0000~V%N%M4.OY00_4^*R457E1_4IT',
 'hideMeta': False,
 'hasMeta': True,
 'onSite': False,
 'generationProcess': 'img2img',
 'createdAt': '2024-08-14T17:20:13.567Z',
 'sortAt': '2024-08-14T17:20:49.238Z',
 'mimeType': 'image/png',
 'type': 'image',
 'metadata': {'hash': 'U7BMuu0000~V%N%M4.OY00_4^*R457E1_4IT',
  'size': 1928858,
  'width': 1080,
  'height': 1680},
 'ingestion': 'Scanned',
 'scannedAt': '2024-08-14T17:20:22.351Z',
 'needsReview': None,
 'postId': 5424571,
 'postTitle': None,
 'index': 1,
 'publishedAt': '2024-08-14T17:20:49.238Z',
 'modelVersionId': None,
 'availability': 'Public',
 'user': {'id': 1409647,
  'username': 'Tommu',
  'image': 'e162fba0-61d0-4884-8da6-a2ad760e2b4f',
  'deletedAt': None,
  'cosmetics': [{'data': None,
    'cosmetic': {'id': 106,
     'data': {'url': 'e34e2479-8a48-4b7b-8e62-31a70fe1490c'},
     'type': 'Badge',
     'source': 'Trophy',
     'name': 'Bronze Generator Badge'}}],
  'profilePicture': None},
 'stats': {'cryCountAllTime': 154,
  'laughCountAllTime': 312,
  'likeCountAllTime': 3302,
  'dislikeCountAllTime': 0,
  'heartCountAllTime': 1277,
  'commentCountAllTime': 0,
  'collectedCountAllTime': 165,
  'tippedAmountCountAllTime': 360,
  'viewCountAllTime': 0},
 'reactions': [],
 'tags': None,
 'tagIds': [4,
  5262,
  3642,
  3628,
  122902,
  161864,
  162034,
  5148,
  16001,
  120104,
  3629,
  111850,
  2687,
  114877,
  6997,
  161945,
  115513,
  111656,
  5453,
  1853,
  27182,
  140255,
  112994,
  162490,
  531,
  118075,
  153354,
  119779,
  162489,
  5773,
  7084,
  4213,
  161952,
  5784,
  163963,
  120504],
 'cosmetic': None}

You may notice the url field is just a UUID; the full url must be built from this and other metadata. Let's head back to the Explore all images page for some more recon. Since there are images on this page, we can look for an example url to determine the format. Often the easiest way to do this is to right-click an image and copy the image link.

https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/anim=false,width=450/00018-2785547559.jpeg

We recognize some of this from the data:

url: bc6700f4-7fd3-4fc8-84ce-dcb154161850
name: 00018-2785547559.jpeg

The extension of the copied image url is jpeg, even though the original name ends in png; this either means a jpeg version is being served, or the extension is wrong, as you may have experienced before when saving a webp thinking you were getting a jpg.

https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/

This part looks like a prefix; we can confirm xG1nkqKTMzGDvpLrqFT7WA doesn't change by checking a few more image urls.

anim=false,width=450

This part appears to control the size of the image served; let's test what happens when we remove it:

https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/00018-2785547559.jpeg

Cool! We're getting the full-size image.

Let's test with a different extension:

https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/00018-2785547559.png

You'll notice the file size is the same. This means the name in our data is the original filename, and the server is configured to return the image it has regardless of the extension requested. We'll stick with jpeg as the extension, since there is likely some caching behind the scenes keyed on the url, meaning the jpeg extension will load faster.
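If you'd like to verify this without downloading the files, one option (assuming the image server answers HEAD requests and reports a content-length) is a quick check like:

# illustrative check: the same size should be reported under either extension
for ext in ("jpeg", "png"):
    resp = requests.head(
        "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/"
        "bc6700f4-7fd3-4fc8-84ce-dcb154161850/"
        f"00018-2785547559.{ext}",
        impersonate="chrome",
    )
    print(ext, resp.headers.get("content-length"))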

So, our image url format is something like:

"https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"

We can process each record before saving it to replace the url with this format, or process it at a later point. Let's do it as we save the records.

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"

for image in j['result']['data']['json']['items']:
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)

Great work! At this point we could also process other fields or remove ones we don't want. There's something more important, though: what about duplicate records? The current payload retrieves images for period: Week with sort: Most Reactions; naturally the results will change over time and as users react to images. We need to keep track of what we already have.

If we were using a database like MongoDB or PostgreSQL we could simply add a unique index on the image's id; we can accomplish the same thing by storing the ids of images we've already seen. We'll use a set rather than a list for performance.
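As a rough illustration of why: membership tests on a set are effectively constant time, while a list has to be scanned element by element, which adds up once we have tens of thousands of ids.

import timeit

ids_list = list(range(100_000))
ids_set = set(ids_list)

# looking up an id that sits near the end of the collection
print(timeit.timeit(lambda: 99_999 in ids_list, number=1_000))  # scans the whole list
print(timeit.timeit(lambda: 99_999 in ids_set, number=1_000))   # hash lookup, far faster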

image_ids = set()

for image in j['result']['data']['json']['items']:
    if image['id'] in image_ids:
        continue
    image_ids.add(image['id'])
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)

Pretty simple changes: if the image's id is already in the set, we skip it. However, we need this to be more robust so we can restart the script at any point without losing track of what we've seen. We'll persist the set with regular json:

IMAGE_IDS = BASE / "image_ids.json"
if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

and at the end of our script we'll use something like:

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()

We also close the jsonlines file at that point.
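As an aside, the jsonlines writer also works as a context manager, so if you prefer you can let a with block handle the closing for you. A simplified sketch of the same write loop:

with jsonlines.open(IMAGES, mode="a") as writer:
    for image in j["result"]["data"]["json"]["items"]:
        writer.write(image)
# the file is closed automatically, even if an error occurs inside the block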

Putting everything so far together we have something like:

from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote

base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"
if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}
query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'

r = requests.get(f"{url}?{query}", impersonate="chrome")
j = r.json()

for image in j['result']['data']['json']['items']:
    if image['id'] in image_ids:
        continue
    image_ids.add(image['id'])
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()

Great! Now how do we get more images? You may have noticed nextCursor in the data and cursor as part of the payload.

{'result': {'data': {'json': {'nextCursor': '799|152|530|24469485',

That's what we'll be using: we'll set up a loop, replacing the cursor for each subsequent request. We'll use a while loop, but we need to decide on a stop condition; some further recon and research will help us figure that out.

Head back to the Explore all images page with Developer Tools open as before, and apply some filters to get a smaller set of results; something like Time Period: Day and Base Model: PixArt E should work. Yep, only a few results. Let's check the response: nextCursor is null (None in Python), and that's our stop condition. Let's implement the loop:

cursor = None
process = True

while process:
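    # keep requesting pages until the API stops returning a nextCursor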
    payload['json']['cursor'] = cursor
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", impersonate="chrome")
    j = r.json()
    for image in j['result']['data']['json']['items']:
        if image['id'] in image_ids:
            continue
        image_ids.add(image['id'])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j['result']['data']['json']['nextCursor'] is not None
    cursor = j['result']['data']['json']['nextCursor']

We set the initial cursor to None; on each iteration we put cursor into the payload, then update cursor to the value of nextCursor. We set process by checking whether nextCursor is not None, so the loop stops once there are no more pages.

Awesome! Our code in full now looks something like:

from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote

base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"
if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}

cursor = None
process = True

while process:
    payload['json']['cursor'] = cursor
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", impersonate="chrome")
    j = r.json()
    for image in j['result']['data']['json']['items']:
        if image['id'] in image_ids:
            continue
        image_ids.add(image['id'])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j['result']['data']['json']['nextCursor'] is not None
    cursor = j['result']['data']['json']['nextCursor']

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()

Let's run it and see what happens:

KeyError: 'result'

{'error': {'json': {'message': 'Please use the public API instead: https://github.com/civitai/civitai/wiki/REST-API-Reference',
   'code': -32001,
   'data': {'code': 'UNAUTHORIZED',
    'httpStatus': 401,
    'path': 'image.getInfinite'}}}}

Oh no! There must be something wrong. Let's go back and do some more recon. Scroll down the page to let more images load, then check the new request and look at its payload: it looks like the meta field is removed when cursor is not None. Let's sort that out:


while process:
    payload["json"]["cursor"] = cursor
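    # the web app only sends the meta field on the first request, so drop it once we have a cursor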
    if cursor is not None:
        _ = payload.pop("meta", None)

Oh no! It's still not working. There must be something else. Our goal is to match the original requests, so let's check out that request in Developer Tools as before. We'll notice a bunch of Request Headers; let's try replicating those.

You'll notice the standard headers, like accept and content-type, but these look special:

x-client: web
x-client-date: 1724139341492
x-client-version: 4.0.169

Indeed they are. These are custom headers set by CivitAI's web application.

x-client-date looks like a timestamp, so we should generate it fresh each time we send a request.
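We can sanity-check that guess by decoding the captured value; it turns out to be a Unix timestamp in milliseconds:

import datetime

# 1724139341492 ms since the epoch -> 2024-08-20 07:35:41.492000+00:00,
# around when the request was captured
print(datetime.datetime.fromtimestamp(1724139341492 / 1000, tz=datetime.timezone.utc))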

We'll also add accept, accept-language, content-type and Referer. Since we need to generate x-client-date each time, we'll use a function to return the headers:

def headers():
    return {
        "accept": "*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "x-client": "web",
        "x-client-date": str(int(datetime.datetime.now().timestamp() * 1000)),
        "x-client-version": "4.0.169",
        "Referer": "https://civitai.com/images",
    }

Then modify the request to include these headers:

r = requests.get(f"{url}?{query}", headers=headers(), impersonate="chrome")

While we're making changes, let's add some basic progress reporting with a print:

    ...
    process = j["result"]["data"]["json"]["nextCursor"] is not None
    cursor = j["result"]["data"]["json"]["nextCursor"]
    print(len(image_ids))

To recap, our code now looks something like this:

from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote
import datetime


def headers():
    return {
        "accept": "*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "x-client": "web",
        "x-client-date": str(int(datetime.datetime.now().timestamp() * 1000)),
        "x-client-version": "4.0.169",
        "Referer": "https://civitai.com/images",
    }


base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"
if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}

cursor = None
process = True
while process:
    payload["json"]["cursor"] = cursor
    if cursor is not None:
        _ = payload.pop("meta", None)
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", headers=headers(), impersonate="chrome")
    j = r.json()
    for image in j["result"]["data"]["json"]["items"]:
        if image["id"] in image_ids:
            continue
        image_ids.add(image["id"])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j["result"]["data"]["json"]["nextCursor"] is not None
    cursor = j["result"]["data"]["json"]["nextCursor"]
    print(len(image_ids))

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()

Time to run the script again 🤞

100
200
300
...
2598
2698
2798
...

Wow! Awesome! Data acquired 😎

We've learned the importance of matching not only the request payload but also the headers, which can otherwise give your request away as not coming from the site's own client, and how to efficiently save our acquired data while keeping track of what we already have. Great work!

We'll take a short break at this point while we prepare for stage 3, where we'll refine our process: adding error checking, better progress reporting and more!