Server infrastructure

The dataset viewer has two main components that work together to answer queries about a dataset instantly:

  • a user-facing web API for exploring and returning information about a dataset
  • a server that runs the queries ahead of time and caches them in a database

While most of the documentation focuses on the web API, the server is crucial because it performs all the time-consuming preprocessing and stores the results so the web API can retrieve and serve them to the user. This saves the user time: instead of generating the response every time it’s requested, the dataset viewer returns the preprocessed results instantly from the cache.
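
For example, fetching the list of splits for a dataset is a single HTTP request, and the response is read from the cache rather than computed on the fly. Here is a minimal sketch in Python using the requests library (the dataset name is just an example):

```python
import requests

API_URL = "https://datasets-server.huggingface.co"

# The response comes straight from the cache the server filled ahead of time,
# which is why it arrives almost instantly even for large datasets.
response = requests.get(f"{API_URL}/splits", params={"dataset": "ibm/duorc"})
response.raise_for_status()
print(response.json())
```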

There are three elements that keep the server running: the job queue, workers, and the cache.

Job queue

The job queue is a list of jobs, stored in a Mongo database, that the workers should complete. The jobs are practically identical to the endpoints the user calls; the only difference is that the server runs the jobs ahead of time, so the results are already available when the user hits the endpoint.

There are three jobs:

  • /splits corresponds to the /splits endpoint. It refreshes a dataset, and then returns that dataset’s splits and subsets. For every split in the dataset, it’ll create a new job (the sketch after this list shows how the first two jobs chain together).
  • /first-rows corresponds to the /first-rows endpoint. It gets the first 100 rows and columns of a dataset split.
  • /parquet corresponds to the /parquet endpoint. It downloads the whole dataset, converts it to Parquet, and publishes the Parquet files to the Hub.
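
To see how the first two jobs chain together, the sketch below walks the output of /splits and requests /first-rows for each split, assuming the same public API as above; the fields used here (splits, config, split, rows) follow the documented response formats:

```python
import requests

API_URL = "https://datasets-server.huggingface.co"
dataset = "ibm/duorc"  # example dataset

# /splits lists every (subset, split) pair of the dataset...
splits = requests.get(f"{API_URL}/splits", params={"dataset": dataset}).json()["splits"]

# ...and for each pair the server will have run a /first-rows job,
# so these requests are also answered from the cache.
for s in splits:
    first_rows = requests.get(
        f"{API_URL}/first-rows",
        params={"dataset": dataset, "config": s["config"], "split": s["split"]},
    ).json()
    print(s["config"], s["split"], len(first_rows["rows"]))
```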

You might’ve noticed the /rows and /search endpoints don’t have a job in the queue. The responses from these endpoints are generated on demand.
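
For instance, a request to /rows names the slice it wants, and that slice is fetched when the call arrives instead of being served from a precomputed cache entry. A sketch, with example parameter values:

```python
import requests

API_URL = "https://datasets-server.huggingface.co"

# /rows is computed on demand rather than precomputed by a job.
response = requests.get(
    f"{API_URL}/rows",
    params={
        "dataset": "ibm/duorc",  # example dataset
        "config": "SelfRC",      # one of its subsets
        "split": "train",
        "offset": 150,           # index of the first row to return
        "length": 10,            # number of rows to return
    },
)
print(response.json()["rows"][0])
```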

Workers

Workers are responsible for executing the jobs in the queue. They perform the actual preprocessing work, such as getting a list of splits and subsets. The workers can be tuned with environment variables, like the minimum or maximum number of rows returned by a worker, or the maximum number of jobs to start per dataset user or organization.
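
Conceptually, each worker repeatedly takes a job from the queue, runs the preprocessing, and writes the result to the cache. The sketch below is purely illustrative: the queue and cache objects and their methods are hypothetical stand-ins, not the actual Mongo collections or schema:

```python
import time

def worker_loop(queue, cache, process):
    """Illustrative worker loop, not the real implementation.

    queue.pop_next() is assumed to return the next waiting job (a dict
    like {"type": "/splits", "dataset": "..."}) or None when the queue
    is empty; cache.set() stores the result for the web API to serve.
    """
    while True:
        job = queue.pop_next()
        if job is None:
            time.sleep(1)  # no waiting jobs; poll again shortly
            continue
        result = process(job)  # the time-consuming preprocessing step
        cache.set((job["type"], job["dataset"]), result)
```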

Take a look at the workers configuration for a complete list of the environment variables if you’re interested in learning more.

Cache

Once the workers complete a job, the results are stored, or cached, in a Mongo database. When a user makes a request with an endpoint like /first-rows, the dataset viewer retrieves the preprocessed response from the cache and serves it to the user. This eliminates the time a user would’ve waited if the server hadn’t already completed the job and stored the response.
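
The request path is then little more than a lookup. Here is a minimal sketch of the pattern, with a plain dictionary standing in for the Mongo cache (the key layout and the error message are illustrative, not the actual schema or wording):

```python
# A dictionary stands in for the Mongo cache; each key identifies a finished job.
cache = {
    ("/first-rows", "ibm/duorc", "SelfRC", "train"): {"rows": ["..."]},
}

def first_rows(dataset, config, split):
    # Serve the precomputed response; no dataset processing happens here.
    key = ("/first-rows", dataset, config, split)
    if key in cache:
        return cache[key]
    # If the workers haven't finished the job yet, report that instead of
    # computing the answer inline.
    return {"error": "The response is not ready yet. Please retry later."}
```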

As a result, users can get their requested information about a dataset (even large ones) nearly instantaneously!
