ClickHouse

ClickHouse is a fast and efficient column-oriented database for analytical workloads, making it easy to analyze Hub-hosted datasets with SQL. To get started quickly, use clickhouse-local to run SQL queries from the command line and avoid the need to fully install ClickHouse.

Check this blog for more details about how to analyze datasets on the Hub with ClickHouse.

To start, download and install clickhouse-local:

curl https://clickhouse.com/ | sh

For this example, you’ll analyze the maharshipandya/spotify-tracks-dataset which contains information about Spotify tracks. Datasets on the Hub are stored as Parquet files and you can access it with the /parquet endpoint:

import requests

r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=maharshipandya/spotify-tracks-dataset")
j = r.json()
url = [f['url'] for f in j['parquet_files']]
url
['https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet']

Aggregate functions

Now you can begin to analyze the dataset. Use the -q argument to specify the query to execute, and the url function to create a table from the data in the Parquet file.

You should set enable_url_encoding to 0 to ensure the escape characters in the URL are preserved as intended, and max_https_get_redirects to 1 to redirect to the path of the Parquet file.

Let’s start by identifying the most popular artists:

./clickhouse local -q "
    SELECT count() AS c, artists 
    FROM url('https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet') 
    GROUP BY artists 
    ORDER BY c 
    DESC LIMIT 5
    SETTINGS enable_url_encoding=0, max_http_get_redirects=1"

┌───c─┬─artists─────────┐
│ 279 │ The Beatles 	│
│ 271 │ George Jones	│
│ 236 │ Stevie Wonder   │
│ 224 │ Linkin Park 	│
│ 222 │ Ella Fitzgerald │
└─────┴─────────────────┘

ClickHouse also provides functions for visualizing your queries. For example, you can use the bar function to create a bar chart of the danceability of songs:

./clickhouse local -q "
    SELECT
        round(danceability, 1) AS danceability,
        bar(count(), 0, max(count()) OVER ()) AS dist
    FROM url('https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet')
    GROUP BY danceability
    ORDER BY danceability ASC
    SETTINGS enable_url_encoding=0, max_http_get_redirects=1"

┌─danceability─┬─dist─────────────────────────────────────────────────────────────────────────────────┐
│            0 │ ▍                                                                            	      │
│      	   0.1 │ ████▎                                                                        	      │
│      	   0.2 │ █████████████▍                                                               	      │
│      	   0.3 │ ████████████████████████                                                     	      │
│      	   0.4 │ ████████████████████████████████████████████▋                                	      │
│      	   0.5 │ ████████████████████████████████████████████████████████████████████▊        	      │
│      	   0.6 │ ████████████████████████████████████████████████████████████████████████████████     │
│      	   0.7 │ ██████████████████████████████████████████████████████████████████████       	      │
│      	   0.8 │ ██████████████████████████████████████████                                   	      │
│      	   0.9 │ ██████████▋                                                                  	      │
│            1 │ ▌                                                                            	      │
└──────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

To get a deeper understanding about a dataset, ClickHouse provides statistical analysis functions for determining how your data is correlated, calculating statistical hypothesis tests, and more. Take a look at ClickHouse’s List of Aggregate Functions for a complete list of available aggregate functions.

User-defined function (UDFs)

A user-defined function (UDF) allows you to reuse custom logic. Many Hub datasets are often sharded into more than one Parquet file, so it can be easier and more efficient to create a UDF to list and query all the Parquet files of a given dataset from just the dataset name.

For this example, you’ll need to run clickhouse-local in console mode so the UDF persists between queries:

./clickhouse local

Remember to set enable_url_encoding to 0 and max_https_get_redirects to 1 to redirect to the path of the Parquet files:

SET max_http_get_redirects = 1, enable_url_encoding = 0

Let’s create a function to return a list of Parquet files from the tasksource/blog_authorship_corpus:

CREATE OR REPLACE FUNCTION hugging_paths AS dataset -> (
    SELECT arrayMap(x -> (x.1), JSONExtract(json, 'parquet_files', 'Array(Tuple(url String))'))
    FROM url('https://datasets-server.huggingface.co/parquet?dataset=' || dataset, 'JSONAsString')
);

SELECT hugging_paths('tasksource/blog_authorship_corpus') AS paths

['https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet','https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet']

You can make this even easier by creating another function that calls hugging_paths and outputs all the files based on the dataset name:

CREATE OR REPLACE FUNCTION hf AS dataset -> (
    WITH hugging_paths(dataset) as urls
    SELECT multiIf(length(urls) = 0, '', length(urls) = 1, urls[1], 'https://huggingface.co/datasets/{' || arrayStringConcat(arrayMap(x -> replaceRegexpOne(replaceOne(x, 'https://huggingface.co/datasets/', ''), '\\.parquet$', ''), urls), ',') || '}.parquet')
);

SELECT hf('tasksource/blog_authorship_corpus') AS pattern

https://huggingface.co/datasets/{tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000,tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001}.parquet

Now use the hf function to query any dataset by passing the dataset name:

SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length 
FROM url(hf('tasksource/blog_authorship_corpus'))
GROUP BY sign 
ORDER BY avg_blog_length 
DESC LIMIT(5) 

┌───────────┬────────┬────────────────────┐
│  sign     │ count  │ avg_blog_length    │
├───────────┼────────┼────────────────────┤
│ Aquarius  │ 49687  │ 1193.9523819107615 │
│ Leo       │ 53811  │ 1186.0665291483153 │
│ Cancer    │ 65048  │ 1160.8010392325666 │
│ Gemini    │ 51985  │ 1158.4132922958545 │
│ Vurgi     │ 60399  │ 1142.9977648636566 │
└───────────┴────────┴────────────────────┘

< > Update on GitHub

Dataset viewer

ClickHouse

Aggregate functions

User-defined function (UDFs)