How to directly access 150k+ Hugging Face Datasets with DuckDB and query them using GPT-4o

Community Article Published May 31, 2024


Today, DuckDB and Hugging Face co-authored an announcement about the new hf:// prefix in DuckDB, which lets you access datasets in Hugging Face repositories directly. This opens a new wave of opportunities to make data more accessible and lightweight for the AI and ML sectors.

You can check out the full announcement here “Access 150k+ Datasets from Hugging Face with DuckDB”.


In the past, most data in a data warehouse came from within the organization, such as transactional systems, enterprise resource planning (ERP) applications, customer relationship management (CRM) applications, and similar sources.

The structure, volume, and rate of this data were fairly predictable and well-known. However, with the rise of cloud technology, an increasing amount of data now comes from external sources that are less controllable, such as application logs, web applications, mobile devices, social media, and sensor data from the Internet of Things. This data often arrives in schema-less, semi-structured formats. Traditional data warehousing solutions are struggling to handle this new type of data because they rely on deep ETL (extract, transform, load) pipelines and physical tuning, which assume predictable, slow-moving, easily categorized data from mostly internal sources.

Access 150k+ Datasets from Hugging Face with DuckDB and query the data with GPT-4o

In today’s tutorial, we’ll use DuckDB to load data directly from Hugging Face without downloading it to your computer, and use WrenAI as an interface connected to GPT-4o so you can ask business questions of the datasets and get answers. You’ll access Hugging Face datasets via the hf:// path, define semantic meaning through semantic modeling, and then ask questions about the data; GPT-4o will comprehend your inquiries and generate queries to retrieve the answers.

What is DuckDB?

DuckDB is a fast in-process analytical database that has gained wide adoption in the data and AI community; Hugging Face, for example, provides DuckDB integration for its Hugging Face Datasets.


Today, DuckDB is one of the most popular databases on GitHub and has strong momentum in the DB-Engines Ranking.


With DuckDB, you can point directly at remote files in formats such as CSV, Excel, JSON, and Parquet, and analyze them in place, without moving them out of remote storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.
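For instance, here is a minimal sketch of querying remote files in place (the bucket, path, and URL below are hypothetical placeholders; reading from S3 also needs credentials, omitted here):

-- One-time setup: the httpfs extension handles HTTP(S) and S3 reads.
INSTALL httpfs;
LOAD httpfs;

-- Aggregate a remote Parquet file without downloading it first.
SELECT COUNT(*) FROM read_parquet('s3://my-bucket/events/*.parquet');

-- The same works for CSV over plain HTTPS.
SELECT * FROM read_csv_auto('https://example.com/data.csv') LIMIT 10;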

Hugging Face Datasets


Hugging Face Datasets offers a wide range of datasets from different sources, including academic research, popular benchmark tasks, and real-world applications, with more than 150,000 datasets for artificial intelligence. These datasets are curated, processed, and standardized to ensure consistency and ease of use, democratizing the access, manipulation, and exploration of datasets used to train and evaluate AI models.


For this tutorial, we uploaded a Billionaires CSV file from the CORGIS project, which collects interesting datasets on topics such as COVID-19, billionaires, and airlines.

Check out the dataset we use in this tutorial on Hugging Face: the Billionaires dataset.

Let’s get started! 🚀

Using GPT-4o to query Hugging Face Datasets with DuckDB

Get the dataset URL from Hugging Face

First, check the Hugging Face Billionaires dataset here.


Read using hf:// paths

When working with data, you often need to read files in various formats (such as CSV, JSONL, and Parquet).

Now, it is possible to query them using hf:// paths of the following form:

hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩

For this example, you can use the URL below to get the dataset:

hf://datasets/chilijung/Billionaires/billionaires.csv
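Before wiring this into WrenAI, you can sanity-check the path from any recent DuckDB shell; a minimal sketch (DuckDB infers the CSV schema automatically):

-- Preview the first rows straight from the Hugging Face repo.
SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv' LIMIT 5;

-- Inspect the inferred column names and types.
DESCRIBE SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv';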

Installing WrenAI


WrenAI is an open-source text-to-SQL solution for data teams to get results and insights faster by asking business questions without writing SQL.

Next, let’s install WrenAI. Before we start, you need to install Docker.

1. Install Docker Desktop on your local computer.

Please ensure your Docker Desktop version is 4.17 or later.

2. Prepare an OpenAI API key

Please ensure that your OpenAI API key has full permissions (All).

Visit the OpenAI developer platform.


Generate a new API key for WrenAI with full permission.


3. Install WrenAI Launcher

If you are on a Mac (for Windows or Linux, check here), enter the command below to install the latest WrenAI Launcher.

curl -L https://github.com/Canner/WrenAI/releases/latest/download/wren-launcher-darwin.tar.gz | tar -xz && ./wren-launcher-darwin

The launcher will then ask for your OpenAI API key; paste your key into the terminal and hit Enter.

Select gpt-4o

Now you can select among OpenAI’s gpt-4o, gpt-4-turbo, and gpt-3.5-turbo generation models in WrenAI.


The launcher then runs docker-compose on your computer; after the installation finishes, it will automatically open your browser to access WrenAI.


Setup DuckDB connection

Once the installation completes in the terminal, the launcher opens WrenAI in your browser.


In the UI, select DuckDB and it will ask you for the connection details.


Here, enter a display name for the dataset, such as Billionaire, and in the Initial SQL statements field, enter the script below.

The URL is the hf:// path we showed earlier:

CREATE TABLE billionaires AS 
  SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv';
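One design note: CREATE TABLE materializes a copy of the CSV inside DuckDB when the connection initializes, so later queries don’t re-fetch it over the network. If you would rather always read the latest version of the file, at the cost of a remote fetch per query, a view is a possible alternative:

-- Alternative sketch: a view re-reads the remote CSV on every query.
CREATE VIEW billionaires AS
  SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv';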

And hit Next. In the next step, select the tables you want to use in WrenAI.


Select the table and click Next!


In this example, we only have one table, so you can skip or just click Finish; but if you have multiple tables, you can define semantic relationships here to help LLMs understand and generate more accurate SQL joins.

Now you’re all set!


Semantic Modeling on WrenAI UI

In this example, we only have one model (table); click the Modeling page at the top to see it.


Now click the billionaires model, and a drawer will expand from the right.


The CORGIS project documents each column comprehensively, and we can add those descriptions to the semantic model.


Adding the semantic context to the model


Data Modeling with Complex Schema

If you have multiple models, you can model them through the same modeling interface.


With the WrenAI UI, you can model your data within a semantic context. This includes adding descriptions, defining relationships, incorporating calculations, and more. By providing this context, you help LLMs understand your business terminology and KPI definitions, reducing errors when combining multiple tables. LLMs can comprehend the data structure hierarchy by learning through relationships, such as whether tables are related many-to-one, one-to-many, or many-to-many.
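As a rough illustration of what that context buys you, suppose a project had hypothetical customers and orders tables with a declared one-to-many relationship; that relationship steers the LLM toward a correctly keyed join, such as:

-- Hypothetical tables and columns: the declared one-to-many
-- relationship tells the model which keys to join on.
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name;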

You can define your business KPIs and formulas via calculations in WrenAI.
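A calculation ultimately expresses a formula you could also write as SQL. As a hedged sketch of a KPI over this dataset (the quoted column names assume the CORGIS headers; verify them against the actual schema first):

-- Hypothetical KPI: average net worth per industry.
-- Column names are assumed from the CORGIS documentation.
SELECT
  "wealth.how.industry" AS industry,
  AVG("wealth.worth in billions") AS avg_worth_billions
FROM billionaires
GROUP BY industry
ORDER BY avg_worth_billions DESC;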


Adding semantic relationships between tables.


Start asking questions

Now let’s switch to the Home page, where you can initiate a new thread and start asking WrenAI questions.

It will then generate the three best options based on your question.


Select one of the options, and it will generate the result, with a step-by-step breakdown and an explanation.

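For example, asking “Who are the ten wealthiest billionaires?” might yield SQL along these lines (an illustrative guess rather than WrenAI’s actual output; the column name assumes the CORGIS headers):

-- Illustrative only: the generated query will vary.
SELECT name, "wealth.worth in billions" AS worth_billions
FROM billionaires
ORDER BY worth_billions DESC
LIMIT 10;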

Ask follow-up questions based on the results


That is about it!

Now, with the hf:// path, you can connect to more than 150,000 datasets on Hugging Face directly through WrenAI without fussing around with the files! Pretty awesome, right?


If you love our work, please support and star us on GitHub!

🚀 GitHub: https://github.com/canner/wrenai

🙌 Website: https://www.getwren.ai/

📫 Subscribe: https://blog.getwren.ai/

Don’t forget to give WrenAI a ⭐ on GitHub if you’ve enjoyed this article, and as always, thank you for reading.

Original article from WrenAI blog