# Dataset Search Client Documentation

This notebook demonstrates how to use the [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API to search for Hugging Face datasets by their column names.

## Introduction

The Hugging Face Hub hosts a vast collection of datasets for various machine learning tasks. These datasets often have different structures and column names. The [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API allows you to find datasets that match specific column structures, which can be incredibly useful for tasks like:

1. Finding datasets suitable for specific machine learning tasks
2. Identifying datasets with compatible structures for transfer learning or data augmentation
3. Exploring the availability of datasets with certain features or labels

By searching based on column names, you can quickly identify datasets that fit your specific needs without having to manually inspect each dataset's structure.

## Setup

First, let's import the necessary libraries and define a `DatasetSearchClient` class, which we'll use to call the API (feel free to call the API directly if you prefer).

In [94]:
import requests
from typing import List, Dict, Any, Iterator


class DatasetSearchClient:
    def __init__(self, base_url: str = "https://librarian-bots-dataset-column-search-api.hf.space"):
        self.base_url = base_url

    def search(self,
               columns: List[str],
               match_all: bool = False,
               page_size: int = 100) -> Iterator[Dict[str, Any]]:
        """
        Search datasets using the provided API, automatically handling pagination.

        Args:
            columns (List[str]): List of column names to search for.
            match_all (bool, optional): If True, match all columns. If False, match any column. Defaults to False.
            page_size (int, optional): Number of results per page. Defaults to 100.

        Yields:
            Dict[str, Any]: Each dataset result from all pages.

        Raises:
            requests.RequestException: If there's an error with the HTTP request.
            ValueError: If the API returns an unexpected response format.
        """
        page = 1
        total_results = None

        # Keep requesting pages until every reported result has been yielded
        while total_results is None or (page - 1) * page_size < total_results:
            params = {
                "columns": columns,
                "match_all": str(match_all).lower(),
                "page": page,
                "page_size": page_size
            }

            try:
                response = requests.get(f"{self.base_url}/search", params=params)
                response.raise_for_status()
                data = response.json()

                if not {"total", "page", "page_size", "results"}.issubset(data.keys()):
                    raise ValueError("Unexpected response format from the API")

                # The first page tells us how many results exist in total
                if total_results is None:
                    total_results = data['total']

                for dataset in data['results']:
                    yield dataset

                page += 1

            except requests.RequestException as e:
                raise requests.RequestException(f"Error connecting to the API: {str(e)}") from e
            except ValueError as e:
                raise ValueError(f"Error processing API response: {str(e)}") from e

# Create an instance of the client
client = DatasetSearchClient()
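
If you would rather call the API directly, the request that the client sends can be sketched with the standard library alone. This is a minimal sketch, assuming the `/search` endpoint accepts the same query parameters the client above sends (`columns` repeated once per column, plus `match_all`, `page`, and `page_size`). The helper name `build_search_url` is hypothetical and not part of the API.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://librarian-bots-dataset-column-search-api.hf.space"


def build_search_url(columns: List[str],
                     match_all: bool = False,
                     page: int = 1,
                     page_size: int = 100) -> str:
    """Build the GET URL for one page of search results (hypothetical helper)."""
    params = {
        "columns": columns,  # a list becomes one `columns=` entry per item
        "match_all": str(match_all).lower(),
        "page": page,
        "page_size": page_size,
    }
    # doseq=True expands list values into repeated keys, as requests does
    return f"{BASE_URL}/search?{urlencode(params, doseq=True)}"


url = build_search_url(["text", "label"], match_all=True, page_size=5)
print(url)
```

The resulting URL can be fetched with `requests.get(url)` or any other HTTP client.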

## Example 1: Searching for Text Classification Datasets

Let's start by searching for datasets that have both "text" and "label" columns, which are common in text classification tasks:

In [95]:
text_classification_columns = ["text", "label"]
results = client.search(text_classification_columns, match_all=True)

print("Datasets suitable for text classification (with 'text' and 'label' columns):")
for i, dataset in enumerate(results, 1):
    print(f"{i}. {dataset['hub_id']}: {dataset['column_names']}")
    if i >= 5:  # Print only the first 5 as a sample
        break

total_results = len(list(client.search(text_classification_columns, match_all=True)))
print(f"\nTotal datasets found: {total_results}")

Datasets suitable for text classification (with 'text' and 'label' columns):
1. mteb/amazon_counterfactual: ['text', 'label', 'label_text']
2. dair-ai/emotion: ['text', 'label']
3. stanfordnlp/imdb: ['text', 'label']
4. 203427as321/articles: ['label', 'text', '__index_level_0__']
5. indonlp/NusaX-senti: ['id', 'text', 'lang', 'label']

Total datasets found: 1866
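
Counting matches with `len(list(...))` downloads every page of results a second time. Each response already carries a `total` field (the client checks for it), so the count can probably be read from a single small request. A minimal sketch follows: `count_matches`, `fetch_page_http`, and the injectable `fetch_page` argument are hypothetical names introduced here, and the sketch assumes that the API accepts `page_size=1` and that `total` reports the number of matches across all pages, as the client's pagination loop implies.

```python
from typing import Any, Callable, Dict, List

BASE_URL = "https://librarian-bots-dataset-column-search-api.hf.space"


def fetch_page_http(params: Dict[str, Any]) -> Dict[str, Any]:
    """Fetch one page from the live API (requires the requests package)."""
    import requests  # imported lazily so count_matches can be tested offline

    response = requests.get(f"{BASE_URL}/search", params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def count_matches(columns: List[str],
                  match_all: bool = False,
                  fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]] = fetch_page_http) -> int:
    """Return the API's reported total using a single one-result request."""
    params = {
        "columns": columns,
        "match_all": str(match_all).lower(),
        "page": 1,
        "page_size": 1,  # assumption: we only need `total`, not the results
    }
    data = fetch_page(params)
    if "total" not in data:
        raise ValueError("Unexpected response format from the API")
    return int(data["total"])
```

Calling `count_matches(["text", "label"], match_all=True)` would then issue one request instead of paging through every result.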


## Example 2: Searching for Question-Answering Datasets

Now, let's search for datasets that could be used for question-answering tasks:

In [97]:
qa_columns = ["question", "answer", "context"]
results = client.search(qa_columns, match_all=True)

print("Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):")
for i, dataset in enumerate(results, 1):
    print(f"{i}. {dataset['hub_id']}: {dataset['column_names']}")
    if i >= 5:  # Print only the first 5 as a sample
        break

total_results = len(list(client.search(qa_columns, match_all=True)))
print(f"\nTotal datasets found: {total_results}")

Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):
1. hotpotqa/hotpot_qa: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context']
2. neural-bridge/rag-dataset-12000: ['context', 'question', 'answer']
3. ryo0634/xquad-sampled: ['id', 'question', 'context', 'answer_sentence', 'answer']
4. lcw99/wikipedia-korean-20240501-1million-qna: ['question', 'answer', 'context']
5. virattt/financial-qa-10K: ['question', 'answer', 'context', 'ticker', 'filing']

Total datasets found: 646


## Example 3: Searching for Instruction-Following Datasets

Let's search for datasets that could be used for instruction-following tasks, which are common in training large language models:

In [98]:
instruction_columns = ["instruction", "input", "output"]
results = client.search(instruction_columns, match_all=True)

print("Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):")
for i, dataset in enumerate(results, 1):
    print(f"{i}. {dataset['hub_id']}: {dataset['column_names']}")
    if i >= 5:  # Print only the first 5 as a sample
        break

total_results = len(list(client.search(instruction_columns, match_all=True)))
print(f"\nTotal datasets found: {total_results}")

Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):
1. garage-bAInd/Open-Platypus: ['input', 'output', 'instruction', 'data_source']
2. HuggingFaceH4/databricks_dolly_15k: ['category', 'instruction', 'input', 'output']
3. chargoddard/alpaca-gpt4-500: ['instruction', 'input', 'output', 'text', '__index_level_0__']
4. vicgalle/alpaca-gpt4: ['instruction', 'input', 'output', 'text']
5. llamafactory/alpaca_en: ['instruction', 'input', 'output']

Total datasets found: 1937
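
All three examples above pass `match_all=True`, and the outputs show that matching datasets may contain extra columns beyond the ones requested. The difference between the two modes can be illustrated locally. This is only a sketch of the semantics described in the `search` docstring: the function `columns_match` is hypothetical, and it assumes exact, case-sensitive name comparison, which the API may or may not use.

```python
from typing import Iterable, List


def columns_match(dataset_columns: Iterable[str],
                  wanted: List[str],
                  match_all: bool) -> bool:
    """Local illustration of all-versus-any column matching."""
    present = set(dataset_columns)
    if match_all:
        # Every requested column must be present; extra columns are allowed
        return all(name in present for name in wanted)
    # At least one requested column must be present
    return any(name in present for name in wanted)


alpaca_like = ["instruction", "input", "output", "data_source"]
print(columns_match(alpaca_like, ["instruction", "input", "output"], match_all=True))  # True
print(columns_match(["text", "label"], ["text", "summary"], match_all=True))           # False
print(columns_match(["text", "label"], ["text", "summary"], match_all=False))          # True
```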


## Creating Collections for Common Dataset Formats

We can also use the API to create a Hugging Face Collection from our search results. Let's use Alpaca-formatted datasets as an example. Each record in this format looks like this:

```
{"instruction": "...", "input": "...", "output": "..."}
```


In [99]:
alpaca = ['instruction', 'input', 'output']

In [100]:
results = list(client.search(alpaca, match_all=True))
len(results)

1937

We now import some functions from `huggingface_hub` to create a collection.

In [25]:
from huggingface_hub import login, create_collection, add_collection_item

I have my `HF_TOKEN` stored as a Secret in Colab. You can also log in by calling `login()` directly.

In [102]:
from google.colab import userdata

In [103]:
login(userdata.get('HF_TOKEN'))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We create a collection using `create_collection`. Passing `exists_ok=True` avoids an error if the collection already exists.

In [108]:
collection = create_collection("Probably Alpaca Style Datasets", exists_ok=True)

In [109]:
collection.title

'Probably Alpaca Style Datasets'

In [110]:
collection.slug

'davanstrien/probably-alpaca-style-datasets-667eead1bad3a964ea580e04'

We now loop through our results and add them to the Collection.

In [None]:
for result in results:
    add_collection_item(collection.slug, result['hub_id'], item_type="dataset", exists_ok=True)

Since the results include some key metadata about each dataset, you can also filter them further before creating a Collection.
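
As one possible filter, the sketch below keeps only datasets whose columns are exactly the Alpaca columns, dropping those with extra fields. It relies only on the `hub_id` and `column_names` keys seen in the outputs above; the helper name `exact_column_match` is hypothetical, and the sample entries are copied from the earlier search output rather than fetched live.

```python
from typing import Any, Dict, List


def exact_column_match(results: List[Dict[str, Any]],
                       columns: List[str]) -> List[Dict[str, Any]]:
    """Keep only results whose columns are exactly `columns`, in any order."""
    wanted = set(columns)
    return [r for r in results if set(r["column_names"]) == wanted]


# Sample entries copied from the search output shown earlier
sample_results = [
    {"hub_id": "garage-bAInd/Open-Platypus",
     "column_names": ["input", "output", "instruction", "data_source"]},
    {"hub_id": "vicgalle/alpaca-gpt4",
     "column_names": ["instruction", "input", "output", "text"]},
    {"hub_id": "llamafactory/alpaca_en",
     "column_names": ["instruction", "input", "output"]},
]

strict = exact_column_match(sample_results, ["instruction", "input", "output"])
print([r["hub_id"] for r in strict])  # ['llamafactory/alpaca_en']
```

The filtered list can then be passed to the same `add_collection_item` loop in place of `results`.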