Datasets documentation

Command Line Interface (CLI)

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v2.20.0).

🤗 Datasets provides a command line interface (CLI) with useful shell commands to interact with your dataset.

You can check the available commands:

>>> datasets-cli --help
usage: datasets-cli <command> [<args>]

positional arguments:
  {convert,env,test,run_beam,dummy_data,convert_to_parquet,delete_from_hub}
                        datasets-cli command helpers
    convert             Convert a TensorFlow Datasets dataset to a HuggingFace Datasets dataset.
    env                 Print relevant system environment info.
    test                Test dataset implementation.
    run_beam            Run a Beam dataset processing pipeline
    dummy_data          Generate dummy data.
    convert_to_parquet  Convert dataset to Parquet
    delete_from_hub     Delete dataset config from the Hub

optional arguments:
  -h, --help            show this help message and exit

Convert to Parquet

Convert your script-based Hub dataset to a data-only Parquet dataset so that the dataset viewer is supported.

>>> datasets-cli convert_to_parquet --help
usage: datasets-cli <command> [<args>] convert_to_parquet [-h] [--token TOKEN] [--revision REVISION] [--trust_remote_code] dataset_id

positional arguments:
  dataset_id           source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME

optional arguments:
  -h, --help           show this help message and exit
  --token TOKEN        access token to the Hugging Face Hub (defaults to logged-in user's one)
  --revision REVISION  source revision
  --trust_remote_code  whether to trust the code execution of the load script

This command:

  • makes a copy of the script on the “main” branch into a dedicated branch called “script” (if it does not already exist)
  • creates a pull request to the Hub dataset to convert it to Parquet files (and deletes the script from the main branch)

If in the future you need to recreate the Parquet files from the “script” branch, pass the --revision script argument.

Note that you should pass the --trust_remote_code argument only if you trust the remote code to be executed locally on your machine.

For example:

>>> datasets-cli convert_to_parquet USERNAME/DATASET_NAME

Remember to log in to your Hugging Face account first:

>>> huggingface-cli login
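
When the conversion is scripted, for instance to recreate the Parquet files from the “script” branch, it can help to assemble the argument list in Python before running it. The following is a minimal sketch and not part of 🤗 Datasets: the helper name `build_convert_command` is hypothetical, and it only uses the flags listed in the help output above.

```python
import shlex
from typing import List, Optional


def build_convert_command(
    dataset_id: str,
    revision: Optional[str] = None,
    token: Optional[str] = None,
    trust_remote_code: bool = False,
) -> List[str]:
    """Build the argv for `datasets-cli convert_to_parquet` (hypothetical helper)."""
    cmd = ["datasets-cli", "convert_to_parquet"]
    if token is not None:
        cmd += ["--token", token]
    if revision is not None:
        cmd += ["--revision", revision]
    if trust_remote_code:
        # Only pass this if you trust the loading script to run on your machine.
        cmd.append("--trust_remote_code")
    cmd.append(dataset_id)
    return cmd


if __name__ == "__main__":
    # Recreate the Parquet files from the "script" branch.
    cmd = build_convert_command("USERNAME/DATASET_NAME", revision="script")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `datasets-cli` is installed and you are logged in. Keeping `trust_remote_code` off by default mirrors the note above about only enabling it for scripts you trust.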

Delete from Hub

Delete a dataset configuration from a data-only dataset on the Hub.

>>> datasets-cli delete_from_hub --help
usage: datasets-cli <command> [<args>] delete_from_hub [-h] [--token TOKEN] [--revision REVISION] dataset_id config_name

positional arguments:
  dataset_id           source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME
  config_name          config name to delete

optional arguments:
  -h, --help           show this help message and exit
  --token TOKEN        access token to the Hugging Face Hub
  --revision REVISION  source revision

For example:

>>> datasets-cli delete_from_hub USERNAME/DATASET_NAME CONFIG_NAME

Remember to log in to your Hugging Face account first:

>>> huggingface-cli login
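
Because `delete_from_hub` takes two positional arguments, a script that calls it may want to check them before building the command. The sketch below is illustrative only: `build_delete_command` is a hypothetical helper, and the check that the dataset ID has the form NAMESPACE/DATASET_NAME is an assumption of this example, based on the ID format shown in the help output.

```python
import shlex
from typing import List, Optional


def build_delete_command(
    dataset_id: str,
    config_name: str,
    revision: Optional[str] = None,
    token: Optional[str] = None,
) -> List[str]:
    """Build the argv for `datasets-cli delete_from_hub` (hypothetical helper)."""
    # Assumption of this sketch: IDs look like NAMESPACE/DATASET_NAME.
    namespace, sep, name = dataset_id.partition("/")
    if not (namespace and sep and name):
        raise ValueError(f"expected NAMESPACE/DATASET_NAME, got {dataset_id!r}")
    if not config_name:
        raise ValueError("config_name must not be empty")
    cmd = ["datasets-cli", "delete_from_hub"]
    if token is not None:
        cmd += ["--token", token]
    if revision is not None:
        cmd += ["--revision", revision]
    # Positional arguments go last, in the order shown in the usage line.
    cmd += [dataset_id, config_name]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_delete_command("USERNAME/DATASET_NAME", "CONFIG_NAME")))
```

As with any argument list, the result can be handed to `subprocess.run(cmd, check=True)` after logging in.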