Datasets

🤗 Datasets is a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.

Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training a deep learning model. Because the library is backed by the Apache Arrow format, you can process large datasets with zero-copy reads and without memory constraints, for optimal speed and efficiency. We also feature a deep integration with the Hugging Face Hub, allowing you to easily load and share a dataset with the wider NLP community. There are currently over 900 datasets and more than 25 metrics available.

Find your dataset today on the Hugging Face Hub, or take an in-depth look inside a dataset with the live Datasets Viewer.

Learn the basics and become familiar with loading, accessing, and processing a dataset. Start here if you are using 🤗 Datasets for the first time!

Practical guides to help you achieve a specific goal. Take a look at these guides to learn how to use 🤗 Datasets to solve real-world problems.

High-level explanations that build a better understanding of important topics such as the underlying data format, the cache, and how datasets are generated.

Technical descriptions of how 🤗 Datasets classes and methods work.