Datasets documentation

Load text data

You are viewing v3.1.0 version. A newer version v3.2.0 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Load text data

This guide shows you how to load text datasets. To learn how to load any type of dataset, take a look at the general loading guide.

Text files are one of the most common file types for storing a dataset. By default, 🤗 Datasets samples a text file line by line to build the dataset.

>>> from datasets import load_dataset
>>> dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

# Load from a directory
>>> dataset = load_dataset("text", data_dir="path/to/text/dataset")

To sample a text file by paragraph or even an entire document, use the sample_by parameter:

# Sample by paragraph
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="paragraph")

# Sample by document
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="document")

You can also use grep patterns to load specific files:

>>> from datasets import load_dataset
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

To load remote text files via HTTP, pass the URLs instead:

>>> dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")

To load XML data you can use the “xml” loader, which is equivalent to “text” with sample_by=“document”:

>>> from datasets import load_dataset
>>> dataset = load_dataset("xml", data_files={"train": ["my_xml_1.xml", "my_xml_2.xml"], "test": "my_xml_file.xml"})

# Load from a directory
>>> dataset = load_dataset("xml", data_dir="path/to/xml/dataset")
< > Update on GitHub