File size: 19,221 Bytes
1e6410a
90d43bc
 
90a0d81
90d43bc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58e8885
90d43bc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
44dd814
 
90d43bc
 
 
 
 
 
 
 
 
 
 
1e6410a
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
import streamlit as st
# %%
# Datasets installation
#! pip install datasets transformers
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/datasets.git

# %% [markdown]
# # Quickstart

# %% [markdown]
# This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate πŸ€— Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](https://huggingface.co/docs/datasets/main/en/./tutorial), where you'll get a more thorough introduction.
# 
# Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use πŸ€— Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
# 
# <div class="mt-4">
#    <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-3 md:gap-y-4 md:gap-x-5">
#       <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#audio"
#          ><div class="w-full text-center bg-gradient-to-r from-violet-300 via-sky-400 to-green-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Audio</div>
#          <p class="text-gray-700">Resample an audio dataset and get it ready for a model to classify what type of banking issue a speaker is calling about.</p>
#       </a>
#       <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#vision"
#          ><div class="w-full text-center bg-gradient-to-r from-pink-400 via-purple-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Vision</div>
#          <p class="text-gray-700">Apply data augmentation to an image dataset and get it ready for a model to diagnose disease in bean plants.</p>
#       </a>
#       <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="#nlp"
#          ><div class="w-full text-center bg-gradient-to-r from-orange-300 via-red-400 to-violet-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">NLP</div>
#          <p class="text-gray-700">Tokenize a dataset and get it ready for a model to determine whether a pair of sentences have the same meaning.</p>
#       </a>
#    </div>
# </div>
# 
# <Tip>
# 
# Check out [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course to learn more about other important topics such as loading remote or local datasets, tools for cleaning up a dataset, and creating your own dataset.
# 
# </Tip>
# 
# Start by installing πŸ€— Datasets:
# 
# ```bash
# pip install datasets
# ```
# 
# πŸ€— Datasets also support audio and image data formats:
# 
# * To work with audio datasets, install the [Audio](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Audio) feature:
# 
#    ```bash
#    pip install datasets[audio]
#    ```
# 
# * To work with image datasets, install the [Image](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Image) feature:
# 
#    ```bash
#    pip install datasets[vision]
#    ```
# 
# Besides πŸ€— Datasets, make sure your preferred machine learning framework is installed:
# 
# ```bash
# pip install torch
# ```
# ```bash
# pip install tensorflow
# ```

# %% [markdown]
# ## Audio

# %% [markdown]
# Audio datasets are loaded just like text datasets. However, an audio dataset is preprocessed a bit differently. Instead of a tokenizer, you'll need a [feature extractor](https://huggingface.co/docs/transformers/main_classes/feature_extractor#feature-extractor). An audio input may also require resampling its sampling rate to match the sampling rate of the pretrained model you're using. In this quickstart, you'll prepare the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset for a model train on and classify the banking issue a customer is having.
# 
# **1**. Load the MInDS-14 dataset by providing the [load_dataset()](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function with the dataset name, dataset configuration (not all datasets will have a configuration), and a dataset split:

# %%
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", "en-US", split="train", trust_remote_code=True)

# %% [markdown]
# **2**. Next, load a pretrained [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model and its corresponding feature extractor from the [πŸ€— Transformers](https://huggingface.co/transformers/) library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.

# %%
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# %% [markdown]
# **3**. The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset card indicates the sampling rate is 8kHz, but the Wav2Vec2 model was pretrained on a sampling rate of 16kHZ. You'll need to upsample the `audio` column with the [cast_column()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.cast_column) function and [Audio](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Audio) feature to match the model's sampling rate.

# %%
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset[0]["audio"]

# %% [markdown]
# **4**. Create a function to preprocess the audio `array` with the feature extractor, and truncate and pad the sequences into tidy rectangular tensors. The most important thing to remember is to call the audio `array` in the feature extractor since the `array` - the actual speech signal - is the model input.
# 
# Once you have a preprocessing function, use the [map()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to speed up processing by applying the function to batches of examples in the dataset.

# %%
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
        max_length=100000,
        truncation=True,
    )
    return inputs

dataset = dataset.map(preprocess_function, batched=True)

# %% [markdown]
# **5**. Use the [rename_column()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.rename_column) function to rename the `intent_class` column to `labels`, which is the expected input name in [Wav2Vec2ForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2#transformers.Wav2Vec2ForSequenceClassification):

# %%
dataset = dataset.rename_column("intent_class", "labels")

# %% [markdown]
# **6**. Set the dataset format according to the machine learning framework you're using.
# 
# Use the [set_format()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.set_format) function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):

# %%
from torch.utils.data import DataLoader

dataset.set_format(type="torch", columns=["input_values", "labels"])
dataloader = DataLoader(dataset, batch_size=4)

# %% [markdown]
# Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from πŸ€— Transformers to prepare the dataset to be compatible with
# TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
# with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

# %%
import tensorflow as tf

tf_dataset = model.prepare_tf_dataset(
    dataset,
    batch_size=4,
    shuffle=True,
)

# %% [markdown]
# **7**. Start training with your machine learning framework! Check out the πŸ€— Transformers [audio classification guide](https://huggingface.co/docs/transformers/tasks/audio_classification) for an end-to-end example of how to train a model on an audio dataset.

# %% [markdown]
# ## Vision

# %% [markdown]
# Image datasets are loaded just like text datasets. However, instead of a tokenizer, you'll need a [feature extractor](https://huggingface.co/docs/transformers/main_classes/feature_extractor#feature-extractor) to preprocess the dataset. Applying data augmentation to an image is common in computer vision to make the model more robust against overfitting. You're free to use any data augmentation library you want, and then you can apply the augmentations with πŸ€— Datasets. In this quickstart, you'll load the [Beans](https://huggingface.co/datasets/beans) dataset and get it ready for the model to train on and identify disease from the leaf images.
# 
# **1**. Load the Beans dataset by providing the [load_dataset()](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function with the dataset name and a dataset split:

# %%
from datasets import load_dataset, Image

dataset = load_dataset("beans", split="train")

# %% [markdown]
# **2**. Now you can add some data augmentations with any library ([Albumentations](https://albumentations.ai/), [imgaug](https://imgaug.readthedocs.io/en/latest/), [Kornia](https://kornia.readthedocs.io/en/latest/)) you like. Here, you'll use [torchvision](https://pytorch.org/vision/stable/transforms.html) to randomly change the color properties of an image:

# %%
from torchvision.transforms import Compose, ColorJitter, ToTensor

jitter = Compose(
    [ColorJitter(brightness=0.5, hue=0.5), ToTensor()]
)

# %% [markdown]
# **3**. Create a function to apply your transform to the dataset and generate the model input: `pixel_values`.

# %%
def transforms(examples):
    examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
    return examples

# %% [markdown]
# **4**. Use the [with_transform()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.with_transform) function to apply the data augmentations on-the-fly:

# %%
dataset = dataset.with_transform(transforms)

# %% [markdown]
# **5**. Set the dataset format according to the machine learning framework you're using.
# 
# Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader). You'll also need to create a collate function to collate the samples into batches:

# %%
from torch.utils.data import DataLoader

def collate_fn(examples):
    images = []
    labels = []
    for example in examples:
        images.append((example["pixel_values"]))
        labels.append(example["labels"])

    pixel_values = torch.stack(images)
    labels = torch.tensor(labels)
    return {"pixel_values": pixel_values, "labels": labels}
dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)

# %% [markdown]
# Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from πŸ€— Transformers to prepare the dataset to be compatible with
# TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
# with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
# 
# Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
# 
# ```bash
# pip install -U albumentations opencv-python
# ```

# %%
import albumentations
import numpy as np

transform = albumentations.Compose([
    albumentations.RandomCrop(width=256, height=256),
    albumentations.HorizontalFlip(p=0.5),
    albumentations.RandomBrightnessContrast(p=0.2),
])

def transforms(examples):
    examples["pixel_values"] = [
        transform(image=np.array(image))["image"] for image in examples["image"]
    ]
    return examples

dataset.set_transform(transforms)
tf_dataset = model.prepare_tf_dataset(
    dataset,
    batch_size=4,
    shuffle=True,
)

# %% [markdown]
# **6**. Start training with your machine learning framework! Check out the πŸ€— Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.

# %% [markdown]
# ## NLP

# %% [markdown]
# Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.
# 
# **1**. Load the MRPC dataset by providing the [load_dataset()](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function with the dataset name, dataset configuration (not all datasets will have a configuration), and dataset split:

# %%
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc", split="train")

# %% [markdown]
# **2**. Next, load a pretrained [BERT](https://huggingface.co/bert-base-uncased) model and its corresponding tokenizer from the [πŸ€— Transformers](https://huggingface.co/transformers/) library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.

# %%
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# %%
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# %% [markdown]
# **3**. Create a function to tokenize the dataset, and you should also truncate and pad the text into tidy rectangular tensors. The tokenizer generates three new columns in the dataset: `input_ids`, `token_type_ids`, and an `attention_mask`. These are the model inputs.
# 
# Use the [map()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to speed up processing by applying your tokenization function to batches of examples in the dataset:

# %%
def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)
dataset[0]

# %% [markdown]
# **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):

# %%
dataset = dataset.map(lambda examples: {"labels": examples["label"]}, batched=True)

# %% [markdown]
# **5**. Set the dataset format according to the machine learning framework you're using.
# 
# Use the [set_format()](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.set_format) function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):

# %%
# prompt: Set the dataset format according to the machine learning framework you're using.
# Use the set_format() function to set the dataset format to torch and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in torch.utils.data.DataLoader:

dataset.set_format(type="torch", columns=["input_ids", "labels"])
dataloader = DataLoader(dataset, batch_size=4)


# %%
import torch
from torch.utils.data import DataLoader

# %%
# prompt: Set the dataset format to torch and specify the columns to format:
# set_format(parquet_file, "torch", columns=["column1", "column2"])

dataset.set_format(type="torch", columns=["input_ids", "labels"])


# %% [markdown]
# Use the [prepare_tf_dataset](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) method from πŸ€— Transformers to prepare the dataset to be compatible with
# TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) as a `tf.data.Dataset`
# with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

# %%

tf_dataset = model.prepare_tf_dataset(
    dataset,
    batch_size=4,
    shuffle=True,
    trust_remote_code=True)

# %% [markdown]
# **6**. Start training with your machine learning framework! Check out the πŸ€— Transformers [text classification guide](https://huggingface.co/docs/transformers/tasks/sequence_classification) for an end-to-end example of how to train a model on a text dataset.

# %% [markdown]
# ## What's next?

# %% [markdown]
# This completes the πŸ€— Datasets quickstart! You can load any text, audio, or image dataset with a single function and get it ready for your model to train on.
# 
# For your next steps, take a look at our [How-to guides](https://huggingface.co/docs/datasets/main/en/./how_to) and learn how to do more specific things like loading different dataset formats, aligning labels, and streaming large datasets. If you're interested in learning more about πŸ€— Datasets core concepts, grab a cup of coffee and read our [Conceptual Guides](https://huggingface.co/docs/datasets/main/en/./about_arrow)!