{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "R0RYql3KYHY6",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3f1a9728-f92c-4f1b-cd1b-d5d5c188cc87"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'nlp-with-transformers'...\n",
"remote: Enumerating objects: 653, done.\u001b[K\n",
"remote: Counting objects: 100% (84/84), done.\u001b[K\n",
"remote: Compressing objects: 100% (75/75), done.\u001b[K\n",
"remote: Total 653 (delta 47), reused 15 (delta 8), pack-reused 569\u001b[K\n",
"Receiving objects: 100% (653/653), 62.41 MiB | 9.79 MiB/s, done.\n",
"Resolving deltas: 100% (335/335), done.\n",
"/content/nlp-with-transformers\n",
"⏳ Installing base requirements ...\n",
"✅ Base requirements installed!\n",
"Using transformers v4.41.1\n",
"Using datasets v2.19.1\n",
"Using accelerate v0.30.1\n",
"Using sentencepiece v0.1.99\n",
"Using umap v0.5.6\n"
]
}
],
"source": [
"# Run this cell if you're on Colab or Kaggle\n",
"!git clone https://github.com/rickiepark/nlp-with-transformers.git\n",
"%cd nlp-with-transformers\n",
"from install import *\n",
"install_requirements(chapter=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "93EZHicnYHY8"
},
"source": [
"# Text Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S1ESIABDYHY9"
},
"source": [
"Text classification is one of the most common tasks in NLP; it can be used for a broad range of applications, such as tagging customer feedback into categories or routing support tickets according to their language. Chances are that your email program's spam filter is using text classification to protect your inbox from a deluge of unwanted junk!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HII7N8gBYHY-"
},
"source": [
"Another common type of text classification is sentiment analysis, which (as we saw earlier) aims to identify the polarity of a given text. For example, a company like Tesla might analyze Twitter posts to determine whether people like its new car roofs or not."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "G33gWnRrYHY-"
},
"source": [
"Now imagine that you are a data scientist who needs to build a system that can automatically identify emotional states such as \"anger\" or \"joy\" that people express about your company's product on Twitter. In this chapter, we'll tackle this task using a variant of BERT called DistilBERT (V. Sanh et al., [\"DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter\"](https://arxiv.org/abs/1910.01108), 2019). The main advantage of this model is that it achieves comparable performance to BERT, while being significantly smaller and more efficient. This enables us to train a classifier in a few minutes, and if you want to train a larger BERT model you can simply change the checkpoint of the pretrained model. A _checkpoint_ corresponds to the set of weights that are loaded into a given transformer architecture.\n",
"\n",
"This will also be our first encounter with three of the core libraries from the Hugging Face ecosystem: 🤗 Datasets, 🤗 Tokenizers, and 🤗 Transformers. These libraries will allow us to quickly go from raw text to a fine-tuned model that can be used for inference on new tweets. So, in the spirit of Optimus Prime, let's dive in, \"transform, and roll out!\" (Optimus Prime is the leader of a race of robots in the popular Transformers franchise for children, and for those who are young at heart!)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q25GMEArYHY_"
},
"source": [
"## The Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t7hRmIdHYHY_"
},
"source": [
"To build our emotion detector we'll use a dataset from an article that explored how emotions are represented in English Twitter messages (E. Saravia et al., \"CARER: Contextualized Affect Representations for Emotion Recognition,\" _Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing_ (Oct–Nov 2018): 3687–3697, http://dx.doi.org/10.18653/v1/D18-1404). Unlike most sentiment analysis datasets that involve just \"positive\" and \"negative\" polarities, this dataset contains six emotions: anger, fear, joy, love, sadness, and surprise. Given a tweet, our task will be to train a model that can classify it into one of these emotions."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TofiY-T-YHZA"
},
"source": [
"### A First Look at Hugging Face Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vt8hW4yOYHZA"
},
"source": [
"We will use 🤗 Datasets to download the data from the [Hugging Face Hub](https://huggingface.co/datasets). We can use the `list_datasets()` function to see what datasets are available on the Hub (recent releases of the library deprecate this function in favor of `huggingface_hub.list_datasets`, as the warning in the output notes):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "vScH8C-oYHZA",
"outputId": "4eb411d7-df6e-4ce3-cb5b-5e2172017ba0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
":3: FutureWarning: list_datasets is deprecated and will be removed in the next major version of datasets. Use 'huggingface_hub.list_datasets' instead.\n",
" all_datasets = list_datasets()\n",
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
"The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
"To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
"You will be able to reuse this secret in all of your notebooks.\n",
"Please note that authentication is recommended but still optional to access public models or datasets.\n",
" warnings.warn(\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"There are 154069 datasets currently available on the Hub\n",
"The first 10 are: ['amirveyseh/acronym_identification', 'ade_corpus_v2',\n",
"'UCLNLP/adversarial_qa', 'Yale-LILY/aeslc', 'afrikaans_ner_corpus',\n",
"'fancyzhx/ag_news', 'allenai/ai2_arc', 'google/air_dialogue',\n",
"'komari6/ajgt_twitter_ar', 'legacy-datasets/allegro_reviews']\n"
]
}
],
"source": [
"from datasets import list_datasets\n",
"\n",
"all_datasets = list_datasets()\n",
"print(f\"There are {len(all_datasets)} datasets currently available on the Hub\")\n",
"print(f\"The first 10 are: {all_datasets[:10]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t4wG74KKYHZB"
},
"source": [
"We see that each dataset is given a name, so let's load the `emotion` dataset with the `load_dataset()` function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "qBWCiq_MYHZB",
"outputId": "52eb0e9a-ebf7-454f-f050-29e6cc3ea6e0",
"colab": {
"referenced_widgets": [
"f47afe22942345cfbca95038e4a9a29c",
"c668d1b3ef084d829ebd5eb18f5d76e5",
"814af2241829422ab8bae71941246a31",
"76db805ecc9c48f19eacb766867b2943",
"88741118e7a9423488417a535a5b23b1",
"e3e03d7e97f74f4f9e39c5e95bc584f4",
"79728fc0e1464ff589096f3fcbd34a55",
"ec36860384d24b55b2077b1adc1b5c22",
"546373ff3aec4c54a4dddaf5f79a024d",
"47d36af366e0497ba6eff8314ac7e968",
"9efb8abe081c4bfebb94d06cc4841301",
"198beb4149e74130ba5a2f15618a997b",
"73895c51f62943488fc5498e9954df2b",
"54e7904f153a493f811d63b2f03acbb9",
"1cb003f09aef4ff7a83e0a472f6a2355",
"23b1adaee41046c3aafa03ee99abe83e",
"826adf16768c4bea96a3f2bb1f3c2863",
"5bb1a479c17241d58e500ba8df0b32e6",
"cdd1562f367043eabf81d41000d9f015",
"0175b11aae8243a399a19ccdaf71e6f2",
"c005209050bc4633b289bc9476bdab3f",
"97944e6d3112415abbd436808754cee4",
"d4b2e25e50124c85a9fdbe3e4bcb51cb",
"e3ea0332447149eea8c36cb0aba03014",
"862725ce0c084a5ba20dd86dbc591f3d",
"85bde7a6dcf7438d8d51e82d1165d80f",
"1d41cf0e464e44e59eea7826fc572e3b",
"df923af4729243ca967298db35dc3153",
"321e2acd3575448581083496d4c8c228",
"a64ce83414f0410d8d652a32edd1aa24",
"e78b1cd627f14c16bcdaaa14af27af01",
"42aedb22734948c995e279d0bf222b28",
"4cc9ee042dbc45fe9edca86426bbcd0f",
"f1908fd926534703b716505649a0361e",
"6189d86f51a64a63bc88b78cc362a05a",
"adb93a25f66847408359b0c12aaa4a07",
"7c50c130dcde417f8903e913d5d8ba36",
"2a6cb4e76e1e497ba2cb77d143a8f9e6",
"1b6bd3e9c4374ef5abedf01754c12823",
"db61aab23c374a3695b36948de07b854",
"96721053df274945a91f6252e8d3988b",
"f8e00d5b9ba8438aa950275b719cb8ce",
"cc40a121d268416c899e4643546109af",
"dd7553c0f9d949dab86e767f414eaa61",
"4b6d1349b9a2470e9b736780041aeddd",
"5cca997c6acd4b6f9057b995e560cdd8",
"91a4c486f8684daf8e891c1a14a95b42",
"be7d5f24c6414eac8cdb5b473e7cf87c",
"e1512636dcbb410c83b3ccffdef69045",
"d6fdb27f38ab44e6a3f8524a91d7b41f",
"0961eae5128549319e5226b902a557ba",
"955e2168cb364381b986b74d3620d8be",
"a0177b17e42e492daa7c3cccef52cec6",
"fdc27730fb3a438cbba5f471f88bdeaa",
"9dddcb72e8744d44b51b6657ac4af1d0",
"8104b0070c924249a2acc2999f39ac11",
"9b68fa35cd4a4d569d6117190c2b36e8",
"544c2c5c11f04962802b3f2ca609233e",
"f80fc4f920bf45d2a57d35424fdb615b",
"f0b1171dc8584a01be31873d85d72f8b",
"7ba895b4ebef47e2b8cd194162f517fc",
"790797520d9344aeab6e7af63ea54106",
"ea035ca87e854c56b41b885998716536",
"7f055ee7fc544128ad1ad72cf5b69a42",
"f197e68f822348398374b55b053c49c5",
"daae055c36434a4bb3cbd77408ad64e4",
"880a3f168ca7467b9ccf8bec9a30d91f",
"5c827267693546bda76dfe8d4e0c271d",
"22a3659466344e67bc35ed83eca16c2c",
"5446bbd6b658455396c22f2f8b016643",
"341f2ab45a07463785f32d30c32583d5",
"0894d349a79747c4aaab7324572c47bb",
"2cad729e911d4a818cf87859a9a64d36",
"4b170fc372f3497195835e9bba356f6e",
"26bb5a7c3bc444ec87576854353b8edc",
"b8f218baab8544b2920842f59a4ad7a3",
"3d30ab3663394bb7804fabbc222fbffc",
"6d7198a28fdb4d2e9899d64a60fd9d11",
"1628330c73f34e55a260d098490ac847",
"175f5a9a8cb6459087b56c19891f4d66",
"873fc396ac5b4794b8136654bc5904b3",
"cfa78a05b93c43f7b06aabd5c0407b66",
"80bb58f5d11a4d5985850e23e8119e8c",
"2a25f3568eba448e8c384e4b5a574c1f",
"d4ba26d85d3d4181b0e90be0e1056688",
"938d62f90f684fe7af2068f3f6940a46",
"04f078df096345fbb2f0f64851b62ebe",
"92b134ed6f964ee099ced580c279c374",
"83e5e2b4103145709a7efe73b13bd8c5",
"24ad5e23c76c4c2a8b21db6a907e34a1",
"c1f0e60252a2442d842a27edaffb3959",
"0777bfa85afe499b84f65f26781aa51b",
"b518bee67f9d4dd1820fc3a453c0622d",
"56fe9638e0174b9c9dabd6c66e7755e8",
"4d6021340a874090865eebf3bf66a20a",
"14383dad05684aeea50c753f9572baae",
"8dc413c11f3942818008228d130019dc",
"9c62776d3d8e4d6ab6cbd7ea1d4cd7d6",
"85dcf3733a5a4267844ef86d5de9900d"
],
"base_uri": "https://localhost:8080/",
"height": 400
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/datasets/load.py:1486: FutureWarning: The repository for emotion contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/emotion\n",
"You can avoid this message in future by passing the argument `trust_remote_code=True`.\n",
"Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.\n",
" warnings.warn(\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading builder script: 0%| | 0.00/3.97k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "f47afe22942345cfbca95038e4a9a29c"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading metadata: 0%| | 0.00/3.28k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "198beb4149e74130ba5a2f15618a997b"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading readme: 0%| | 0.00/8.78k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "d4b2e25e50124c85a9fdbe3e4bcb51cb"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading data: 0%| | 0.00/592k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "f1908fd926534703b716505649a0361e"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading data: 0%| | 0.00/74.0k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "4b6d1349b9a2470e9b736780041aeddd"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Downloading data: 0%| | 0.00/74.9k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "8104b0070c924249a2acc2999f39ac11"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Generating train split: 0%| | 0/16000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "880a3f168ca7467b9ccf8bec9a30d91f"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Generating validation split: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "6d7198a28fdb4d2e9899d64a60fd9d11"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Generating test split: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "83e5e2b4103145709a7efe73b13bd8c5"
}
},
"metadata": {}
}
],
"source": [
"# hide_output\n",
"from datasets import load_dataset\n",
"\n",
"emotions = load_dataset(\"emotion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eztk7vGXYHZC"
},
"source": [
"If we look inside our `emotions` object:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "6aqSLnFvYHZC",
"outputId": "17ea651b-3ffc-44ad-8823-7a371deb5c59",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"DatasetDict({\n",
" train: Dataset({\n",
" features: ['text', 'label'],\n",
" num_rows: 16000\n",
" })\n",
" validation: Dataset({\n",
" features: ['text', 'label'],\n",
" num_rows: 2000\n",
" })\n",
" test: Dataset({\n",
" features: ['text', 'label'],\n",
" num_rows: 2000\n",
" })\n",
"})"
]
},
"metadata": {},
"execution_count": 4
}
],
"source": [
"emotions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0inKbwyWYHZC"
},
"source": [
"we see it is similar to a Python dictionary, with each key corresponding to a different split. We can use the usual dictionary syntax to access an individual split:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "FAcPeFZxYHZC",
"outputId": "4333be5d-cda5-47c3-dbb5-4effccd09fd5",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Dataset({\n",
" features: ['text', 'label'],\n",
" num_rows: 16000\n",
"})"
]
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"train_ds = emotions[\"train\"]\n",
"train_ds"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xFuE2sAdYHZC"
},
"source": [
"which returns an instance of the `Dataset` class. The `Dataset` object is one of the core data structures in 🤗 Datasets, and we'll be exploring many of its features throughout the course of this book. For starters, it behaves like an ordinary Python array or list, so we can query its length:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "oTWkkxlFYHZD",
"outputId": "dfa107ba-d858-4cab-9b48-93dd55551ca8",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"16000"
]
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"len(train_ds)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8VzPerykYHZD"
},
"source": [
"or access a single example by its index:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "4gzu4zNSYHZD",
"outputId": "8a98d82c-6284-4b07-b2a1-824c362abb69",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'text': 'i didnt feel humiliated', 'label': 0}"
]
},
"metadata": {},
"execution_count": 7
}
],
"source": [
"train_ds[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EALqyOFIYHZD"
},
"source": [
"Here we see that a single row is represented as a dictionary, where the keys correspond to the column names:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "PTm4nLwTYHZD",
"outputId": "3643f95d-edfe-4e2c-b64e-974cefdab77c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['text', 'label']"
]
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"train_ds.column_names"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "93NQ3IKwYHZD"
},
"source": [
"and the values are the tweet and the emotion. This reflects the fact that 🤗 Datasets is based on [_Apache Arrow_](https://arrow.apache.org/), which defines a typed columnar format that is more memory efficient than native Python. We can see what data types are being used under the hood by accessing the `features` attribute of a `Dataset` object:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "-fbWjAY_YHZE",
"outputId": "cc0d4fa9-7d5a-4386-a9c6-1e097afae2ff",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness',\n",
"'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}\n"
]
}
],
"source": [
"print(train_ds.features)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O-p-lNtBYHZE"
},
"source": [
"In this case, the data type of the `text` column is `string`, while the `label` column is a special `ClassLabel` object that contains information about the class names and their mapping to integers. We can also access several rows with a slice:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "_oBXSTMOYHZE",
"outputId": "5de0b59f-88d8-47c1-b6ad-afbeffbf2491",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so\n",
"damned hopeful just from being around someone who cares and is awake', 'im\n",
"grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic\n",
"about the fireplace i will know that it is still on the property', 'i am feeling\n",
"grouchy'], 'label': [0, 0, 3, 2, 3]}\n"
]
}
],
"source": [
"print(train_ds[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4ZzYe4fzYHZE"
},
"source": [
"Note that in this case, the dictionary values are now lists instead of individual elements. We can also get the full column by name:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "-fTQBg9AYHZE",
"outputId": "f12b9914-f39a-434e-8ba5-c6c56f952310",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned\n",
"hopeful just from being around someone who cares and is awake', 'im grabbing a\n",
"minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the\n",
"fireplace i will know that it is still on the property', 'i am feeling grouchy']\n"
]
}
],
"source": [
"print(train_ds[\"text\"][:5])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lEJMH1O-YHZE"
},
"source": [
"Now that we've seen how to load and inspect data with 🤗 Datasets, let's do a few sanity checks about the content of our tweets."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gENhOnKaYHZZ"
},
"source": [
"### From Datasets to DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pLLrPjf3YHZa"
},
"source": [
"Although 🤗 Datasets provides a lot of low-level functionality to slice and dice our data, it is often convenient to convert a `Dataset` object to a Pandas `DataFrame` so we can access high-level APIs for data visualization. To enable the conversion, 🤗 Datasets provides a `set_format()` method that allows us to change the _output format_ of the `Dataset`. Note that this does not change the underlying _data format_ (which is an Arrow table), and you can switch to another format later if needed:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "Mm2nTThxYHZa",
"outputId": "d0fb6785-4f5f-4e1b-d503-f68e162ad107",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" text label\n",
"0 i didnt feel humiliated 0\n",
"1 i can go from feeling so hopeless to so damned... 0\n",
"2 im grabbing a minute to post i feel greedy wrong 3\n",
"3 i am ever feeling nostalgic about the fireplac... 2\n",
"4 i am feeling grouchy 3"
],
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr>\n",
"<th></th>\n",
"<th>text</th>\n",
"<th>label</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<th>0</th>\n",
"<td>i didnt feel humiliated</td>\n",
"<td>0</td>\n",
"</tr>\n",
"<tr>\n",
"<th>1</th>\n",
"<td>i can go from feeling so hopeless to so damned...</td>\n",
"<td>0</td>\n",
"</tr>\n",
"<tr>\n",
"<th>2</th>\n",
"<td>im grabbing a minute to post i feel greedy wrong</td>\n",
"<td>3</td>\n",
"</tr>\n",
"<tr>\n",
"<th>3</th>\n",
"<td>i am ever feeling nostalgic about the fireplac...</td>\n",
"<td>2</td>\n",
"</tr>\n",
"<tr>\n",
"<th>4</th>\n",
"<td>i am feeling grouchy</td>\n",
"<td>3</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df",
"summary": "{\n \"name\": \"df\",\n \"rows\": 16000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 15969,\n \"samples\": [\n \"i feel rather imbicilic or at least complacent\",\n \"i was in the bathroom i had sat down to pee it was to make me feel submissive again per instructions\",\n \"i am thrilled with the way my skin and hair feel if you are like me you are skeptical\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 5,\n \"num_unique_values\": 6,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 12
}
],
"source": [
"import pandas as pd\n",
"\n",
"emotions.set_format(type=\"pandas\")\n",
"df = emotions[\"train\"][:]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YHm8qfzZYHZa"
},
"source": [
"As you can see, the column headers have been preserved and the first few rows match our previous views of the data. However, the labels are represented as integers, so let's use the `int2str()` method of the `label` feature to create a new column in our `DataFrame` with the corresponding label names:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "LiqggeJNYHZa",
"outputId": "581f6acb-381c-45fd-f5e0-cc76cb01e0af",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" text label label_name\n",
"0 i didnt feel humiliated 0 sadness\n",
"1 i can go from feeling so hopeless to so damned... 0 sadness\n",
"2 im grabbing a minute to post i feel greedy wrong 3 anger\n",
"3 i am ever feeling nostalgic about the fireplac... 2 love\n",
"4 i am feeling grouchy 3 anger"
],
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr>\n",
"<th></th>\n",
"<th>text</th>\n",
"<th>label</th>\n",
"<th>label_name</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<th>0</th>\n",
"<td>i didnt feel humiliated</td>\n",
"<td>0</td>\n",
"<td>sadness</td>\n",
"</tr>\n",
"<tr>\n",
"<th>1</th>\n",
"<td>i can go from feeling so hopeless to so damned...</td>\n",
"<td>0</td>\n",
"<td>sadness</td>\n",
"</tr>\n",
"<tr>\n",
"<th>2</th>\n",
"<td>im grabbing a minute to post i feel greedy wrong</td>\n",
"<td>3</td>\n",
"<td>anger</td>\n",
"</tr>\n",
"<tr>\n",
"<th>3</th>\n",
"<td>i am ever feeling nostalgic about the fireplac...</td>\n",
"<td>2</td>\n",
"<td>love</td>\n",
"</tr>\n",
"<tr>\n",
"<th>4</th>\n",
"<td>i am feeling grouchy</td>\n",
"<td>3</td>\n",
"<td>anger</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df",
"summary": "{\n \"name\": \"df\",\n \"rows\": 16000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 15969,\n \"samples\": [\n \"i feel rather imbicilic or at least complacent\",\n \"i was in the bathroom i had sat down to pee it was to make me feel submissive again per instructions\",\n \"i am thrilled with the way my skin and hair feel if you are like me you are skeptical\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 5,\n \"num_unique_values\": 6,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label_name\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"sadness\",\n \"anger\",\n \"joy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 13
}
],
"source": [
"def label_int2str(row):\n",
" return emotions[\"train\"].features[\"label\"].int2str(row)\n",
"\n",
"df[\"label_name\"] = df[\"label\"].apply(label_int2str)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wBiHQE4FYHZa"
},
"source": [
"Before diving into building a classifier, let's take a closer look at the dataset. As Andrej Karpathy notes in his famous blog post [\"A Recipe for Training Neural Networks\"](https://karpathy.github.io/2019/04/25/recipe), becoming \"one with the data\" is an essential step for training great models!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FlP6l7PkYHZa"
},
"source": [
"### Looking at the Class Distribution"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mJYJZdMqYHZa"
},
"source": [
"Whenever you are working on text classification problems, it is a good idea to examine the distribution of examples across the classes. A dataset with a skewed class distribution might require a different treatment in terms of the training loss and evaluation metrics than a balanced one.\n",
"\n",
"With Pandas and Matplotlib, we can quickly visualize the class distribution as follows:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "yr8LXctKYHZb",
"outputId": "94328eb4-bb08-4137-9ac9-ec936bc73a16",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 382
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"[Figure: class distribution plot]"
],
"application/pdf": "JVBERi0xLjQKJazcIKu6CjEgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDIgMCBSID4+CmVuZG9iago4IDAgb2JqCjw8IC9Gb250IDMgMCBSIC9YT2JqZWN0IDcgMCBSIC9FeHRHU3RhdGUgNCAwIFIgL1BhdHRlcm4gNSAwIFIKL1NoYWRpbmcgNiAwIFIgL1Byb2NTZXQgWyAvUERGIC9UZXh0IC9JbWFnZUIgL0ltYWdlQyAvSW1hZ2VJIF0gPj4KZW5kb2JqCjExIDAgb2JqCjw8IC9UeXBlIC9QYWdlIC9QYXJlbnQgMiAwIFIgL1Jlc291cmNlcyA4IDAgUgovTWVkaWFCb3ggWyAwIDAgNDIwLjg4NzUgMjcxLjQyNTYyNSBdIC9Db250ZW50cyA5IDAgUiAvQW5ub3RzIDEwIDAgUiA+PgplbmRvYmoKOSAwIG9iago8PCAvTGVuZ3RoIDEyIDAgUiAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeJy9V01vEzEQ9dm/wkc4MPHM+GuPbYFK3EojcUAIhTQNrbZAG0HVf8/YSXa9kN1woYkcrd/afs/jmfFk9nr162a5en9+as4u9azvLTcaza20tbHmVtqjQXMuba2t9O60IwspRS+dtu9QRHDkA3lB7bD7VetrPTuRJTYy6VzHtJvjIckAWRIZQoW0FeIi+JShblaHlHXvzR/LMTtIhgghBvOwMh/MN2G14BwjR/KukU7cPlkyD7JPPfG242UPNmS1iATyDvdI2yOewbtabAcUrRfmv6vtmCOBLcZFH0CGU3R7rK2xxoK3teQOeHbJaBM0qWhuHNjIbHuwHYAUwYVadY88v2znoXFZNtkIYtMoBHuwHYDB5aioZXfI88tO2WmzbBZvkEYudWBbg2QReBiDHfLssgkbSCUOuYkQnQuOO7AdgJyAB9HYI53s2Qlt09JaUpykOxH9mJNURv5OVB1AwOXU9Klkykd9L7/WvLKyRsxJIWKJJIhmeadP53r2FsVDzfy65ND5lf5oXij70nwy83f6zVxf6EKsMQv07JPrOWtsnBaJJThIzG4Tp2PUqGz+HhAg5kvYhJAqARU2IUA8x3NymDAc3zuNCSAv0ZJ89NQLqLFxAeQku1nnLZFHPiaAxwQwBogeE4deQI2NC2AJChetT4HZ4zEBblRADEAYm8oAFTRBHyw0HEK0cgZHPcCP0e9d3Dngkh3EobkCDppebjnfcL4khTbnswnmjfqpHtQPaTfyvBpXkBAo1Ap2wCEFTg4IS7khpop2WkGrvqtfU8yICYhq6j1ykDvKBYWUB0nREqa5r9VKLdTDBLf3gGnAvUMOHzo0oewbvYVA0+QL9U2tRcAUfUOAbkC/Q0YOHj2XvadclR07+YW6EgkredqMSyBqAAeut0cOSfBOXmUDEDmptaYV3MrJP9XM9zqv8SqvJjGeQwutbDdsvXnShxbqi+ykVZ9lRwt1N/QnMu+25XO5T4bl9fit0lW2+vKvmvjuUE0s4/6hmK5H7aaOrWaL7u1diOUmXA+SfJTiMDkuq3OZVtlISpnKSsVMb18a+ScQxH8S549A4ny5KrByr+VPo1+IGe8lKazEkEv1pIwc0rX8nhUjb3KSKO6i98a90L8B6J2IiAplbmRzdHJlYW0KZW5kb2JqCjEyIDAgb2JqCjg1MwplbmRvYmoKMTAgMCBvYmoKWyBdCmVuZG9iagoxNyAwIG9iago8PCAvTGVuZ3RoMSAxMDc2MCAvTGVuZ3RoIDczMTQgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnic1XoLeBRVtu7ataqq393Vne48O0nnnfBK7BggEKRFCE8xQECCgmlIQkAggQAKgQmiJKIoUSAIAxgVFAIyARESjAhjRkBkZjyCx7nq9QGKngnIzImvEHbOquoOBOace+d+93zf/W5V
dtV+77X+tfZaa3cFGAA46CGCZ+TwEbkwEEYAsD5U6x6Zd9+kpGd7n6LyKAAhfOSkycM2Ne1uBsBCam+/9+78UZZPBvwrDc6nPu77JqV75/xa9gSAVE7tU2bN95cLJaZYKqtzmGctXeyBOdHZAPI5KvOS8tnzF965dC6Ansqwd7a/ohx0dINhFZXNs+ctK4HfN79H5XUAYRNLi/1F+um/rwbwfEnt/UupwvKi7nmAuHAqJ5bOX/zoa5G2D6icQ+W6eWWz/LO2T8+lsjr/+Pn+R8vFDfJCgPhEKnsW+OcXp1we0kLl4UTPufKyisVftfygp6nU+arLFxWXD9b9TZ26nngoBRUrMwQugW4ED9XFQRLl1TYj9IMcEIbnjssH6zz/4gUQTrjS1dUFcCOn9mQPFy9aAHo1pyWRZlDfehBYvtqTzWXbwAAO+KevruHas1Z7Pqw9j6hvtSbQ9n9/dX323zNPjxk/6vqs66PumdX8f/8ahKuBZGMCO+EZDhHghlhNciYI1fDvvlSJ3rxEkLS3rD11JBtGs4TRU5UVBnvIWosBVOnraUZVPySwgBVsoNCKAb3YAfvAqenFcv8i/0yo9S+avwBqZy7yz4HaWf4FFfQsLV5Ez2WL5kHt7OIyys9eVPww1Jb6F1Cf0uKZVPOwf4Efauf5yzzqk/Trifn+xaVQu+BhtaZstn8+1C5asoB6Li5ZMJueper8/4UOasjMmzPbf4seihDQQwb9tbdEHDkJsXhIJT1HQjAUErW3ExIIiVBqESgfR08HjWDaU+X5S/gO8iDHDLrPNXBX0/KFBM75ILyFtwqpu8xKg/nJ/4RgrTTv0v99P5gW6Ku+b+R7zHFL/bQe5UU388JAgGvrKZ9zY2gqYaNoeIUG8WNUIwZbBU0XUsVUUK0GPSlHPUQrq1U1S8qUttCQmMAb/xVKBNrtgklG1IuCEJBDjyuvZEQR+Ehvl8lO7mRbdfPZhVv6YDC5QUUf4AkqMa0sAmFKNCpUI9MMWXAP5MJDMBvmQgUshWWabnggI1jvhzkwD5ao9V0Xus51nez6Q9frXfu69nY1dO3peu1WunpcAdv4ZLBk1OYMJJWuDEpZENDBe4JJ7ZMbTCZKDwWTOpOf0mxKFkpzKJGfID0DogyIapUfIBoBVPnHUFoWTB6WBU1whu4T0ADb2KtUKqH6hVRTLxyENTSqCd5lZ9haoS/VvQpX4SPqWQNnsEEENgYyqRbgL5IA7WSND9Ec2czJsnUybY/x4iFxotgkXhLPwgCxQjwrFooVLBNflqZIr1LKxj+QHE+ThWliXxCdR/F7zMQWcbhohS/wLDbAN7SKisEZWA87oZJocbIyqBIqhYlUc1I6C1vpLqP2s2wH+4ioO8oeh/PwAorCKNjBzhNfZ+AneBzzhSqCNlMoIfpP0lxnafxWqCCzdJ4ZgQu9qY6op7Vmas9o7Cud1+6rUEUr58NOuUl26hJoFRWxV9m7rE3eAPXwET6IC/FTtkZMEHeLo2B9AAEshPU091Z1jFzClhHv6l2pzi48IhayBvheLNTNpLn/oHJEax4SJhJHJdBC6RFZIZ4GszW4lihVW6PhrG6MmE7jaQbdSuIaoAyzSNZl1L4fDkJfrIP1NJPGrzxA+olGbhO/Ip7Xs2eEn+AsDoc0KBGvqDpB6lEHcEQnSyIKDPp4lEYhaXRRo2/CVM+pgri+fW4rehSdpxHyGi3LPE1dXXlTxSipoFFyN2KSvlFMSvjqv2r8qm+fsXlTPY3XRwwPzjqicDjVTZpKWbVE1VQ/YrjWpi7aKCXR3+jCRs+sUs9TylMJg55Sigf1DdgJ8vq0A5FyNV0XxPUkHRP5mQRfiFzvgHrzc4514Qa3LQbdrqhwpbOtve0OUC62tylXMli8YFccmV6HXRFSvGBXICFefQpPb9u+nf62b7/GDPzna9f4z8wg5fGz/ANKZ1km3XeyzHpewat5Da9gz7BlbDl7RrUnX9EWnUZW2gg+n2sY1otC
vfSYDuoN+ljZjRDLTMq5sY22/KnN1Nk3sKCttZMISm/ztreda8sgDAri2SEb2kRh+oA4u5SVlGmPc8VxNoZvYcXvszGdOxvEilFNozrON9AEJC9xDHHshh2+lIjIKAx32yUR7JIkDlNesm+01DufE2nfgmIUmNEdpqAcrXSObXTlj20MzX9gbKMz/wGiBLuODyxoPdd2/LjdkR2kpl2jRqdIl3XSZdboVuxh2USbzztZnCJN0S0Xl0tLo2oidLSrI8RIEq97MSyVl0RWRC12r4bqiNWRq6NWu3fD7ij7dJieRExk9YcBd7GsO5MT4mVd1l0s0yu6nLJOBjIlJzrHEYyZ/ntfq37oo0eXn5v6HXOOeCCCtzc0NDzCnhs0f/PoR+qG3fPBHd7vfv/grvJo/lfifhvJu4K4T4VyXz9whRirDbHVnpB6l6XesEF213s2JDwnr3O9khbqDgF0RriTPYobnbEGOU0FITS/m3+Dxj8B0N5GXBICStvF9ottyrdXFO0mVDKYz1AU44/1e4riRJjOYpjLKcbFJ6dkxRAj/Ymr3iwrkLmFPRz63Cv8z/y7GSfn5p+af+xk8679hzfteOWFSccWVZwu+JaZn8Wk2Nbaz/+elPTuHd669U9sevWR8orKxORDHs+HB1fsVTW8iKS8k3RKIGv+mC+aWdACiJZhgCZdvcTwMQMzG8Et60WzVflsbKOJGLNojJlVxs7ltLZ57apcL57LafMSL5pgxdMk3NOqSHuZoBeMggJyFI/AU6ALZb0hmfXG/mw8u898n2UKK2FL2HJcwywkSgOLw0x7pivBnmCPy0KZC4xn8fPnT1+fISV1XsCznZm7eT0rfJcktIMkVESUR8MMX4IYqbNXK9GR9TpnvbLWItTDY5Z1up0xYW5mRDcYFTlG6WQ95aKo5Ad3i6LuFhKR0npF3cDqDibx8NaAdEJIv+wq5uCiaKqnWFRpfI4R1+v7TO3TwRL5Of7DjHdLpx1/+PX33399wkv50vkG/rzNxq/829/4jx7PmTsyDm/bdjgxmaiv6LogpZB+RUB/X6TlJet+4yY7ewn2i5vCnrOvi9RFWCDDqUSSUfEGjYpqVn66knHIFhUbJbDpzNWPJXhUkxLn7T/AZb1RCJVSSi6t7gJ+lSkMVl8qmXv5Cf46X86q2aTqy9LM8w/N4Cf5J/wv/OSMhz4aNYq9yGazUvbiSKLqNGlDAmFqgDSfQ94kCpvgMfF1vcR0BKJoVKk516puYs3GHbSZGJuuwaOmhNM47Xq5kHe98X2V8VEN1weQ1epq49Gik+8jK2p7E/ZQjC4qrTQHWSRG+1Z0XvuE71u/XtXFSpJoX/JZRopnW8jmxJrCDFbYEyY3W+2e6tij7uaEJvu6MDOEYbjFoDfFot45IpmI+uBcm5e0UFXD1ovtnUTce+rOyrZnk/h8CzKiM2IyYjM8GXEZ8UNTfNG+GF+sz+OL88XnRefF5MXmefLi8uLzUspT1kTXxNTE1nhq4tbE16bUp1xNieke2j2oe0BhTGFsoacwrjymPLbcUx63KmZV7CrPqrjw6SSbeNnlDCUdGcIG2BOySDbxyVl39s+M67mDQ4VjX+x7rGxLc1PT0JYn9525fo0Jr20uPJxffGzav18VMksqZ1b85VDauOuPNZT4T7z89nFH1dP9+jWkpHSqWB0lrHbKTvJHbhjoi8Bms83QHO5aZ2uK2hwBDsfIcLOsj8yNVkXmbdfcwMX2VhWYjMOFMati6mOQ6NTICZDKNA0nb0W0ptAGzMRvXnv++dfUdP3ZQQcqP6DTyQeVBwY1NwvpZy5dOkNJmFjk5y38F7pb/EW7iRoGC7su4CWSYQQM9UVBNXtStFZbnjQ228XmsCZVsR0WGOUcQYp9sVuxFd5+RfnxSobPZItSolZF1UbVR0lBBe+mTlPw+KCC46Xx2/PeeO+9N/K2j7931/Tr/GPWl8mTXxaz9vXufeHs2Qu9ezckJhJDVuZgg9SzEFEl
TiP6lABakc1gdTZL+nXWJrYZw0TQCyPtDtOIaM2He7030Gq9BS17ZlCY5NqBBMg00gIyxZebmgYdWHGmC7rOrDhw/SThtns3YYeHhRm/tu0u8rPhTE/3cD93BeEL0lVFaDkhivxMIriYoVr/pOTaw6RmM3srvNnRZF7njnIJepcexgoO2wi3RmKr5ktV8C5qXqRdM1S+tKHR5dH10X+OvhotDYWhbKgw1DU0SuqjS9enG/oYy6CMlQllrrIow/SFKsBxmiG7aTxIAXQa6DqxqvOg+eyRuSdnzvrzw7ydn2RpnV8zXZOw68mtzVZhxrRjJ++8c3+vPmwgM7IQdg//vHXzof071CglndTzF8I6BAp8bklhZv0emdXAZqvcYhRCdKAzSHqLzTTOqfpJo2p+Tar5Hdto1fKqWc5p7cxpbXVoG/qit5NsDXkWcpOHfa48V70LiXQiMpppVodkkqluLuGXxln3snT+YXNj4/63ZeeWvNJZ6zvT8cP149/aq2LNp4jTCGsT+fQxvoQIc7TBUR0S2mzD5uSEppQWQ7Pt7cjo5AjQm0fKDodnBDny1m51aL0YUAh+XkU6m7Si16pe9b1u20NhinDTPwxhQVWhKDA0LIsOIrs2bdy1a+OmXU2cd/j3TZiwY+Kbh7IPrvhjZ+cfVxzMbhKGnPrss1MnP/vsr/xr/n10zBt9er39zgOzZrJBDJnIBs2c1aDu/BMEMp00yZ7qoLfPKh8TD0CLIDG9CLl6pZO8sIpaZxvtJ8XgM+QZCg3lBtpPIaS9qms90USXWHitXnZ+T/N1fcqnaPOZwAbDfW6ToAPrMbOuRnobWswHFL0iyfdZmN4MuYo2+8VsRzCKaVMCYNBCdp89z15oL7cHFnLKwYglsOArb+beMWectuq6j49v82+RU78nTbnJSfwR2CwwPeSKSiBkzfBZFMkn5UmFUrl0VZID5BPpsvPXNlXLjtKpJJqkGQ/TfMmywxBuAzla5zLXRHuwKaolQtGB3abXy3l2vS3PHU6mMEE1hZ2dnW2BODQn52K7Frw4woiDkIzEvMTyxNrEerrfSfwisSvRQLLV7LSLcAsIuWcm06U1imkjjq/+3bHmRUvWv9q86JFnXm1uHtq4bPleXLti6Y9fX39Q2PHStmM7r9cIO17+7TuvXK8RC/fPnrkiyIFYRByEUAQQgQZAK5NrrPYmc4uRCXoYr3q4XKdKtGa+c1T904g9VOj6k0uNAP4zcoqaVqzYtK+5edgbS068J+xUCXhxh0oALVxc9EPQ5izR9kEY7YMQudkBzeYm9STjsE1Ah2vEbScZX8LQiEqolKt0VfoqQ5WxylRprrJUWatsVUqVvdJRH3E1wt7D55E9vOXAU7Fx395NG/bt23CVOfiVq3/jPzA7fnHp9OlL3506+f02foq38ctkYLLJjjjZQNW30U7dSRSq1vouX1S3tW6yrmNvY0s0WeqRms3u4d0oaOs22D5DwGJ/GSOy6Uk3oAm6tltcXkVz803PJgzs9ne7r++XjQ09fBv7a7fJvsWOaNR1e94m27qotyNaojW/O5I8cA9v0k3de7dR19OBsB6OhayFPYGld/sQoeKmZxnU1HTD/17f38OtFDX8+lNAq3AMUWeHDJ9TNtEuMGGNtcnQojPKetDnOlSjplkI8iLnPlDdxqG8kBdDVH0K+NubyhSGY2JH99n2GqF0dE1IPzcectjPHLt+kFSpZJYk0Wpl5O1P0mopcMmXYzELVtOk2Bi9QdAZJ8XGxgwzmmJiRRdFAWtFZ7VrbbgaBSRRFJAaYzTFRulgYpTeqtM740ekqlSda7uompPs7O6w4Ec1LFA1XjtTWC/T+UmnPQviD0IKhZ6++W6j2+Q29yPn1sfUxzzYMNg42DTYbPKAhyUKqcZUU6+QdGe6q1doakxqbJonLS4xpdpYbao2V1vUX/CZIMhG2YRmtKAVbahgBEZiFLrFaENKetrQtIfSqtJWpdWm1addTQunA8rC
m1FJrHZOkxN6HgjSCUPVF1GM8vT43dPWrp25cWjrrp8/mfbuvJL3/KvXFe/17X3hyz+WHBKH7k9Nzc/3jY6z9tqydtvhhIRjWVkFE8bmJdkSN63esS9G1bT9tFOnahbCCYN97ps2Yp2RtTibzGQhnKbxZCtyXaqqZQfketF7w1CUuY6rhiKEApeApt2IYJLZftVQvN7UdM+BJSdOsT+xo8Kr1/0vvnhsp1B5rX5fyayruFu1UkPISlWJhSDDNV8K2kVJFOxMkNQXCjLIzA4gDxMQ3pFkCQUmiaBTf4XQHDsEHLszX/1VQD10gXYYDgv8DpB940cARR9M0mX1zPjsKGGuUClUCdXCKuE5YaegVxcyoIF0ycUiMVJMpjNkGqaJHn0WZLFBOEjM0OdCLhuNo8VcaZTs00+BKawAC8Q8fQmUsDk4R5wtlcqF+iWwmFVipbhEWi6vgTVsLa4V10rVch3Usc3CVnxBfEHaLO+WXpMb9cf1X+i79HeR2EMyDSyTJQx5l81gM97lD3aIhZ35uO9avYoQ2QMVIRt72nePTi8Y7GAz2k1GAJvVbgObxW62gPqyWowmo9luMhmHWUwGBUxSDb5tNbUoVovZaJAR9DbRZlIC6I1t1GuImboh7I6NWlvtYdoOIbsXPOH8JzBqb+lymFfF86oMkl42oCXUGGZRLAmWLMto433G8ZZphmnGucYayyrLBovDCESESTKbrCZbGHMJiqhIYUanyWmOtEbaUiCRdpRH9Ehp+lRDkjHRlGhOsfSy9rJ57ANIBllChpghDTT2N/U3D7RkW7NtGfa7wcd8gg99ok/yyT6dTz/MMMI40jLaOtrms+fDBDZBmIx5Yp40RZ6sm6K/33C/cbJpsrnAWmDLs5ewEqHUOMc6x1Zor9Q/an3UthaeMqwxrTGvtay1rrVtMWwybTJvtW617TTtNO+17rU12v9k/8LeZS8miUlWFghBhjKmCk/YMH7jig3zxuVnxvHBATGWnlq+dVR1vji+cyPOC0auUgPFI4kwyheSrAWq5rhwS4zebo5TnOOSKPxp9aqhqZKjxqd0iPXZDRb7HocQWQPhm+VYR4vJlp7zrdfLc654KWj1ZtwSqN4MVger9Tq1QbUWUkN35Mp1WvC6v3FWSjL79ZYotjuS3ZKaWjorENGmq7/QEb0uOm1N9NmjciFMH2pzino9hhrlcZE36eU55CN9Dj3uAaXGGn4s9IB1swFaJKZSe4VrP0R5KY7roqNXLR3BFO0A9o8kE8WMIjpxTIDS199oVin/tblZjSe7aTzyO5VodvD7IKZBGgf4bGG5FKUazXq9Ijqs40JV+gLkqdTRmWCPQaSDgt3QYhFUwrhGFdOguz3cF8YI0MyP3Iz41cBQqrwt5qfV5U5aPQ36UpwYnp4b1lvfS4ly6SN7GSBW1ifGGOKTx/W7CVSrV312anCFRcUm7Em009ml77FeBxTYHKpLbImIjlNDR69X3X9Km5f+AlIOSnNA/wE3oOqWeY9DikToqQcVVbz3u1PGP06ivlc4qMIZlD0SmiT2gJQnu0NSVDC7we1mTejGVuMuGcb7QlNzLXolNNypVwzqz3VxUYbYhHEpPTjTGNPUINzt2RNnF2rMyZtdurgWW2RMgKX2nH/kp3/mbfDfduIK6GlPWQT5uMHD3p5yuSGbwPcxoWBLS+W37Q/Zcn6EWL32cerD31gvdr9//rhznLXAoH531t/4lkXjdPN5NICV//xxxwRrwT989UoSz2rflUBQD0xP064OgxpKX1Gqo7SNUhGlHVIdVIhb4bQIXW3iJaiUnHBULIGF9F4otsFC4UNIV/NCNpwQsrs+Vd+6M3BUclC/b7R+R9V2HEPl3lBG9fvFFhiiJulxGmsMJF0LySr9FhqHwTL4hc1jp4VUoULYLhzEUbgRvxB7i5vEr6QR0nypQTolp8nF8mGdpMvXPa0fpF9hsBj+bhxj3Gb81DTGtCeIYTLmQ28oBTPpuwJbVKRE
lxBKb/Vblg6mqV8bRQN1ztC+96l5BqFUCuQF0LPcYB571Is98hKEs/HBvAxOVgL3QBmUEw+LYA7MptUXg4fO1bNop3nACxl0Z1JuJvXwEK9zqL2C0iIoBj/Mhz5UOxoWUP9+lLsb5tHtgYk35qrQSsX0LqYxS+lZRD2N/8Sq/W+smk8rLaW11C9WC6i3SoefxvyfrTiccnNp3BRYQj1mUV+/NluxNsKvceShWRbQs5z6zKR551A/D40vo9X9Wtvt80zSZqkgisrofphq1VUrqG+ZNpOX1s6ErFtGdY8RAgrU9Rvt+/c/XsmaXqj/MaHGjS4IpXNeONWm0WrDYQTkwkgYReiPgXFwL9wHeTCBOJ9EeE2mdaZCATzIBDgOJ9RfHXRLFszJHZaREXxnBt93AjQJq3xd1zh2OPHXJPzFiz/X4U9W/JFjO8d/T8K/W/FvdXg1CX946m7pB45X6vByHbZ14F878N84fj8IvxuGlzh+68VvLk6SvqnDi9Tx4iS88HW6dKEDv07Hrzh+yfELL/5PJ35eh59x/NSB/2Ml/uUt/ITjx9T945V4/txI6fxKPDcSP/qXKOkjjv8ShR9y/DPHP3H8I8ezdfjBmRjpA45nYvB9L57m+N4au/SeG/8Qiq0c3+X4e44nOB7n+A7HYxzf5tjC8S2OR+3YXJ0kNXNsOvKW1MTxyOHp0pG38Mgq8fCbSdLh6b4uPOwT30zCQxzfqMODHA9wbOT4O477i/B1K+7bmyTtK8K9DQ5pbxI2OHAPEb2nA3dzfI3jqxx3OXAnx1detkqvePFlK75UhPXUpb4OX+S4Y7tZ2sFxuxm3/TZC2laEv92qSL+NwK0KbjHiCxw311mkzRzrLLiJBm2qw40brNLGVNxgxec78Lnat6TnONauny7VvoW1q8T1zyZJ66fjep/4bBI+w3Hd0/2kdRyf7odPEZtP3Y1rnzRJa534JB0ZqaKmCKsJqeokXGPHJzg+vtouPc5xtR0f47iKYxVHX9dvVq6UfsNx5UpcUYSV+S6pMgmXc1zG8VErPmLGpUZcwnFxB1Z04KIOXNiB5RzLOC7gOC8OH+Y41z5MmjsJ53AsXYmzqVDCsZhjEcdZHGdy9A/Cwg6cYcbpHB/gOI1jwVSjVNCBU414f2iEdL8Xp3CcTCtPHob5LpzEFGlSOE504oQxIdIEjnkmvI/j+HsVaTzHexUcx3EstYzlOGa0Io0JwdHRFmm0gqMsOJJjbh2OqMPhHO8R+kr3dOCwt/DusejjOJTjXUMc0l1OHJJjk4Y4MGewRcrxddlwsAUHcczmOHCAUxrYgQP6K9IAJ/bPMkn9Fcwy4Z0xmGlB7x0mycvxDhNmpJukDAumm7BfX4PUT8G+Buzjxd69kqTeRdgrzSH1SsI0B6amJEmpd2NKEiYnmaRkGyaZMJFjAsd4G8YRn3EO9BRhbAfGEAsxRRhtQTch6OYY1YGRwzCCChEcw4swjJAK4xhKg0Ij0MXRyTGEo4M6ODjaiVf7MFRWoq0IrRwt5lDJwtFMvc2haOJoVNDAUU/d9Bx1TpSLUKRGkTTAhVSLnI5+iiT0RaYgcGRNrGjNM6z3/w8X/L8m4H95Rf8H1V2r7AplbmRzdHJlYW0KZW5kb2JqCjE2IDAgb2JqCjw8IC9MZW5ndGggNzcgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnicpY0JCoAwEAMHqvU+auttq/9/pUsfoIiBBJLsEvgJ9dInpGgycoqHqzJq9Xm9Fja0dPQYBqx4xyg6MbOwsrFz4CUJnPHnugFNEQIOCmVuZHN0cmVhbQplbmRvYmoKMTkgMCBvYmoKPDwgL0xlbmd0aCAzMzQgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnicXVLLboMwELzzFXtMDxGBJEaVEFKVXjj0odKeohyIvURIxViGHPj72h4gUpFgNLM7u2vW8al8LXU7Uvxpe1nxSE2r
leWhv1vJdOVbq6MkJdXKcWbhK7vaRLEzV9Mwclfqpo/ynOIvFxxGO9HmRfVXfoqIKP6wim2rb7T5OVWQqrsxv9yxHmkXFQUpbly5t9q81x1THMzbUrl4O05bZ3tkfE+GKQ08wUiyVzyYWrKt9Y2jfOeegvLGPUXEWv2LJzvYrs2an/p8wBl48fIe8v4IeaEJIAXsAYclNTgPUD2cgZAFZDHLAvKxCbKHMzDIAo1EBnmh6CvQQaCvOALE4kCBZ9B6LrBSRGWg2TzjTIUCMACTZTh5hgmydDGGOhl6Z/OhHhRRNPVwBl78Tpaf79fj79K6e3m31q09XLiwb7/pVvN6J01vvMu/f4I9vFkKZW5kc3RyZWFtCmVuZG9iagoxNCAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvQ0lERm9udFR5cGUyIC9CYXNlRm9udCAvQk1RUURWK0RlamFWdVNhbnMKL0NJRFN5c3RlbUluZm8gPDwgL1JlZ2lzdHJ5IChBZG9iZSkgL09yZGVyaW5nIChJZGVudGl0eSkgL1N1cHBsZW1lbnQgMCA+PgovRm9udERlc2NyaXB0b3IgMTMgMCBSIC9XIDE4IDAgUiAvQ0lEVG9HSURNYXAgMTYgMCBSID4+CmVuZG9iagoxNSAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvVHlwZTAgL0Jhc2VGb250IC9CTVFRRFYrRGVqYVZ1U2FucwovRW5jb2RpbmcgL0lkZW50aXR5LUggL0Rlc2NlbmRhbnRGb250cyBbIDE0IDAgUiBdIC9Ub1VuaWNvZGUgMTkgMCBSID4+CmVuZG9iagoxMyAwIG9iago8PCAvVHlwZSAvRm9udERlc2NyaXB0b3IgL0ZvbnROYW1lIC9CTVFRRFYrRGVqYVZ1U2FucyAvRmxhZ3MgMzIKL0ZvbnRCQm94IFsgLTEwMjEgLTQ2MyAxNzk0IDEyMzMgXSAvQXNjZW50IDkyOSAvRGVzY2VudCAtMjM2IC9DYXBIZWlnaHQgMAovWEhlaWdodCAwIC9JdGFsaWNBbmdsZSAwIC9TdGVtViAwIC9Gb250RmlsZTIgMTcgMCBSIC9NYXhXaWR0aCA5NzQgPj4KZW5kb2JqCjE4IDAgb2JqClsgMzIgWyAzMTggXSA0OCBbIDYzNiA2MzYgNjM2IDYzNiA2MzYgNjM2IF0gNjcgWyA2OTggXSA3MCBbIDU3NSBdIDk1ClsgNTAwIF0gOTcgWyA2MTMgNjM1IDU1MCA2MzUgNjE1IDM1MiA2MzUgXSAxMDUgWyAyNzggMjc4IF0gMTA4ClsgMjc4IDk3NCA2MzQgNjEyIDYzNSA2MzUgNDExIDUyMSBdIDExNyBbIDYzNCA1OTIgXSAxMjEgWyA1OTIgXSBdCmVuZG9iagozIDAgb2JqCjw8IC9GMSAxNSAwIFIgPj4KZW5kb2JqCjQgMCBvYmoKPDwgL0ExIDw8IC9UeXBlIC9FeHRHU3RhdGUgL0NBIDAgL2NhIDEgPj4KL0EyIDw8IC9UeXBlIC9FeHRHU3RhdGUgL0NBIDEgL2NhIDEgPj4gPj4KZW5kb2JqCjUgMCBvYmoKPDwgPj4KZW5kb2JqCjYgMCBvYmoKPDwgPj4KZW5kb2JqCjcgMCBvYmoKPDwgPj4KZW5kb2JqCjIgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9LaWRzIFsgMTEgMCBSIF0gL0NvdW50IDEgPj4KZW5kb2JqCjIwIDAgb2JqCjw8IC9DcmVhdG9yIChNYXRwbG90bGliIHYzLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZykKL1Byb2R1Y2VyIChNYXRwbG90bGliIHBkZiBiYWNrZW5kIHYzLjcuMSkgL0NyZWF0aW9uRGF0ZSAoRDoyMDI0MDUyOTIwMDY0NFopCj4+CmVuZG9iagp4cmVmCjAgMjEKMDAw
MDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMDE2IDAwMDAwIG4gCjAwMDAwMTAyNjEgMDAwMDAgbiAKMDAwMDAxMDA2NyAwMDAwMCBuIAowMDAwMDEwMDk5IDAwMDAwIG4gCjAwMDAwMTAxOTggMDAwMDAgbiAKMDAwMDAxMDIxOSAwMDAwMCBuIAowMDAwMDEwMjQwIDAwMDAwIG4gCjAwMDAwMDAwNjUgMDAwMDAgbiAKMDAwMDAwMDM0MiAwMDAwMCBuIAowMDAwMDAxMjkwIDAwMDAwIG4gCjAwMDAwMDAyMDggMDAwMDAgbiAKMDAwMDAwMTI3MCAwMDAwMCBuIAowMDAwMDA5NjI5IDAwMDAwIG4gCjAwMDAwMDkyNjkgMDAwMDAgbiAKMDAwMDAwOTQ4MiAwMDAwMCBuIAowMDAwMDA4NzEzIDAwMDAwIG4gCjAwMDAwMDEzMTAgMDAwMDAgbiAKMDAwMDAwOTg1MyAwMDAwMCBuIAowMDAwMDA4ODYyIDAwMDAwIG4gCjAwMDAwMTAzMjEgMDAwMDAgbiAKdHJhaWxlcgo8PCAvU2l6ZSAyMSAvUm9vdCAxIDAgUiAvSW5mbyAyMCAwIFIgPj4Kc3RhcnR4cmVmCjEwNDcyCiUlRU9GCg==\n"
},
"metadata": {}
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"df[\"label_name\"].value_counts(ascending=True).plot.barh()\n",
"plt.title(\"Frequency of Classes\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnqy0q0mYHZb"
},
"source": [
"In this case, we can see that the dataset is heavily imbalanced; the `joy` and `sadness` classes appear frequently, whereas `love` and `surprise` are about 5–10 times rarer. There are several ways to deal with imbalanced data, including:\n",
"\n",
"* Randomly oversample the minority class.\n",
"* Randomly undersample the majority class.\n",
"* Gather more labeled data from the underrepresented classes."
]
},
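{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the first option, random oversampling can be sketched with nothing but the standard library. The `oversample()` helper and the toy rows below are hypothetical and only meant to show the idea; a dedicated library is likely the better choice in practice, and the resampling should only ever touch the training split.\n",
"\n",
"```python\n",
"import random\n",
"from collections import Counter, defaultdict\n",
"\n",
"def oversample(rows, label_key='label_name', seed=0):\n",
"    # Group the rows by class label\n",
"    by_label = defaultdict(list)\n",
"    for row in rows:\n",
"        by_label[row[label_key]].append(row)\n",
"    # Every class is topped up to the size of the largest one\n",
"    target = max(len(group) for group in by_label.values())\n",
"    rng = random.Random(seed)\n",
"    balanced = []\n",
"    for group in by_label.values():\n",
"        balanced.extend(group)\n",
"        # Draw the missing examples with replacement\n",
"        balanced.extend(rng.choices(group, k=target - len(group)))\n",
"    return balanced\n",
"\n",
"# Toy training split (made-up rows, not taken from the dataset)\n",
"train_rows = [{'text': f'tweet {i}', 'label_name': 'joy'} for i in range(4)]\n",
"train_rows += [{'text': 'tweet 4', 'label_name': 'love'}]\n",
"balanced_rows = oversample(train_rows)\n",
"print(Counter(row['label_name'] for row in balanced_rows))\n",
"```\n",
"\n",
"After resampling, every class has as many rows as the largest one, at the cost of duplicated minority examples."
]
},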
{
"cell_type": "markdown",
"metadata": {
"id": "h-turUBWYHZb"
},
"source": [
"To keep things simple in this chapter, we'll work with the raw, unbalanced class frequencies. If you want to learn more about these sampling techniques, we recommend checking out the [Imbalanced-learn library](https://imbalanced-learn.org/stable/). Just make sure that you don't apply sampling methods _before_ creating your train/test splits, or you'll get plenty of leakage between them!\n",
"\n",
"Now that we've looked at the classes, let's take a look at the tweets themselves."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wrKe_b_KYHZb"
},
"source": [
"### How Long Are Our Tweets?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mJYrgUJ_YHZb"
},
"source": [
"Transformer models have a maximum input sequence length that is referred to as the _maximum context size_. For applications using DistilBERT, the maximum context size is 512 tokens, which amounts to a few paragraphs of text. As we'll see in the next section, a token is an atomic piece of text; for now, we'll treat a token as a single word. We can get a rough estimate of tweet lengths per emotion by looking at the distribution of words per tweet:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "1G7CjQ7BYHZb",
"outputId": "c3eed7aa-221f-42cc-c7c1-784f8dc54272",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 374
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"[Figure: box plot of words per tweet for each emotion label]"
]
},
"metadata": {}
}
],
"source": [
"df[\"Words Per Tweet\"] = df[\"text\"].str.split().apply(len)\n",
"df.boxplot(\"Words Per Tweet\", by=\"label_name\", grid=False, showfliers=False,\n",
" color=\"black\")\n",
"plt.suptitle(\"\")\n",
"plt.xlabel(\"\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eywME5BEYHZb"
},
"source": [
"From the plot we see that for each emotion, most tweets are around 15 words long and the longest tweets are well below DistilBERT's maximum context size. Texts that are longer than a model's context size need to be truncated, which can lead to a loss in performance if the truncated text contains crucial information; in this case, it looks like that won't be an issue."
]
},
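{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of truncation concrete, here is a minimal sketch that keeps treating a token as a single word. The `truncate()` helper and the toy inputs are hypothetical; a real tokenizer counts subword tokens and also reserves room for special tokens, so this is only an approximation.\n",
"\n",
"```python\n",
"MAX_CONTEXT = 512  # DistilBERT's maximum context size, in tokens\n",
"\n",
"def truncate(tokens, max_length=MAX_CONTEXT):\n",
"    # Keep the first max_length tokens and drop the rest\n",
"    return tokens[:max_length]\n",
"\n",
"short_text = 'a short example tweet'.split()  # toy input\n",
"long_text = ['word'] * 600                    # toy input longer than the limit\n",
"\n",
"print(len(truncate(short_text)), len(truncate(long_text)))\n",
"```\n",
"\n",
"Short inputs pass through unchanged, while anything beyond the limit is silently dropped, which is why truncation can hurt when the important information sits at the end of a long text."
]
},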
{
"cell_type": "markdown",
"metadata": {
"id": "PQXt1tQ7YHZc"
},
"source": [
"Let's now figure out how we can convert these raw texts into a format suitable for 🤗 Transformers! While we're at it, let's also reset the output format of our dataset since we don't need the `DataFrame` format anymore:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "_SyrEz9DYHZc"
},
"outputs": [],
"source": [
"emotions.reset_format()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pK-8S2f-YHZc"
},
"source": [
"## From Text to Tokens"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ItGh3KnEYHZc"
},
"source": [
"Transformer models like DistilBERT cannot receive raw strings as input; instead, they assume the text has been _tokenized_ and _encoded_ as numerical vectors. Tokenization is the step of breaking down a string into the atomic units used in the model. There are several tokenization strategies one can adopt, and the optimal splitting of words into subunits is usually learned from the corpus. Before looking at the tokenizer used for DistilBERT, let's consider two extreme cases: _character_ and _word_ tokenization."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xble8azeYHZc"
},
"source": [
"### Character Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OEwh5l7rYHZc"
},
"source": [
"The simplest tokenization scheme is to feed each character individually to the model. In Python, `str` objects are sequences of characters, so calling `list()` on a string gives us character-level tokenization in just one line of code:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "lZ7y52rxYHZc",
"outputId": "fe0d0fcd-93fd-499f-bf55-198fd432ac9b",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ',\n",
"'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o',\n",
"'f', ' ', 'N', 'L', 'P', '.']\n"
]
}
],
"source": [
"text = \"Tokenizing text is a core task of NLP.\"\n",
"tokenized_text = list(text)\n",
"print(tokenized_text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AIWQvWOjYHZc"
},
"source": [
"This is a good start, but we're not done yet. Our model expects each character to be converted to an integer, a process sometimes called _numericalization_. One simple way to do this is to encode each unique token (a character in this case) with a unique integer:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "xCc0cFo-YHZc",
"outputId": "8fdc79f2-e7af-4484-e2d2-79807a7d88ed",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9,\n",
"'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18,\n",
"'z': 19}\n"
]
}
],
"source": [
"token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}\n",
"print(token2idx)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WZeewGwvYHZd"
},
"source": [
"This gives us a mapping from each character in our vocabulary to a unique integer. We can now use `token2idx` to transform the tokenized text to a list of integers:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "DJNFQtxvYHZd",
"outputId": "d4393365-60b5-4a79-94b6-cbc10419c2a1",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7,\n",
"14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]\n"
]
}
],
"source": [
"input_ids = [token2idx[token] for token in tokenized_text]\n",
"print(input_ids)"
]
},
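{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the mapping can be inverted to recover the original string from the integer IDs. The sketch below rebuilds the variables from the previous cells so that it is self-contained; the `idx2token` name is our own choice and not part of any library.\n",
"\n",
"```python\n",
"text = 'Tokenizing text is a core task of NLP.'\n",
"tokenized_text = list(text)\n",
"token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}\n",
"input_ids = [token2idx[token] for token in tokenized_text]\n",
"\n",
"# Invert the vocabulary: integer ID -> character\n",
"idx2token = {idx: ch for ch, idx in token2idx.items()}\n",
"\n",
"# Map every ID back to its character and join them into a string\n",
"decoded_text = ''.join(idx2token[idx] for idx in input_ids)\n",
"print(decoded_text == text)\n",
"```\n",
"\n",
"Because every character has its own ID, the round trip is lossless for any text made only of characters that are already in the vocabulary."
]
},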
{
"cell_type": "markdown",
"metadata": {
"id": "wpWT989KYHZd"
},
"source": [
"Each token has now been mapped to a unique numerical identifier (hence the name `input_ids`). The last step is to convert `input_ids` to a 2D tensor of one-hot vectors. One-hot vectors are frequently used in machine learning to encode categorical data, which can be either ordinal or nominal. For example, suppose we wanted to encode the names of characters in the _Transformers_ TV series. One way to do this would be to map each name to a unique ID, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "gHkAd8l_YHZd",
"outputId": "5bfe908a-9028-43cf-f0c4-a74649dd3e3e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Name Label ID\n",
"0 Bumblebee 0\n",
"1 Optimus Prime 1\n",
"2 Megatron 2"
],
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Label ID</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Bumblebee</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>Optimus Prime</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>Megatron</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "categorical_df",
"summary": "{\n \"name\": \"categorical_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Bumblebee\",\n \"Optimus Prime\",\n \"Megatron\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Label ID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 0,\n 1,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"categorical_df = pd.DataFrame(\n",
" {\"Name\": [\"Bumblebee\", \"Optimus Prime\", \"Megatron\"], \"Label ID\": [0,1,2]})\n",
"categorical_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zSQvpDK5YHZd"
},
"source": [
"The problem with this approach is that it creates a fictitious ordering between the names, and neural networks are _really_ good at learning these kinds of relationships. So instead, we can create a new column for each category and assign a 1 where the category is true, and a 0 otherwise. In pandas, this can be implemented with the `get_dummies()` function as follows (recent pandas versions return Boolean columns, so the 1s and 0s are displayed as `True` and `False`):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "-8-nfnyGYHZd",
"outputId": "cca1127e-95ca-46e2-a2cf-e32e82048827",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Bumblebee Megatron Optimus Prime\n",
"0 True False False\n",
"1 False False True\n",
"2 False True False"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"pd\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Bumblebee\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n false,\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Megatron\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Optimus Prime\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"pd.get_dummies(categorical_df[\"Name\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-nXiduTYYHZd"
},
"source": [
"The rows of this `DataFrame` are the one-hot vectors, which have a single \"hot\" entry with a 1 and 0s everywhere else. Now, looking at our `input_ids`, we have a similar problem: the elements create an ordinal scale. This means that adding or subtracting two IDs is a meaningless operation, since the result is a new ID that represents another random token.\n",
"\n",
"On the other hand, the result of adding two one-hot encodings can easily be interpreted: the two entries that are \"hot\" indicate that the corresponding tokens co-occur. We can create the one-hot encodings in PyTorch by converting `input_ids` to a tensor and applying the `one_hot()` function as follows:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "4l9Ao24DYHZe",
"outputId": "11cb8662-3a31-4844-d5bf-702f0bb33bfa",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"torch.Size([38, 20])"
]
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"input_ids = torch.tensor(input_ids)\n",
"one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))\n",
"one_hot_encodings.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xPs95qM0YHZe"
},
"source": [
"For each of the 38 input tokens we now have a one-hot vector with 20 dimensions, since our vocabulary consists of 20 unique characters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-hU7j83PYHZe"
},
"source": [
"> Warning: It's important to always set `num_classes` in the `one_hot()` function because otherwise the one-hot vectors may end up being shorter than the length of the vocabulary (and need to be padded with zeros manually). In TensorFlow, the equivalent function is `tf.one_hot()`, where the `depth` argument plays the role of `num_classes`."
]
},
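{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the warning concrete, here is a small pure-Python sketch. The `one_hot()` helper below is a hand-written stand-in, not PyTorch's function: when `num_classes` is omitted it infers the vector length from the largest ID it sees, which is the situation the warning describes. The same sketch also shows that summing two one-hot vectors yields a vector whose hot entries mark both tokens:\n",
"\n",
"```python\n",
"def one_hot(ids, num_classes=None):\n",
"    # Hypothetical helper: if num_classes is omitted, infer it from the\n",
"    # largest ID, so the vectors can be shorter than the vocabulary\n",
"    if num_classes is None:\n",
"        num_classes = max(ids) + 1\n",
"    return [[1 if i == idx else 0 for i in range(num_classes)] for idx in ids]\n",
"\n",
"vocab_size = 20    # size of the character vocabulary in this section\n",
"ids = [5, 14, 12]  # IDs of 'T', 'o', and 'k'\n",
"\n",
"inferred = one_hot(ids)\n",
"explicit = one_hot(ids, num_classes=vocab_size)\n",
"print(len(inferred[0]), len(explicit[0]))\n",
"\n",
"# Summing two one-hot vectors marks both tokens as present\n",
"summed = [a + b for a, b in zip(explicit[0], explicit[1])]\n",
"hot_positions = [i for i, v in enumerate(summed) if v == 1]\n",
"print(hot_positions)\n",
"```\n",
"\n",
"With only these three IDs the inferred vectors have 15 entries instead of 20, so they would no longer line up with vectors built from text that happens to contain higher IDs."
]
},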
{
"cell_type": "markdown",
"metadata": {
"id": "kqllipJiYHZe"
},
"source": [
"By examining the first vector, we can verify that a 1 appears in the location indicated by `input_ids[0]`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "gQHIIOzyYHZe",
"outputId": "7adad57e-1fa1-4962-cb75-2c2e3bc02678",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Token: T\n",
"Tensor index: 5\n",
"One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])\n"
]
}
],
"source": [
"print(f\"Token: {tokenized_text[0]}\")\n",
"print(f\"Tensor index: {input_ids[0]}\")\n",
"print(f\"One-hot: {one_hot_encodings[0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZQMEUWX0YHZe"
},
"source": [
"From our simple example we can see that character-level tokenization ignores any structure in the text and treats the whole string as a stream of characters. Although this helps deal with misspellings and rare words, the main drawback is that linguistic structures such as words need to be _learned_ from the data. This requires significant compute, memory, and data. For this reason, character tokenization is rarely used in practice. Instead, some structure of the text is preserved during the tokenization step. _Word tokenization_ is a straightforward approach to achieve this, so let's take a look at how it works."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ebU_7gFJYHZe"
},
"source": [
"### Word Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H2OUgxIdYHZe"
},
"source": [
"Instead of splitting the text into characters, we can split it into words and map each word to an integer. Using words from the outset enables the model to skip the step of learning words from characters, and thereby reduces the complexity of the training process."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9YNUHfFWYHZf"
},
"source": [
"One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python's `split()` function directly on the raw text (just like we did to measure the tweet lengths):"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "atrv9WH_YHZf",
"outputId": "ff6666a1-4670-4eab-8669-c596b175e515",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']\n"
]
}
],
"source": [
"tokenized_text = text.split()\n",
"print(tokenized_text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UZbeOTssYHZf"
},
"source": [
"From here we can take the same steps we took for the character tokenizer to map each word to an ID. However, we can already see one potential problem with this tokenization scheme: punctuation is not accounted for, so `NLP.` is treated as a single token. Given that words can include declensions, conjugations, or misspellings, the size of the vocabulary can easily grow into the millions!\n",
"\n",
"\n",
"> note: Some word tokenizers have extra rules for punctuation. One can also apply stemming or lemmatization, which normalizes words to their stem (e.g., \"great\", \"greater\", and \"greatest\" all become \"great\"), at the expense of losing some information in the text."
]
},
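{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of such an extra punctuation rule, we can replace `split()` with a regular expression that treats each punctuation mark as its own token. This is only a sketch of one possible rule, not the approach of any particular tokenizer library, and `word_tokens` is a name chosen here for illustration:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"text = 'Tokenizing text is a core task of NLP.'\n",
"\n",
"# \\w+ matches a run of word characters; [^\\w\\s] matches a single\n",
"# punctuation mark, so 'NLP.' is split into 'NLP' and '.'\n",
"word_tokens = re.findall(r'\\w+|[^\\w\\s]', text)\n",
"print(word_tokens)\n",
"```\n",
"\n",
"A rule this simple also splits contractions and hyphenated words at every punctuation mark, which is one reason real word tokenizers carry longer rule sets."
]
},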
{
"cell_type": "markdown",
"metadata": {
"id": "-56vGAqJYHZf"
},
"source": [
"Having a large vocabulary is a problem because it requires neural networks to have an enormous number of parameters. To illustrate this, suppose we have 1 million unique words and want to compress the 1-million-dimensional input vectors to 1-thousand-dimensional vectors in the first layer of our neural network. This is a standard step in most NLP architectures, and the resulting weight matrix of this first layer would contain 1 million $\\times$ 1 thousand = 1 billion weights. This is already comparable to the largest GPT-2 model, which has around 1.5 billion parameters in total! (GPT-2 is the successor of GPT, and it captivated the public's attention with its impressive ability to generate realistic text. We'll explore GPT-2 in detail later in the book.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6bMPaWWuYHZf"
},
"source": [
"Naturally, we want to avoid being so wasteful with our model parameters since models are expensive to train, and larger models are more difficult to maintain. A common approach is to limit the vocabulary and discard rare words by considering, say, the 100,000 most common words in the corpus. Words that are not part of the vocabulary are classified as \"unknown\" and mapped to a shared `UNK` token. This means that we lose some potentially important information in the process of word tokenization, since the model has no information about words associated with `UNK`.\n",
"\n",
"Wouldn't it be nice if there was a compromise between character and word tokenization that preserved all the input information _and_ some of the input structure? There is: _subword tokenization_."
]
},
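{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vocabulary-limiting step described above can be sketched in a few lines. The toy corpus, the `max_vocab_size` of 4, and the choice of ID 0 for `UNK` are all assumptions made for illustration:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Toy corpus, made up for illustration\n",
"corpus = 'the cat sat on the mat and the dog sat on the log'.split()\n",
"max_vocab_size = 4  # hypothetical limit; a real one might be 100,000\n",
"\n",
"# Keep only the most frequent words and reserve ID 0 for unknown words\n",
"most_common = [word for word, _ in Counter(corpus).most_common(max_vocab_size)]\n",
"word2idx = {'UNK': 0}\n",
"for word in most_common:\n",
"    word2idx[word] = len(word2idx)\n",
"\n",
"sentence = 'the cat sat on the sofa'.split()\n",
"ids = [word2idx.get(word, word2idx['UNK']) for word in sentence]\n",
"print(ids)\n",
"```\n",
"\n",
"Every word outside the kept vocabulary, such as `sofa`, collapses onto the same ID, so the model cannot tell these words apart."
]
},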
{
"cell_type": "markdown",
"metadata": {
"id": "wh2-0X12YHZf"
},
"source": [
"### Subword Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6nDTfdjmYHZf"
},
"source": [
"The basic idea behind subword tokenization is to combine the best aspects of character and word tokenization. On the one hand, we want to split rare words into smaller units to allow the model to deal with complex words and misspellings. On the other hand, we want to keep frequent words as unique entities so that we can keep the length of our inputs to a manageable size. The main distinguishing feature of subword tokenization (as well as word tokenization) is that it is _learned_ from the pretraining corpus using a mix of statistical rules and algorithms.\n",
"\n",
"There are several subword tokenization algorithms that are commonly used in NLP, but let's start with WordPiece, which is used by the BERT and DistilBERT tokenizers (see M. Schuster and K. Nakajima, \"Japanese and Korean Voice Search,\" _2012 IEEE International Conference on Acoustics, Speech and Signal Processing_ (2012): 5149–5152, https://doi.org/10.1109/ICASSP.2012.6289079). The easiest way to understand how WordPiece works is to see it in action. 🤗 Transformers provides a convenient `AutoTokenizer` class that allows you to quickly load the tokenizer associated with a pretrained model: we just call its `from_pretrained()` method, providing the ID of a model on the Hub or a local file path. Let's start by loading the tokenizer for DistilBERT:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "zHt1MSB3YHZg",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202,
"referenced_widgets": [
"38d010373e3a42ff8890e31ebcad24ae",
"73e2fe9043c04971954625d3edab0fc6",
"e4a64e91c7e04c75a3b4fa3e8be9757f",
"693164f263ab43e5ab467368c7a25cf4",
"c57908289873496da36ee1d32dfe6bd9",
"7f850e7727784642a4dc8d39589908f8",
"089182a616fb40eea4bee001936a464a",
"aa39ff9e34b4489c9639e1b206aeb0e2",
"d0a034a140d84b8d99a7827770a0ecd9",
"acfb8bcbc7614059ad32eda744203fe5",
"c16d40b16c7441218f3b21d24721a0ae",
"b9b6b98fb6dc466682918391bcc4f445",
"fd602f4263a44971beb74f1100c1e118",
"b956c5823a4246c6aa4a09ebfde17843",
"f14df8d7fde349e998fb50b1efd56259",
"8b65eb2ee26444cd90773cf7100f908a",
"00742b4279fd49c894a3635e1145f6c1",
"38ea439e946a4d9b816d91ef76f0b2be",
"490f4c5fd63e4e8ab7b14cf404e835d5",
"c26caa7f78ef41baa41c7c31452be82d",
"9f24707ff6544d2894c7fe910bd25c14",
"843721633a76446fbef290c58f1abbef",
"04b5cdc2af5441cfbb826584ba89faeb",
"b58bc8e9ca49444abb14e810fc5ed2f6",
"a1685c348fe047358ce1b1fe880c44db",
"328c0596010147fa9dc2cad5020196bd",
"c83140c3829a4d1b99d21d4ca4a71f84",
"d1da61bfbf06466f9f04c2636491602a",
"b4fccb2614a84b22acb03ab9d1bd9c6c",
"4de3fd20fe3743038f279a54e51f0142",
"1e988a18f4ad42228159594bfed8bffa",
"f6dc5a7376fe47ffaccc0593048c369b",
"5685df0767e84f059fc145c6e4b42a09",
"08a39d8364424669a940c6f1f5354a1e",
"f884b945d9584a82bacc064560b3e019",
"fb186a0c74f84951b0c6831e7fee780d",
"cc00015ea20b403b82e8e67145aa6be5",
"738a8f7d954f40c1bad963159dc8db1e",
"cfa689f99ae0403f97fb49f6637760ad",
"8382e44639ce41d18ec4d1fc260cc1da",
"3e575479e7484b70920229916ca27bca",
"42e308ad50fe4135976108792ac14c62",
"5c12553e014841a9b6be64c902d8a581",
"f647956635084e87a0519bb6f050cdcd"
]
},
"outputId": "abeae742-aa43-4639-c5dd-91037260c756"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/48.0 [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "38d010373e3a42ff8890e31ebcad24ae"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"config.json: 0%| | 0.00/483 [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "b9b6b98fb6dc466682918391bcc4f445"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"vocab.txt: 0%| | 0.00/232k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "04b5cdc2af5441cfbb826584ba89faeb"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer.json: 0%| | 0.00/466k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "08a39d8364424669a940c6f1f5354a1e"
}
},
"metadata": {}
}
],
"source": [
"# hide_output\n",
"from transformers import AutoTokenizer\n",
"\n",
"model_ckpt = \"distilbert-base-uncased\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_ckpt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dCZAm71QYHZg"
},
"source": [
"The `AutoTokenizer` class belongs to a larger set of [\"auto\" classes](https://huggingface.co/docs/transformers/model_doc/auto) whose job is to automatically retrieve the model's configuration, pretrained weights, or vocabulary from the name of the checkpoint. This allows you to quickly switch between models, but if you wish to load the specific class manually you can do so as well. For example, we could have loaded the DistilBERT tokenizer as follows:\n",
"\n",
"```python\n",
"from transformers import DistilBertTokenizer\n",
"\n",
"distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w3S6sU0fYHZg"
},
"source": [
"> note: When you run the `AutoTokenizer.from_pretrained()` method for the first time you will see a progress bar that shows which parameters of the pretrained tokenizer are loaded from the Hugging Face Hub. When you run the code a second time, it will load the tokenizer from the cache, usually located at _~/.cache/huggingface/_."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZsmzsNOiYHZg"
},
"source": [
"Let's examine how this tokenizer works by feeding it our simple \"Tokenizing text is a core task of NLP.\" example text:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "bPUqHA_tYHZg",
"outputId": "4dacf9d5-f324-43ff-d010-e5e8bbd543c3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953,\n",
"2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
]
}
],
"source": [
"encoded_text = tokenizer(text)\n",
"print(encoded_text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tscNfTpEYHZg"
},
"source": [
"Just like we saw with character tokenization, we can see that the words have been mapped to unique integers in the `input_ids` field. We'll discuss the role of the `attention_mask` field in the next section. Now that we have the `input_ids`, we can convert them back into tokens by using the tokenizer's `convert_ids_to_tokens()` method:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "JxovGNpBYHZg",
"outputId": "3d030835-ff4e-4393-deff-7e2340675bcd",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl',\n",
"'##p', '.', '[SEP]']\n"
]
}
],
"source": [
"tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)\n",
"print(tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7yctsl7tYHZh"
},
"source": [
"We can observe three things here. First, some special `[CLS]` and `[SEP]` tokens have been added to the start and end of the sequence. These tokens differ from model to model, but their main role is to indicate the start and end of a sequence. Second, the tokens have each been lowercased, which is a feature of this particular checkpoint. Finally, we can see that \"tokenizing\" and \"NLP\" have been split into two tokens, which makes sense since they are not common words. The `##` prefix in `##izing` and `##p` means that the preceding string is not whitespace; any token with this prefix should be merged with the previous token when you convert the tokens back to a string. The `AutoTokenizer` class has a `convert_tokens_to_string()` method for doing just that, so let's apply it to our tokens:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "pZzlpzkHYHZh",
"outputId": "45ef5aa6-ac0b-42e8-a5d0-774f168f32c4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[CLS] tokenizing text is a core task of nlp. [SEP]\n"
]
}
],
"source": [
"print(tokenizer.convert_tokens_to_string(tokens))"
]
},
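{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build some intuition for the `##` convention, here is a rough sketch of how a WordPiece-style tokenizer can apply a fixed vocabulary with greedy longest-match-first splitting, and how the pieces can be merged back. It is a simplification for illustration only: it does not cover how the vocabulary is learned, it is not the library's implementation, and `wordpiece_tokenize()`, `merge_wordpieces()`, and `toy_vocab` are hypothetical names:\n",
"\n",
"```python\n",
"def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):\n",
"    # Greedy longest-match-first: repeatedly take the longest prefix of\n",
"    # the remaining characters that is in the vocabulary\n",
"    tokens, start = [], 0\n",
"    while start < len(word):\n",
"        end = len(word)\n",
"        piece = None\n",
"        while start < end:\n",
"            candidate = word[start:end]\n",
"            if start > 0:\n",
"                candidate = '##' + candidate  # continuation of a word\n",
"            if candidate in vocab:\n",
"                piece = candidate\n",
"                break\n",
"            end -= 1\n",
"        if piece is None:\n",
"            return [unk_token]  # no piece matches: the whole word is unknown\n",
"        tokens.append(piece)\n",
"        start = end\n",
"    return tokens\n",
"\n",
"def merge_wordpieces(tokens):\n",
"    # Glue '##' pieces onto the previous token, join the rest with spaces\n",
"    text = ''\n",
"    for token in tokens:\n",
"        if token.startswith('##'):\n",
"            text += token[2:]\n",
"        else:\n",
"            text += (' ' if text else '') + token\n",
"    return text\n",
"\n",
"# Toy vocabulary, made up for illustration (not DistilBERT's real one)\n",
"toy_vocab = {'token', '##izing', 'text', 'nl', '##p'}\n",
"words = ['tokenizing', 'text', 'nlp']\n",
"pieces = [p for w in words for p in wordpiece_tokenize(w, toy_vocab)]\n",
"print(pieces)\n",
"print(merge_wordpieces(pieces))\n",
"```\n",
"\n",
"With this toy vocabulary the pieces mirror the ones we saw above (`token`, `##izing`, `nl`, `##p`), and merging them restores the lowercased words. Unlike the real method, this sketch does not handle special tokens or punctuation."
]
},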
{
"cell_type": "markdown",
"metadata": {
"id": "XS8qLf5UYHZh"
},
"source": [
"The `AutoTokenizer` class also has several attributes that provide information about the tokenizer. For example, we can inspect the vocabulary size:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"id": "G6fLuP00YHZh",
"outputId": "46860806-bb09-4441-b950-6a66a56bf4d0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"30522"
]
},
"metadata": {},
"execution_count": 29
}
],
"source": [
"tokenizer.vocab_size"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bgWuNaX0YHZh"
},
"source": [
"and the corresponding model's maximum context size:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "CZZ9txKPYHZh",
"outputId": "22d31d74-a7dc-48af-896c-b803d34f32c3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"512"
]
},
"metadata": {},
"execution_count": 30
}
],
"source": [
"tokenizer.model_max_length"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KAdaaOlAYHZh"
},
"source": [
"Another interesting attribute to know about is the names of the fields that the model expects in its forward pass:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "4Ds8vIn3YHZi",
"outputId": "de42ddbb-ff70-49f7-ab92-b8cf691708ca",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['input_ids', 'attention_mask']"
]
},
"metadata": {},
"execution_count": 31
}
],
"source": [
"tokenizer.model_input_names"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WipZ-WIPYHZi"
},
"source": [
"Now that we have a basic understanding of the tokenization process for a single string, let's see how we can tokenize the whole dataset!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "98_IF_PVYHZi"
},
"source": [
"> warning: When using pretrained models, it is _really_ important to make sure that you use the same tokenizer that the model was trained with. From the model's perspective, switching the tokenizer is like shuffling the vocabulary. If everyone around you started swapping random words like \"house\" for \"cat,\" you'd have a hard time understanding what was going on too!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dz6A3pVpYHZi"
},
"source": [
"### Tokenizing the Whole Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iX8aC6CDYHZi"
},
"source": [
"To tokenize the whole corpus, we'll use the `map()` method of our `DatasetDict` object. We'll encounter this method many times throughout this book, as it provides a convenient way to apply a processing function to each element in a dataset. As we'll soon see, the `map()` method can also be used to create new rows and columns.\n",
"\n",
"To get started, the first thing we need is a processing function to tokenize our examples with:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"id": "ObKaF8a6YHZi"
},
"outputs": [],
"source": [
"def tokenize(batch):\n",
" return tokenizer(batch[\"text\"], padding=True, truncation=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x7Uj3kOrYHZi"
},
"source": [
"This function applies the tokenizer to a batch of examples; `padding=True` will pad the examples with zeros to the size of the longest one in a batch, and `truncation=True` will truncate the examples to the model's maximum context size. To see `tokenize()` in action, let's pass a batch of two examples from the training set:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"id": "ksU6TzFTYHZi",
"outputId": "3d760246-09fa-4424-95dc-49d334b505c9",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0,\n",
"0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000,\n",
"2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300,\n",
"102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
"0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
"1, 1]]}\n"
]
}
],
"source": [
"print(tokenize(emotions[\"train\"][:2]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "33OJjfQAYHZj"
},
"source": [
"Here we can see the result of padding: the first element of `input_ids` is shorter than the second, so zeros have been added to that element to make them the same length. These zeros have a corresponding `[PAD]` token in the vocabulary, and the set of special tokens also includes the `[CLS]` and `[SEP]` tokens that we encountered earlier:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"id": "TfeURweZYHZj",
"outputId": "aabd0372-b884-44f3-9525-fae2660926b1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 0 1 2 3 4\n",
"Special Token [PAD] [UNK] [CLS] [SEP] [MASK]\n",
"Special Token ID 0 100 101 102 103"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": 0,\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n \"[PAD]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 1,\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n 100,\n \"[UNK]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 2,\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n 101,\n \"[CLS]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 3,\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n 102,\n \"[SEP]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 4,\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n 103,\n \"[MASK]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 34
}
],
"source": [
"#hide_input\n",
"tokens2ids = list(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))\n",
"data = sorted(tokens2ids, key=lambda x : x[-1])\n",
"df = pd.DataFrame(data, columns=[\"Special Token\", \"Special Token ID\"])\n",
"df.T"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cZFUVRffYHZj"
},
"source": [
"Also note that in addition to returning the encoded tweets as `input_ids`, the tokenizer returns a list of `attention_mask` arrays. This is because we do not want the model to get confused by the additional padding tokens: the attention mask allows the model to ignore the padded parts of the input. <> provides a visual explanation of how the input IDs and attention masks are padded.\n",
"\n",
""
]
},
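{
"cell_type": "markdown",
"metadata": {},
"source": [
"The relationship between padding and the attention mask can be reproduced by hand. The sketch below is a simplified stand-in for what the tokenizer does with `padding=True`; `pad_batch()` is a hypothetical helper, and the ID sequences are shortened, made-up examples in the style of the output above:\n",
"\n",
"```python\n",
"def pad_batch(batch_ids, pad_id=0):\n",
"    # Pad every sequence to the length of the longest one and build a\n",
"    # mask with 1 for real tokens and 0 for padding\n",
"    max_len = max(len(ids) for ids in batch_ids)\n",
"    input_ids, attention_mask = [], []\n",
"    for ids in batch_ids:\n",
"        n_pad = max_len - len(ids)\n",
"        input_ids.append(ids + [pad_id] * n_pad)\n",
"        attention_mask.append([1] * len(ids) + [0] * n_pad)\n",
"    return {'input_ids': input_ids, 'attention_mask': attention_mask}\n",
"\n",
"batch = pad_batch([[101, 1045, 2514, 102], [101, 1045, 2064, 2175, 2013, 102]])\n",
"print(batch['input_ids'][0])\n",
"print(batch['attention_mask'][0])\n",
"```\n",
"\n",
"The shorter sequence gains two trailing zeros, and its mask has zeros in exactly those positions, so the model can tell real tokens from padding."
]
},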
{
"cell_type": "markdown",
"metadata": {
"id": "l5TJDYrVYHZj"
},
"source": [
"Once we've defined a processing function, we can apply it across all the splits in the corpus in a single line of code:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"id": "p2C8EjOiYHZj",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 113,
"referenced_widgets": [
"c9af961359e246f7a54d754004595fa6",
"b14596b44c49417b842f5098799db93b",
"e9bfa8a56003452994a71f3a80dc17fd",
"4d3f1e6fc84046bfbcc7cc3d3ba2a16a",
"87577e2d0dcb4aedbb3e025730182f10",
"1a2a0df350024a35898040131c6c736c",
"779ab3f393394999b3a3053297f3ebb6",
"31773d4170864ef488cb7761bc93b331",
"6fc803568b194fbdbfe99779afb2997b",
"cc6b6ce429334e60a7ef197eced505f3",
"39de634706494a7ca343e9e66c3f6320",
"941de4e3de8a452890d66674e4049a8e",
"66f84b92e8404f1b9f818b5ba3ebd8f7",
"03abec44beaf477dbc83aa28f95d1b29",
"f91ed07c9df642a5a2d81c4130418323",
"4fda35b9ac3f4488b371cd45c7933c52",
"5e54039b50044ffab816d1f0af8c77bd",
"c3553be35b344f4399f6f0f3014cc561",
"4926b94c9cad45ebbce03889e5400907",
"47a3e6e78e5f4cc4a7bbf007f3d8a147",
"0d308754e8604c3690fdecdfc47604c7",
"d9c938b452494f4da29ea57cc7a349b4",
"6c4212d1f6d7429ea3b6b207cad2bcd5",
"0bbc3c21ca0d49b2954b2f395ac03e54",
"609bd72b423f4e73aa1ff5c09ea90206",
"8986ebaa4efe48c9aaaa75a5a96fa9fe",
"9b998d70cb2640edb2a3f57cd2ff4016",
"c92bd13e557148a6a6a7414c806e6c97",
"ecb275c5402f4cdea292159f993d427f",
"fa44e411e7c449ea83eb82a161ca6930",
"035141ac5a8b4c2480827c0d2ebfd5b8",
"4bf65b03cf5a4241a20983dd4ff10933",
"523cadaed4fa4c958b0ff4bba01e08af"
]
},
"outputId": "ba831434-b827-4a7a-ba59-894dd3d8cfe1"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/16000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "c9af961359e246f7a54d754004595fa6"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "941de4e3de8a452890d66674e4049a8e"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "6c4212d1f6d7429ea3b6b207cad2bcd5"
}
},
"metadata": {}
}
],
"source": [
"# hide_output\n",
"emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DxR77dMXYHZj"
},
"source": [
"By default, the `map()` method operates individually on every example in the corpus, so setting `batched=True` will encode the tweets in batches. Because we've set `batch_size=None`, our `tokenize()` function will be applied on the full dataset as a single batch. This ensures that the input tensors and attention masks have the same shape globally, and we can see that this operation has added new `input_ids` and `attention_mask` columns to the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "v4bl2_kVYHZj",
"outputId": "789f0d37-630c-482b-aaec-78c3b7477dc8",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['text', 'label', 'input_ids', 'attention_mask']\n"
]
}
],
"source": [
"print(emotions_encoded[\"train\"].column_names)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h9D-surbYHZj"
},
"source": [
"> Note: In later chapters, we'll see how _data collators_ can be used to dynamically pad the tensors in each batch. Padding globally will come in handy in the next section, where we extract a feature matrix from the whole corpus."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "twowzyHXYHZk"
},
"source": [
"## Training a Text Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y2t6_T6vYHZk"
},
"source": [
"As discussed in <>, models like DistilBERT are pretrained to predict masked words in a sequence of text. However, we can't use these language models directly for text classification; we need to modify them slightly. To understand what modifications are necessary, let's take a look at the architecture of an encoder-based model like DistilBERT, which is depicted in <>."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cTC0nmqqYHZk"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w9l0j168YHZk"
},
"source": [
"First, the text is tokenized and represented as one-hot vectors called _token encodings_. The size of the tokenizer vocabulary determines the dimension of the token encodings, and it usually consists of 20k–200k unique tokens. Next, these token encodings are converted to _token embeddings_, which are vectors living in a lower-dimensional space. The token embeddings are then passed through the encoder block layers to yield a _hidden state_ for each input token. For the pretraining objective of language modeling (in the case of DistilBERT, guessing the masked tokens), each hidden state is fed to a layer that predicts the masked input tokens. For the classification task, we replace the language modeling layer with a classification layer."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SgweRPnSYHZk"
},
"source": [
"> Note: In practice, PyTorch skips the step of creating one-hot vectors for token encodings because multiplying a matrix with a one-hot vector is the same as selecting a column from the matrix. This can be done directly by getting the column with the token ID from the matrix. We'll see this later when we use the `nn.Embedding` class."
]
},
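{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the note above concrete, here is a minimal NumPy sketch (not PyTorch's actual implementation) showing that multiplying an embedding matrix by a one-hot vector gives the same result as indexing by token ID. The matrix sizes and the token ID are made-up values for illustration. In this sketch each token's embedding is stored as a row of `E`, which is also the layout `nn.Embedding` uses for its weights, so the lookup selects a row; with the transposed layout it would select a column.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"vocab_size, embed_dim = 5, 3  # toy sizes, chosen only for illustration\n",
"rng = np.random.default_rng(0)\n",
"E = rng.normal(size=(vocab_size, embed_dim))  # one row per token ID\n",
"\n",
"token_id = 2\n",
"one_hot = np.zeros(vocab_size)\n",
"one_hot[token_id] = 1.0\n",
"\n",
"via_matmul = one_hot @ E  # one-hot vector times the embedding matrix\n",
"via_lookup = E[token_id]  # direct lookup by token ID, no one-hot vector needed\n",
"\n",
"print(np.allclose(via_matmul, via_lookup))\n",
"```\n",
"\n",
"The lookup avoids building a vector with one entry per vocabulary item, which matters when the vocabulary has tens of thousands of tokens."
]
},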
{
"cell_type": "markdown",
"metadata": {
"id": "vq8v9hXgYHZk"
},
"source": [
"We have two options to train such a model on our Twitter dataset:\n",
"\n",
"- _Feature extraction_: We use the hidden states as features and just train a classifier on them, without modifying the pretrained model.\n",
"- _Fine-tuning_: We train the whole model end-to-end, which also updates the parameters of the pretrained model.\n",
"\n",
"In the following sections we explore both options for DistilBERT and examine their trade-offs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x1Us9Co3YHZk"
},
"source": [
"### Transformers as Feature Extractors"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mn7Zc8JbYHZk"
},
"source": [
"\n",
"Using a transformer as a feature extractor is fairly simple. We freeze the body's weights during training and use the hidden states as features for the classifier. The advantage of this approach is that we can quickly train a small or shallow model. Such a model could be a neural classification layer or a method that does not rely on gradients, such as a random forest. This method is especially convenient if GPUs are unavailable, since the hidden states only need to be precomputed once."
]
},
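{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of training a small model on fixed, precomputed features can be sketched without a transformer at all. The example below uses random NumPy arrays as stand-ins for hidden states (an assumption for illustration; they are not real DistilBERT outputs) and fits a nearest-centroid classifier, one of the simplest gradient-free methods. The function names `fit_centroids` and `predict` are hypothetical helpers, not part of any library.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def fit_centroids(X, y):\n",
"    # Average the feature vectors of each class to get one centroid per class\n",
"    classes = np.unique(y)\n",
"    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])\n",
"    return classes, centroids\n",
"\n",
"def predict(X, classes, centroids):\n",
"    # Assign each sample to the class whose centroid is closest (Euclidean distance)\n",
"    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)\n",
"    return classes[dists.argmin(axis=1)]\n",
"\n",
"# Stand-in features: two well-separated clusters in 8 dimensions\n",
"rng = np.random.default_rng(0)\n",
"X0 = rng.normal(loc=-2.0, size=(50, 8))\n",
"X1 = rng.normal(loc=2.0, size=(50, 8))\n",
"X = np.concatenate([X0, X1])\n",
"y = np.array([0] * 50 + [1] * 50)\n",
"\n",
"classes, centroids = fit_centroids(X, y)\n",
"accuracy = (predict(X, classes, centroids) == y).mean()\n",
"print(f'Training accuracy: {accuracy:.2f}')\n",
"```\n",
"\n",
"Because the features never change, they only need to be computed once; the classifier on top can then be refit cheaply on a CPU."
]
},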
{
"cell_type": "markdown",
"metadata": {
"id": "KiWbScK9YHZk"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sNX2G9a4YHZl"
},
"source": [
"#### Using pretrained models"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pl1CppEaYHZl"
},
"source": [
"\n",
"We will use another convenient auto class from 🤗 Transformers called `AutoModel`. Similar to the `AutoTokenizer` class, `AutoModel` has a `from_pretrained()` method to load the weights of a pretrained model. Let's use this method to load the DistilBERT checkpoint:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "QlvjQYx0YHZl",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 49,
"referenced_widgets": [
"63d0481a7f174a02bee6cb4a04582bcb",
"489a4b0c44134749acf0b0bba552f6b4",
"15e644ebc48541489d80226c038b2e72",
"b7397401da6a4262b4a0c1ddff68f2bd",
"59107720afb64b9cbbcf5460f10523ed",
"2f7955f408f8485a9b9c9cfa204a9c07",
"b3d77b4d234048b4afa2bff8627f4dbc",
"05f16d8e9442490c8a6621b44aa6487e",
"832a19ab360449f69e101aca57d006bc",
"657c96820d7c4ab99ab35eab6d74d419",
"de9fb37d4c84461db6f69f142bf3392c"
]
},
"outputId": "6255baff-8d46-427a-fd18-545b65349921"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"model.safetensors: 0%| | 0.00/268M [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "63d0481a7f174a02bee6cb4a04582bcb"
}
},
"metadata": {}
}
],
"source": [
"# hide_output\n",
"import torch\n",
"from transformers import AutoModel\n",
"\n",
"model_ckpt = \"distilbert-base-uncased\"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"model = AutoModel.from_pretrained(model_ckpt).to(device)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4cGdsDAGYHZl"
},
"source": [
"Here we've used PyTorch to check whether a GPU is available or not, and then chained the PyTorch `nn.Module.to()` method to the model loader. This ensures that the model will run on the GPU if we have one. If not, the model will run on the CPU, which can be considerably slower."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qsj8YjFpYHZl"
},
"source": [
"The `AutoModel` class converts the token encodings to embeddings, and then feeds them through the encoder stack to return the hidden states. Let's take a look at how we can extract these states from our corpus."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J52ZF5wmYHZm"
},
"source": [
"#### Extracting the last hidden states"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "z3Oyg-UEYHZm"
},
"source": [
"To warm up, let's retrieve the last hidden states for a single string. The first thing we need to do is encode the string and convert the tokens to PyTorch tensors. This can be done by providing the `return_tensors=\"pt\"` argument to the tokenizer as follows:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "zS1-aBssYHZm",
"outputId": "682cd43b-a182-42b0-925f-6a7d68b70778",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Input tensor shape: torch.Size([1, 6])\n"
]
}
],
"source": [
"text = \"this is a test\"\n",
"inputs = tokenizer(text, return_tensors=\"pt\")\n",
"print(f\"Input tensor shape: {inputs['input_ids'].size()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GvsD_17uYHZm"
},
"source": [
"As we can see, the resulting tensor has the shape `[batch_size, n_tokens]`. Now that we have the encodings as a tensor, the final step is to place them on the same device as the model and pass the inputs as follows:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"id": "qEhCVfn5YHZm",
"outputId": "ba2e7200-9486-4804-cda5-d91a3508a252",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"BaseModelOutput(last_hidden_state=tensor([[[-0.1565, -0.1862, 0.0528, ...,\n",
"-0.1188, 0.0662, 0.5470],\n",
" [-0.3575, -0.6484, -0.0618, ..., -0.3040, 0.3508, 0.5221],\n",
" [-0.2772, -0.4459, 0.1818, ..., -0.0948, -0.0076, 0.9958],\n",
" [-0.2841, -0.3917, 0.3753, ..., -0.2151, -0.1173, 1.0526],\n",
" [ 0.2661, -0.5094, -0.3180, ..., -0.4203, 0.0144, -0.2149],\n",
" [ 0.9441, 0.0112, -0.4714, ..., 0.1439, -0.7288, -0.1619]]],\n",
" device='cuda:0'), hidden_states=None, attentions=None)\n"
]
}
],
"source": [
"inputs = {k:v.to(device) for k,v in inputs.items()}\n",
"with torch.no_grad():\n",
"    outputs = model(**inputs)\n",
"print(outputs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kn6hq8P-YHZm"
},
"source": [
"Here we've used the `torch.no_grad()` context manager to disable the automatic calculation of the gradient. This is useful for inference since it reduces the memory footprint of the computations. Depending on the model configuration, the output can contain several objects, such as the hidden states, losses, or attentions, arranged in a class similar to a `namedtuple` in Python. In our example, the model output is an instance of `BaseModelOutput`, and we can simply access its attributes by name. The current model returns only one attribute, which is the last hidden state, so let's examine its shape:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "SPZzRlEmYHZm",
"outputId": "eb71864e-2793-4d20-8110-260ceb890327",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"torch.Size([1, 6, 768])"
]
},
"metadata": {},
"execution_count": 40
}
],
"source": [
"outputs.last_hidden_state.size()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D8ixQ-OIYHZn"
},
"source": [
"Looking at the hidden state tensor, we see that it has the shape `[batch_size, n_tokens, hidden_dim]`. In other words, a 768-dimensional vector is returned for each of the 6 input tokens. For classification tasks, it is common practice to just use the hidden state associated with the `[CLS]` token as the input feature. Since this token appears at the start of each sequence, we can extract it by simply indexing into `outputs.last_hidden_state` as follows:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"id": "jaF5mLDpYHZn",
"outputId": "67804734-f0dc-4c24-91cf-53b19bf81f64",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"torch.Size([1, 768])"
]
},
"metadata": {},
"execution_count": 41
}
],
"source": [
"outputs.last_hidden_state[:,0].size()"
]
},
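{
"cell_type": "markdown",
"metadata": {},
"source": [
"The indexing used above can be checked on a small stand-in array. This sketch uses NumPy rather than PyTorch (the slicing semantics are the same for this case) and a made-up array with the same shape as the hidden states printed earlier; the values are arbitrary.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"batch_size, n_tokens, hidden_dim = 1, 6, 768\n",
"# Stand-in for last_hidden_state; the values are arbitrary\n",
"hidden = np.arange(batch_size * n_tokens * hidden_dim, dtype=np.float32)\n",
"hidden = hidden.reshape(batch_size, n_tokens, hidden_dim)\n",
"\n",
"# [:, 0] keeps every item in the batch and picks the first token position\n",
"first_token = hidden[:, 0]\n",
"print(first_token.shape)\n",
"\n",
"# Equivalent, more explicit form with the hidden dimension spelled out\n",
"assert np.array_equal(first_token, hidden[:, 0, :])\n",
"```\n",
"\n",
"The token axis disappears because a single position is selected, leaving one vector per sequence in the batch."
]
},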
{
"cell_type": "markdown",
"metadata": {
"id": "-tJQqquaYHZn"
},
"source": [
"Now that we know how to get the last hidden state for a single string, let's do the same thing for the whole dataset by creating a new `hidden_state` column that stores all these vectors. As we did with the tokenizer, we'll use the `map()` method of `DatasetDict` to extract all the hidden states in one go. The first thing we need to do is wrap the previous steps in a processing function:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"id": "NWt8kNZoYHZn"
},
"outputs": [],
"source": [
"def extract_hidden_states(batch):\n",
"    # Place model inputs on the GPU\n",
"    inputs = {k:v.to(device) for k,v in batch.items()\n",
"              if k in tokenizer.model_input_names}\n",
"    # Extract last hidden states\n",
"    with torch.no_grad():\n",
"        last_hidden_state = model(**inputs).last_hidden_state\n",
"    # Return vector for [CLS] token\n",
"    return {\"hidden_state\": last_hidden_state[:,0].cpu().numpy()}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Iy9SPhhfYHZn"
},
"source": [
"The only difference between this function and our previous logic is the final step where we place the final hidden state back on the CPU as a NumPy array. The `map()` method requires the processing function to return Python or NumPy objects when we're using batched inputs.\n",
"\n",
"Since our model expects tensors as inputs, the next thing to do is convert the `input_ids` and `attention_mask` columns to the `\"torch\"` format, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"id": "HaRjENL_YHZn"
},
"outputs": [],
"source": [
"emotions_encoded.set_format(\"torch\",\n",
" columns=[\"input_ids\", \"attention_mask\", \"label\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "89WQ_1W-YHZn"
},
"source": [
"We can then go ahead and extract the hidden states across all splits in one go:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"id": "W0VfwuXPYHZn",
"outputId": "91cfc0df-0760-4181-91cd-0ea49730580e",
"colab": {
"referenced_widgets": [
"e039fd5ac4b242c5883403e93933622c",
"43418cc35a554cae9aee68ba130c9f03",
"5d3d265ec0cf4de4be9c81f79409dc34",
"328cf7c5a06f41b28cf48226963de556",
"6c10079c765b49aba7ab56b042a493d4",
"693e1648d7344c1db425c6bcb7630ef9",
"04ad28b29bb248a2a55c956128897e50",
"0015eaa549e94f9b8885141c97cc7dd2",
"847805d47b4f460586c35fc1bf5523d2",
"25f7279b981346e69c23f1d951ba43a6",
"aaadd88cd2184331884f04a8b48d0e2b",
"3f24b9948d954bfd972ea2c0ba125d71",
"a0086a30224142b49d59fccf50911434",
"b29d9b27e0134bb28ecae6aa60dc9ebb",
"39d32faa742b4abbb6595a2511b27457",
"137c3043c9d74052becc11ce6e5197d4",
"c60a0b21d1d7491d8efedb962c23a4ad",
"ba18938f845d431b808b60d784aae3ce",
"431c83ea2b8846bea3543102eb5cfdf9",
"a26e0e64f933455eb2951bbe641717c3",
"78a1cdd8043f47a7bbfff983fc7c96d6",
"06ff32df819243b6a2a997fec72cdefb",
"65c7d094cdfb44c1a5382684b02e4f04",
"e0d4a186cc9942b58657e5500bae4268",
"914326d31bc849e3a8a96a4654cdc8f9",
"97b41b8531fc431d8200d895156d0adc",
"4fc7b42f8d194b77a358e8ddaf62f728",
"14567f6276dd4a4c88fcf5431aafdc8d",
"b337f995be404997abf6d2cc8a6b0a25",
"ec9176e36fb5428e95f22ee5b2dc0e3a",
"b325c2be1f7e41fa866fac9731e620fd",
"2394597be68a4d6195dc36cf002ac2b1",
"61d5e56e3df24ccf93ca108fb6a8b6c7"
],
"base_uri": "https://localhost:8080/",
"height": 113
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/16000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "e039fd5ac4b242c5883403e93933622c"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "3f24b9948d954bfd972ea2c0ba125d71"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "65c7d094cdfb44c1a5382684b02e4f04"
}
},
"metadata": {}
}
],
"source": [
"#hide_output\n",
"emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9iJB5YtiYHZo"
},
"source": [
"Note that we did not set `batch_size=None` in this case, so the default `batch_size=1000` is used instead. As expected, applying the `extract_hidden_states()` function has added a new `hidden_state` column to our dataset:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"id": "rzbHsPeXYHZo",
"outputId": "3ce4b4a8-ae42-4b31-8f08-8eb31456e328",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['text', 'label', 'input_ids', 'attention_mask', 'hidden_state']"
]
},
"metadata": {},
"execution_count": 45
}
],
"source": [
"emotions_hidden[\"train\"].column_names"
]
},
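{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what batched processing means in practice, here is a simplified pure-Python sketch of applying a function to a column-oriented dataset in chunks. It is not how 🤗 Datasets implements `map()`; the helper `map_in_batches`, the toy function `add_length`, and the three example rows are all hypothetical. The sketch only illustrates that the function receives a dictionary of lists and returns new columns.\n",
"\n",
"```python\n",
"def map_in_batches(columns, fn, batch_size=1000):\n",
"    # columns: dict mapping column name -> list of values (all the same length)\n",
"    n_rows = len(next(iter(columns.values())))\n",
"    new_columns = {}\n",
"    for start in range(0, n_rows, batch_size):\n",
"        # Slice every column to build one batch\n",
"        batch = {k: v[start:start + batch_size] for k, v in columns.items()}\n",
"        # fn returns a dict of new columns for this batch\n",
"        for name, values in fn(batch).items():\n",
"            new_columns.setdefault(name, []).extend(values)\n",
"    # Merge the original columns with the newly computed ones\n",
"    return {**columns, **new_columns}\n",
"\n",
"def add_length(batch):\n",
"    # Toy processing function: one output value per input row\n",
"    return {'length': [len(t) for t in batch['text']]}\n",
"\n",
"data = {'text': ['i feel fine', 'so happy', 'ugh'], 'label': [1, 1, 0]}\n",
"result = map_in_batches(data, add_length, batch_size=2)\n",
"print(list(result.keys()))\n",
"```\n",
"\n",
"A larger batch size means fewer calls to the processing function, at the cost of more memory per call."
]
},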
{
"cell_type": "markdown",
"metadata": {
"id": "gGipAIQcYHZo"
},
"source": [
"Now that we have the hidden states associated with each tweet, the next step is to train a classifier on them. To do that, we'll need a feature matrix, so let's build one."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TyPQKsoqYHZo"
},
"source": [
"#### Creating a feature matrix"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hdNTPLd9YHZo"
},
"source": [
"The preprocessed dataset now contains all the information we need to train a classifier. We will use the hidden states as input features and the labels as targets. We can easily create the corresponding arrays in the familiar scikit-learn format as follows:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"id": "5JRjXGPnYHZo",
"outputId": "3f00495b-b804-4797-d901-f7c978dea0ca",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((16000, 768), (2000, 768))"
]
},
"metadata": {},
"execution_count": 46
}
],
"source": [
"import numpy as np\n",
"\n",
"X_train = np.array(emotions_hidden[\"train\"][\"hidden_state\"])\n",
"X_valid = np.array(emotions_hidden[\"validation\"][\"hidden_state\"])\n",
"y_train = np.array(emotions_hidden[\"train\"][\"label\"])\n",
"y_valid = np.array(emotions_hidden[\"validation\"][\"label\"])\n",
"X_train.shape, X_valid.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q-qVrq09YHZo"
},
"source": [
"Before we train a model on the hidden states, it's good practice to perform a sanity check to ensure that they provide a useful representation of the emotions we want to classify. In the next section, we'll see how visualizing the features provides a fast way to achieve this."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KqBUP2iBYHZo"
},
"source": [
"#### Visualizing the training set"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pt3FgqeYYHZo"
},
"source": [
"Since visualizing the hidden states in 768 dimensions is tricky to say the least, we'll use the UMAP algorithm (L. McInnes, J. Healy, and J. Melville, [\"UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction\"](https://arxiv.org/abs/1802.03426), 2018) to project the vectors down to 2D. Since UMAP works best when the features are scaled to lie in the [0,1] interval, we'll first apply a `MinMaxScaler` and then use the UMAP implementation from the `umap-learn` library to reduce the hidden states:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"id": "3eAiWRrUYHZp",
"outputId": "50fd0aff-d43c-451a-9072-15bde2fedfc4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" X Y label\n",
"0 4.484186 7.000147 0\n",
"1 -2.808481 6.812856 0\n",
"2 5.322850 3.481357 3\n",
"3 -2.365859 4.384464 2\n",
"4 -3.212201 4.794142 3"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df_emb",
"summary": "{\n \"name\": \"df_emb\",\n \"rows\": 16000,\n \"fields\": [\n {\n \"column\": \"X\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 15996,\n \"samples\": [\n 5.056397914886475,\n -0.5000765323638916,\n -0.4390697479248047\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Y\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 15989,\n \"samples\": [\n 5.140077590942383,\n 5.7701873779296875,\n 9.122005462646484\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 5,\n \"num_unique_values\": 6,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 47
}
],
"source": [
"from umap import UMAP\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Scale features to [0,1] range\n",
"X_scaled = MinMaxScaler().fit_transform(X_train)\n",
"# Initialize and fit UMAP\n",
"mapper = UMAP(n_components=2, metric=\"cosine\").fit(X_scaled)\n",
"# Create a DataFrame of 2D embeddings\n",
"df_emb = pd.DataFrame(mapper.embedding_, columns=[\"X\", \"Y\"])\n",
"df_emb[\"label\"] = y_train\n",
"df_emb.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jeb2_7JUYHZp"
},
"source": [
"The result is an array with the same number of training samples, but with only 2 features instead of the 768 we started with! Let's investigate the compressed data a little bit further and plot the density of points for each category separately:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"id": "4E3ucjBOYHZp",
"outputId": "6cb3e722-857f-424e-9179-94555950842e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 485
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/svg+xml": "\n\n\n",
"application/pdf": "JVBERi0xLjQKJazcIKu6CjEgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDIgMCBSID4+CmVuZG9iago4IDAgb2JqCjw8IC9Gb250IDMgMCBSIC9YT2JqZWN0IDcgMCBSIC9FeHRHU3RhdGUgNCAwIFIgL1BhdHRlcm4gNSAwIFIKL1NoYWRpbmcgNiAwIFIgL1Byb2NTZXQgWyAvUERGIC9UZXh0IC9JbWFnZUIgL0ltYWdlQyAvSW1hZ2VJIF0gPj4KZW5kb2JqCjExIDAgb2JqCjw8IC9UeXBlIC9QYWdlIC9QYXJlbnQgMiAwIFIgL1Jlc291cmNlcyA4IDAgUgovTWVkaWFCb3ggWyAwIDAgNDkyLjQ4IDM0OC4zMjA2MjUgXSAvQ29udGVudHMgOSAwIFIgL0Fubm90cyAxMCAwIFIgPj4KZW5kb2JqCjkgMCBvYmoKPDwgL0xlbmd0aCAxMiAwIFIgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnic7X1LkxS31m2P9StyaA+o1vsxNNd2RTD6wB3hgcODE7jBJozPxcS14/77b+uRVSXIBHWTVK3MEkTbiFq1UUp7r/3QI2+/v//nj5f3L/ZPh//zE7s9tl6+Z2J4Qz+vBz68oZ9/BzHs6ec149R6y3SQO+3pj3+Of1Ta75TkVhr6O143f2fsFbv9jr7+nr6yZ24nB+H5zkkSJYzbhUP7z7GtpI5i/0zgQyNJeseeD++GEyHC5P9ptXN2+Pt++Hn4a7j9TsZ/j54j9p2eJfZcDE+M2lnupBHcczFISf8Yl0p5Eczw8u1w+z98+P6/Bzj9iJ33PAThlDk/gu+CCZpTD7mLs3Ha3NfN103ivJBSWm5tSOJOmvu62SYueKWC4FyZ3Ltjc183W8RFfRv1DGUCRoTdecND/j08EYrvbDi0F59zrrki+T5P0klzXzfbJskJo7xVPmRxJ8193WwTp7nzWgnlVBJ30tzXzTZxgnhCcGdNopvT5r5uNvYuCKWD9lnasbWvWo3GYkxQRA4q295Jc183sbV7Md2NFu2DdEoZW+z90NzXzcYBtlpL0hWr8wAfm/u6ucyQ8Z0RTnhjg/PpHzxp7utmW/+lIcW0TupsCifNfd1sEpfcLX2xVdNlYn47qvqxua+bXTsfp53kWr2UzvLi3I7Nfd1sVHavpQxCy9y7k+a+braJs9Zprr0uunfS3NfNNnFGOyVdMLxYxrG5r5tt4qjNrdLKZeU8ae7r5gUeFl7XSbmsDsoaK0sEcGzu62ajJi8ZRF4gglnUT1DIo5WSFAf4PBwnzX3dbDQdRSGK8pTtZNM5Nvd18wK9e5iu09QIaaQJ5LfyxB2b+7r59QxtSYdA9MwDsXSemZPmvm42zoxxlDZ6XljopLmvm41jvbCRcOGDMJ7nhz1p7utmI+WKoCWpWlHDk+a+bjYaibfZX2YbObT2VesinnRhLdmSCbWFSVZ7KaRUJUw6NPd18wJJF1nl+CvHNSfNfd28iFVoKyR3UubOHVv7qrWJcNULF4Iq4eWxta9aIFWn9jgNuf61MBVdIl1fNilZeDgWDge009QDpcronjT3dfMCRvKoFKLZihbm6KWTJ8e1kz6UcTy09lXrIs5tade7cJrlvaB/3eo8cifNfd28iENa2HptXENyzso8difNfd28eNS3cMD74AxzYeuUggsdKO7K4o7Nfd1spTlJiZI2WheaOzT3dfMyvZPk7VJaWMQdmvu62SguaG2VtYV3T5r7unmBPOprFekWLTcv62EXNoyFJ3fhh104516y8t/Xob9GePEVspslNRx50qHUos/Mdc9uR6xgkpb3192i16sKy24hXLqouGxJZNFVMGRFPW8kuGxusvQaj3VaKxnyXpNja1+1GjeuSGu1MsFn7T5p7utm47In19I640OuJ5w093Wzq+MD1NE74zTP0z3+eX/y5/MvSXkd
NyGN+eexta9aF1nI77nx5Us70Jv4LTdceq/KDuyT5r5uXmAnIbK+nVcjly3nLbodZ+FloyVXA7uGLRkFLruasbDeLLw4srAarlnJSK/UzlCYkH9TW+qdo+EUSkolltezJVfcl1WyJXlz0aNKSy6Ow2yQF353QOuHKtHCZYplg/JFD+8svlGKBl4ZrV1m0mNrX7UaZVkX6Au8WPNJc183u2Y+TjOXPi267GasJQlTLXuoaLmuwavl8gvui7nopn9t2a3NNlgRlHPFAE+a+7p5kfx54QPRi5Zuz7Qr5xNqvDB3LqtYC6cgsTInvPAm7784ae7rZiOzL3vsTVPwz2UIvhxeODb3dfMiIczSZ4Evv7et3UYWfvRl136W3hW8rI0svHy4iPNbdkvqwgO26Lmeh9c4m8PyB25SXnjzJPSeZ/JxhnycUaV2cGzu62ajgS973cwSa31N7ixeICADt3kUTpr7utloF8tez/CIJZ92f7Gwdi58R8zCh30XLgstuO7rZHCeE/UWz3hs7uvmFk7zGKeE1cKO2fexua+bF9GSR2SI7ea2bL1m6aXqhVOYhdd4lj3wtrBOL33aB/hhv2oK06RYypJmextKGHts7uvm+aKMxY1n6Zod8AEj+Arq2SPfB6csC2f925hdMA2Bvi+iq8XXmPQ+M9c9d31mVj13C4dgC683rdudQGrF4hn10geclq3WLnyhGcA1KFD6tuROiWWvi1tYj5ZJoBe9n/sqCG1xn7J0nXfZy3H72aiL69vCu6ew7/tculi/7EXcV6FvS9/guvDC7MI3+q1ht97XRTxn5d1NcngWh3D4d4ivxajfT1W9Q6p6RxT76YN3Sr396J1SBPnkO6jK5wU/+X0e+5aGN75fKvbwNXuXHu5JfDqjdoo8pTSDUjx+6eVb9vSO3f4oBkFChrtX6U1ad7+xX4Zvbt7f/Ofmt5u/bu7pT++/HX4d7p6xH+7SQIj8EivShfIiLj7Op4vLmsdnUFLs7PERcvPQ44KeepFWLehT79KK+uetdNqI5BzTa34kzZ1gf+eRmEcML/af+vTvo4I+MXonlAiGc2fzG7qEMVLww0kj8Vk9WgyR4qegjQs89dlaE5yOrzWLhRshtPDWG+r+MItkFfLFfh75oczzPaOguMkrCp3yumIQkuYlpEkSMsSpkOUZp5GsQuZnbJPZ1r95tWlXOTavcl93fNsRnz/MueCkexe04hTipnGjyZEuULibslJB2hltMA/wDJJVyBf7eeSHMpv656Tl0mvDbVrPdYoyP8mTAolgOY+rabl/M0hWIV/s55EfymwbPzJU4ax28VlJ04MUhnQtaXvwPLZtGb9pJKuQefzaZLb1z8hy8CAFqsGQZGlVMgkyxMA516V/00hWIXP/2mQ2kk7Q9HTEhCnylV7Q09tMEFoaSYzvRtKZRLIKWUinSWYnna9BOnHEOf0lZfRp+N2prowur4zeNJCdAvNANkls6pwSguI4qV2ko3i4i1wTGX1kBZquoEh0ceQzSFYhU/caZTb1TzgKcSzpbDQnykBc2tgdH9UpISkooGHI/ZtBsgqZ+tcos6l/kghSasuTD6eIJRhtlfTpWSPlcy/K+M0gWYVM/WuUuRG+60HWxvjOOuuFtSmAoD9ox433KdqhZCUvDebRmwayCpgGsk1im73qgyRjD2XRWDDhOZQofZvEsQqXbbVFXlPPeNYPHUWR+kplJLntyFAxCEuvj81dmwayCpj61iaxbdiEsNLK4CKLk/krI5x3iZJMjsZkCUtmkKxC5qFrk7mJsJM4MVh6Xu/T1wVl8YqniQnOK22dD2NYN41kFTLTS5vMtv6psoiUZFlug+QUMyZa81mtS/emgewUmHvXJPGKOLiRYeP2b6EsubA0VsJoGkftsnc1Ji728TEDnQSyCpg1uUliW7wYYmE/FvEiL1vrjSA7ThaiuDLOGF00ZQbJKmSOF9tkbiJePOVkqWwwKZQf0tU2RCsUWKmPWf4IZBXw
I5afl9g2eORWday7xWej8J2CeO6diw8qdbBOybF3M0hWIfPgtcls6l/yX4mS49c1xZxOF3q2lGSQCpVYewbIToGpd20S21yQEmkPTPxuMPRcxMjJWRhDsYCKp2yy1U7hWIXLRtsir8fYX8TvK3DMF/d97X4LvHRK5iS51SapLFEPRSjEjyrRuDMiHo0p/ZtBsgqZ2a1N5gZcV8sTxKPAObWIjty7cgg/zqOmf81Yp4sxzSBZhUxP0CizLTNN6ZmwKRCK+380aXYyUu6MVtKMHD0NZBUwZ6ZNEreRwkAv+fHTbSKJaDUn5dZZ1mFD4d8f7C85RbIK+WI/j/xQ5jbmdz2eBt1pS5k270kf6/OGHEHKhVMB3IyjU6ol00hWIXO1pE1mm6fRNv+Kaq0tz3oTnzWKDNILXvo3g2QVMnuaNpkrL9JJz31cFNWpZuWdJW+kbTQKR7rrpbXjyM0gWYXMvWuT2ebhxOGJ4tcpo+HxKHQkB1Jmyqp9qfZPA1kFzB6uSWIbA+bbO3kiEXpOIlDy3sncjLeWuL/M6zSQVcDMf00SGxf+uRSBBj6xu4x3+BKVykxuhscT+eNCyTSSVcgSvTbJbOsfclUIOrCCXn9FiFra/S745gT0DBQ9PyI6MCEon8IeqyXlsKQ7LgWa1jgvYzkqF3enkaxC5uJum8w270tfyDuHo6d0pMiGUgadPDnBHckave80klXI7H3bZLZtztLcBcm5Su5IK/LcolihjJ5JEVeUzVnTSFYh8+asNpkrL1Gie19HT6M9jbbNgoUmYclX6nyDkiqGOw1kFTDPa5PElYfL6A4DIZFs9r7YsX38OxV0qi0GZfLyQH5qQ38fSawY8DSSVchswG0y1597kKOOa+MyVb6sOxznjAybt0iMy+8zSFYhc1zQJrMtvM8xYwp3PT+pMvlw+Eey5k0CWQXMmtcksY2ahZBW+5ACNB/3HqZfiSnyOw7HHW4zSFYhMzm3ydzE6hvlByZwQ94wPitFaMp6lbykCkRkxqgxZpkEsgqYR69JYltE6tIKsUzlEGso8CF/ZJLWSHJTKRoqEek0klXIHJG2yWyjFRkNyblEUZ4URlrKDnIMJJ2xVsnSvxkkq5CZWNpkbq6oi7wFCdp9AAenq/cu0JUt+M3dyCaNEPtjU/di5L6FMAleXcDHD702iz5+a6CD8yDOeh4Y24Whb8m55MZOgIU1zLFBQ5zRns+D6Mp36Rm4YuXbIgJqkjj+qlYPZSHUdoOqj72JuUn5wPN46H1+12K5Z00xoRfgTDgcxU+3SnJvnErlYE8mq6TQ4/0mM0hWIfOhyzaZbfV55MsmoM+rWnV4qV9c8HHlSRMH27gXxI2b/WaQrELmsWuT2bnmUlyDPunkLPPrIW0WrLnMu5O84NalqLgs908CWQXMq/1NEtvM2aSlOp62IDmT38SY7mWi51RCCu9K8DODZBUyG3SbzE3swepVkY0xDr7KoZusDiLzZZIlBakRxeZpd7zRjvIGkld2/k0jWYXMO//aZK5/+wNZd46G0nl2IcrLLdODlk1d4+BNI1mFzIPXJrPzycUYB/ykgKKpUjreyx2VXUpO1qdMCjh4fQZkBskqZI5h2mRuoIKzgguO0fdcddK5RGKFvhSPHqihjx/8mk83+4IgS1exxOjyb2pLTWE619GtL/6yDI59vBw7Wmp6Auj1EOS53wYlfP6VWe2qhHyIw5Arzb9iwGqtM8RbISXJTtLoi3g/TK5UTCNZhcyVijaZbXkN+Eks6IUg+B0Z6NHXNVEZhy+UO2dO1lEMCeLWpAWWoCgS0DL4ck/wDJJVyJyjt8ls6h/6TZ+nVzYrXgwjyqU5qk4sTwNZBfzoEuh5ietfgoPeK9GjQRwKhVYU6EvgK1EiLewbky51NOlQurZuonMHIKuAH3duVmJT5+DvXwTfX4Fd/e889hGPoe+/q2zbxVd1OEp84z9ipFQUGVk5wRYHIKuAH7PFrMS2OJAiXElZr4+ytE4n
/3UK4Jzgij6hOKnEgdNIViFzHNgms43NwN+q1NfivmwtDrrghJ5kYpeuERxRuxMBry2iv0QTfU8Hup+DvuEHm6TBTyuhF7bRnQj2qgVAhNDs49DvlxEhaPJA1karMyJfApJP8ZjybtZC0ZNAVgEzQzdJbHNw4G8BR198xL2BGfvSYHB2xr0vHTuib3oCwTNL5SU08ggxYk/mZ5L35+M60wySVcj8BG0yH+Fc0AsN4Ls70d8ajH5MCDqBopHOv9LXdQhxvZdn+j00cpVwGskqZK4StsnchHFQDHxyBxwXVlEaJrJfoHRRcqPGO6SnkaxC5v61ydzGmhY4OSOc8Wv3dOAXmqOHjNiBGXi1Cz1MQM0HmhQT+S4SdNZZ0VoP9CY09FARuhyLXuFEd839CB/GAb0FT/Sg162xz1V0pYRVbPScFlkt0M0Oe/RWajJNww6wnt3VYoOq1REXnwL08L6Nn4BT9+5Tr9CmoKvFCOsqnxw86LNW6C9Ovh6bXzJtBF/qRj/NpiX5omB5Oi9lvTOSHjZF814LqQ1RVYkhZpCsQmZKaZO5AZPFn99tEMZZKQV+yQZ5gyv6vmXoWxKwE7pOJo8gE9JVp9NmtKjI5Q6dlPBZI4VSLoz+dQbJKmQ+L94mcxPrSQibptccf16LyZ43QgC/DcxRRKE9Oc0UbTgutJRJ2QOJEWnfUuneJJBVwNy7JombKBKgbzrpFv2c3X4nh9fvBzk8i4M7/Evg/XD7/f0/f7y8f7F/Orx8z4SLmjAIH6taw9uxraTeaT/8ydhPjDKpnT0icvMU8KGM6guniPKlKRk89nF4TQ/0hn5iT1+zd+khn8SnJC+0I2N3ZlCKxy+9fMue3rHbH8UgSMhw94rF4bj7jf0yfHPz5ua/N///2+HX4e4Z++EuDYSIA0H/IdAb+slDQW2mlN6Zk8cjd7KT/tj70j70teAP7d8Ze0U9fT68G2pRwuT/abWL5znuh5+Hv4YyI0ntYuafdCspe0llDft7HIRZSFTST318oqVPDHWJuMcE78jq47A7HZciKRvIqiI/q0xnRMT6vg3E4ibki18o4/NSZLJX9TbpGSSrkNmc22Q29s8bSl2I5ZPL8ZbioLyRlBLC9JoaNXZvCsgqYOldi8TGzp1WRERyjSbtB6dn1spSej3y3zSSVcjSvSaZbdP/CZX9rMKzJoW/tAbPhFET75g4s1mVVbM4ysZwJ63N125KJZ2TXHA/asYkklXIohlNMh+uuTHz1pxy7mivlqgq3ddiJzT3iGQV8mPNnZfZ2L989RjPYWep28ZcyPFDfFD6N4lkFbL0r0nmUjrSbe/BtodO96QginQ5uTLlU+JhfJwfTv5epX+n9G4aySpk7l6bzIcbtTUne9yVH1/v9rFNH4HsFPixSc9KbOrcyY0JRFjmJDHi3JXCx5icTiJZhSzJaZPMR/hy0pf8K1HtmAFPuPIDkJ0CJzz5nMTFHA6s4WyICxdlOnQPfGr6XJXDe9H0uTNO86nI9ohjJ7gJKpmRt4GoGz12ibIMZdNkw1GUN8TkRpt8qY2mIbBm3Nw2iWOnuNK3BnkbsInOdB8z3Rq0HTsT894JI3hIF4ZRHMi9sybViLkSXAcTbPHcM0hWIXNdt01m2/id3p6SohfBbf41Lvp+CvFi/3kZG+AGdK+0fr91ndwKm5OAZ5vwrIo8eOguE5uo4M22FwrWGWOAeMAHeC9YTYvnS8ple3EoxHiDTpTFy7DowiEzSFYhy5mVJpk97F4w7LbCEXVzl/bxiRA4+ZK0AZELaQU3Sh92HU8BWQUsu45bJG7AUa29VoS+cQGzftq9yJLxyur5Ez7WX9WaK46Zr0HzcEcO1+nhjhly+RNj+8Ej6354g4kd164gNkRM7vAKVchKtjkP3Kd/hatRD/Ao4IyIPturD4BWEtRC0iRyBQKBhqCWzZcyZ3TCxDVn5JG7KovpVre41XVXgKCUm1P9juiTtGEE1CSh03gn+q58l9oWtfY6
TDefSwbvyIrtlOOUBqTrEHV5H116YZ5wUmqhtRkzi2kkq5C5e20yN1IEw115Rt7B0unmq9ENtEUvRanQK3TIxZpueNdpeCtzpZArhbCkc1VGjWb2wBEgcuIBzUbd5LrJPc7kkLUanhJ64Axizp9HkI2r+Oobl39TW+qdM1zHV6ArcZECMrrtIav3GuJvzK09GyKGz7904QHWiDlb6Hq+DaZDHmFc3exMskq/CV/3Ax8/6IUq9Lpfp5QPKAXfPSKcuV1jeWUbZIK8kQid7K6OyuCpAvoakm0QBrJB9lTr8XcqwlbiQYi2Pd4DuTbCazveKEuW7jmlSDy++z0SqyZxQox3604D2SkwX63bJHEbPAefRa8pcEOsuSH7se4LVlpsPsvcPaDygBntoNPDmqsSK3Ct2PSC7lfBq+fQcdPDDfsC73x4XGEZK4ZB99PoHghADTczeJDKB87i8CEEupfG1r+Lv47hcT4OjwbRzRh68LoP+TIfgpncb4O/Ycd2RUshGygGIFPA5jwUVgoLYmtQO0DWnzv0xRXYsVkf4grOp8DrRLcoFGs4o710RJ+C1SKgpgCaPtHzl07vl1JsdM2Ar72i19/WXXjZEDGc16ihd49wmK3dq7JfLPLbkGmeuUCxIpWDXE7s4/dFO3Kvx2rPbNfI846+nADt6tBJpxv1eYwaalm1W8wXumnYal0354R4zm6/k8Pr94McnsVRHP4l8H64/f7+nz9e3r/YPx1evmdK6R2lncKTDsjh7dhWUu9oHv5k7CemvdlJf4SU9inkQyn1V04h5VuTUnjs5xAn6A39xN6+Zu/Sgz6JT6qC3nlylzQPSvH4rZdv2dM7dvujGARJGe5esTgmd7+xX4Zvbv68+e/NPzf33w6/DnfP2A93aTxEHA/6D8He0E8eEWozt5ND/HnLhHE05anx59gQJuoEtePfHhq/M/aKOvh8eDeMX4+f0RMLrYibhr/vh5+Hv4YyDR/rXbliIpeQ6KknNPcIiar5SQkHlXhiVDwhTaZkNKXy0hPOOop3eFEP9VkFOiOCniJdrSHpYSJrWqGNkt5FjjLaB6e0E2NRYxrJKmQmkzaZbf0LzlqpVVrDDkLYQKGaSNU7iil1MMEeuG4SySpkIbsmmY39C9IKL6VOKmGkl8anKDKIwJWgGPOwaD+JZBWy9K9JZpsGfFrtP63yH7HxpMpfWolnQqeJ676XtJtgPRdScRXVJigruNBpycDThAmlXDiECJNIViHHlYcWmY+wa1InUvfAo3c33OtgnZJTZn0AslPghFXPSWzrnOPSe2V41GlFGkaZobHxqSkIEkpaPd5KMYNkFbJUeJtkNvWPZoEez9lkc4QwnMK1tAREZMGVlGEMsGaQrELmazPaZK59cjvjPIZxUgCuObciVSOCdCI4qYTPRCCCDvJAJ5NAdgocQ/oGiY36FjTXZFIpXaEkxhhHQXny7pxzZ+nzUeEmkaxCFo1rkvkIeyCA8Mam+2mi37TOy8NB/Gkkq5ATFjErs7F/OoaAMkRZ6f+cXHhsKO605FKOdDcNZKfA0rsWiZsgY0cUoI2jb0Xi9ELkSl0kTi6JOh09cO7fDJJVyHzdXJvMRssVgh7IJ90gFSbFoIA3VUhoGrglLjuY7iSSVcixNtki87r4uJFtVxCTL5K3dUr+Ekr24aQCF+VIHtIlZiL9A6TkxWingawC5gCvSWJb51TWNRWfTchgtdR547aQniJHzVVJymeQrELm7rXJfPjkOhnJnlOen0LHtJfCHKq300hWIT+e3HmZnfJmKA+5THLhGUmZNafcy+nU+UhqyqbYhAzCOePj4sqYrE8AWQUcc/UGiRuwpijLUYjkg4iPZ2LsRuFmupSK8gtKYd1hHWkayE6BpXctEjcRvXpD8UFi3uQJjCZVsWkXmqBnlYpiueIHZ5CsQmYqb5PZo9cFiANavUA83XkLr+CBJbZBrSBxIDlWSMrOo0FoI7hx0qcLZbxQQnvnx+RtGskq
ZO5fm8zG8TOacj9uoyyd46OUFCplg3ZKHnz1FI6d4srYNcjbQJADXd/EHbnLe0CAYkq7g0Fwh58ezbiXRuZ9kF6bmGOFVIV0nkjJOn6IJ6aRrEKW0WySecbSlFXEYiJOrg7WexNJNjKaEyoYS+51pMgpIKuAhSNbJG6gFA9NkriLjivIoa/cw5x5bwZyMQxaGVaQH2DTEHKpDDr2QM6s8Jeg17QZ7TKbTNuXUuBDie7dvsC7Iecm0OF/kJ4Mi+vEB/ngnBRJn7XWPDL/aMPTSFYhxyM7LTK3wYHwdeFOe6te4b4wd6yjRgcwUCv2neCZIbqHgs5ccc0CgDK660E7LYHMkwArZl93jacTCsoG0TNbHbhiX9pN9OEBUfqVmdb5OLlrKIjuXKH+XScCapKgw7LOT0Bqu0XVh69fga8QQCfd0JN7VdTRj4M88Alc0IrTPxr/Acu9FZKnjQxkm8oJLsThCSaRrEKOu4ZaZLb1D/oI1IezhXaDWbf8LdaawO8PwHaGK6A86PG7Kko5L+lABxMcfIV7EyOMu0bfzf48Zg816Rz8SAG0OW+CUrvNX6yyAH1SFH23M/iWqW54D0GQramdCeQa0m9qS01JDKmvIqUVFynmI1d9OOxhtk0Y71ZM8/Mv+9nO4hZ0pMjhF980V+VdOBG84+XtoWPwP/9x6snxY/bRt9uGxwgeDNFMVAXpjfFSW58K9tyJ/GEZnmkkq5B5eNpkPly94E4vXR9loTuRFUTQqyJ0tKIJ5dgmvsUsjr2izIoeTdj0bkcZ8t3H4+ROI1mFLHTVJHPV0SFphVNkS8a7+JTcl+PMaRmKTN5wZU2Z1xkkq5D5qoM2mZ1IZ4i079h79DItPMvD38mCvlKLXKQT1jmunYy6Ict1LyI2hNU0CWSFI89PI1mFzN1rk4nPpejcAeJqthRvI+8e7dWJL7s8BDWexX+VH7KDBfARsKyxLgex/iLah4wKuObTg/kv45oN+GnkXP30QUmhvCPmSPV28qieS+XVlPkckaxCfjx48zI3UPE8C90vSebQ2RW6Z2/oPvK+MY5+0SXs3KPTEHTJay0V2WuiWfiICTxmRud5aCpFr0lCJ0R8BbtusMcP4T3KGyrgg3s6gMCie5JrfUcy+gIN9ORCW8YG6/y4VcpNRJXwjrKbG8i5qkWNGjn8uapZRUFcTZkLXrv6AAKZxRkNp2vGxjRjZdrVERefAvTIDDodXEOyBzt4V0Xd11EeQY8WOH6hFn09HHrfDvTgbYjvrqYuxGGPjjSOLe6qiuGKe6Vl2gXDpQqBOyejYXGrSJYemWQGyE6BqXNtEq/LUrstz/oHwAMa6FfZQyc73a6/ml1Dx1XoerkC3sGNE67KZM9r1PCbN5HPOay8UNPNKiGes9vv5PD6/SCHZ3EQh38JvB9uv7//54+X9y/2T4eX75nbySH+vE1/EiY++/AnYz8xYRzNQvmwNE4/P37zBHn4pCAnv8ljf4Y4EW/oJ/bqNXuXHuhJogERKd9Ymm76b/zWy7fs6R27/VEMQu/0cPcqXqQ43P3Gfhm+ufnPzV83r2/ub/7+dvh1uHvGfrhLTy7ik9N/CPiGfvKzU5sJR2PhS9eVpH+s9Dz/+dDRgju0f2fsFXXz+fBuOBERPybHJLTaOdKP++Hn4a+hjPsHfixkjaLINO8DSldCSpdV8s08eqjQH/hHNo88UdYnRu20iDhyT36I9wfb+MIbL2TRGP1ZnVoMEcczcC2tMyGZorJau7SLO16MqS2Xh6t5JoGsAuYBaZK4VPclTTeX+ZRWIA2geQg6jT7FKdo7w0fum0ayCpkfoE3m2oe3Uck/DhfnjOfjcLHBHL6ufrcjPn+l/oJG531xaanW4MmfERXkVYc4WjrwcrfgDJCdAvNpjCaJbZ0zKrggVTp44k18FwZF2TIbhuBOWlvioRkkq5C5e20ym/pn4o0zxpAhxhxEaaW9ks6mQCtLUrYU9KaRrELmil6bzKb+0V+UF4vE
r+eVBZ7yuXxexpXeTePYKS7fGtkir6lnlP2adGdPNEdLGuMptFMpdUqX9cixEjoDZKfA1Lc2iW1kRBGypsQ83eFG8X25QiE2SKIhopPjgt40klXIkiY0yexk+XWoMA6zt8oHb/KQHzkg0KMZoSkQGuuMU0BWAUuZsUXiJmIH55U3khx8/HpMH9NZl2Rc3nKtvAyl2jSDZBUyk0mbzPXTiSk8yeO3SahTXPtUIXBOCUGxkiiTO4NkFTJ7iTaZbV6WEmotdD7RSfxuPSdPkxRFBk/i5FgVmUGyCpm9bJvMpULvToiPIER0L6dsLvCorDVCkM6HFEdabhU3lHmU/k0jWYXM/WuTuYngFNus8X0KbvCMPrOdjj+gY+pk2QCTtmbQQ0ijSEtzzdmZHJuVIZlGsgqZh6RN5jacAXTxCnzwrDVOq6B1rCLQn5129HiJFSktcPFIewlAZ5CsQub4uE1mW4Ccl251fvupCUoqpRIRObJqr4zWZXZnkKxC5gC5TebaPQF4DILtCfAtA3v8Hu4FoWsw6PkGH3eKJ1U6iqLHD1YENV4RMA1kFbBcmt4isW3wuMnXDcRH44LCR/JceZnceGuD0+OrpyaBrALmoWuSuIlMA71ABF1dQzfbznlfMnjd/51PF6DHmvc1ki/0Mui5glfCK5VmIl7oyMn55z1/gueEvDiZSRyrcOVWyAZ5a8//kHdvoBssukHA1wwBHMbVbBBAD9SQ8z/09RJ4IkKeXGizRa8vaqGssdKng5JEtU5YmQ5KWgoASKyWZWangawCpt61SdwG5116Tao578QNraGtF10B0f0aegALnT0B2yz4tGrrXDwbmA4JaikoFBIqFZfJ9CzptFYlnJpBsgqZ3VqbzLagQDhn41Ud8fHiwFPCrxIFOK+0IZLypVA0g2QVMgcFbTJXk9ctVnRdvxMCtzZoEoWeWWCGB9/QtA6GOg/iangQes774H3Z4F06lV7v4KEhznp6uRvdZvRmc9rZEX2SNoyAmqTuCbongNZO6BVU9GWOvjz+BZPbqedCqQn67jfwxUPourITh2980BjFju8WnkGyCpl3hbbJbOtfPvuhTHpPsdVeKKNUem6ngxPC6lLAnUGyCpn71yazs1Iv4K5sFzX0+Sb0YwV9rWg9VHNeMkJXjB6gbfr+Hvi0sRPTZYjJ4x6k34Reg/M+dGLZSQG2oAO8XQs8kkHfwtgLuJthBCIBtaOxd/k3teNbLQzXgvJuJa5rYQeXsdBvfAMOEVdAp1fDWJ9/NVl7dAGe7UMTHfoxW/QLvtBLObhLB53rHsF12A5sC2yNHl+hr09Drwh2zvmAc5rrg8Ms8qvWHJsmFfkEDLq54sYH8EQMO3LYJg1usOtwEe30Dj3W4NunsaNt8PU49Mwd3b9ATy42S/bBW/d5/Hbvhj6S4A4EffkbejEJfTcRdiUGXPPQLRd93QG9zIvgOZZcF4BO89DrEcgLtPDn5XDLYOguBtws0C/blEddJfCONFgEy+n5c6fmP049OX7MPvr2OlLM9izp0vncaj0X+u4YdArBjgK3MMLoeYD2mkvymiHK0kYrRfFx9u0UFlkjpC/2PYNkFTLfKN0mcxN53sUrIBvZSADuaKBTEPhAGnlmLx4mrpn94BeI0S0DoMq12GaGLYRruOeK0EcOW5VXQKXQ43dexNUU3lc+5xy8dLXy4V2p0XVEn4LVIvoUXBxxximAdmDrd3E9a1lTiHBeu0NXDPB0EXdbB24JBbvUezVMtGR2j+u9V3CABXrxn2OvuWFv3r4aLjkv24CvdqEfeEAnRPT5hfZ2nXO+Cueg7x+DjqmtVoY0WaSlGktmKLjncb8xzQdXFJe4woczQHYKzE66SWLfm7Axgz6ryUOHluh8hO7Dofmym3wL4jm7/U4Or98PcngWh3z4l8D74fb7+3/+eHn/Yv90ePmeCRcnY3A7ObwdG8THpDHDn4z9xJQUO1s+zn8+/bT69hF6+lmBT32bx34Nr+kh
3tBP7N1r9i492JP4ZFIJ0mJKymn043devmVP79jtj2IQeqeHu1fxwM1w9xv7Zfjm5tXN/c1/bv7+dvh1uHvGfrhLTy/i09N/CPaGfvLzU5sppXdmfCztzU760vHSOHSzIA/t3xl7RZ18PrwbToTEjymYJ6e2iwXI++Hn4a+hjH78YlSeUg08qFk24GITUVvjMMyBhwpc9ncU4EcGcSr1oCdPjNx5HT9TQrtB2p12ktSG8g6XlcZ8Vq2gEDFwcIXw8nBoUhbpStFEWcHHi2FmgOwUmMe0SWJb5xrmsWzo+bx21BP+SYkw89OO+Pw7JM6qVkZ4cpuOPoxuPqSiPgXNMW8/Rqo5/Z9GsgqZ0/82mQs9AUUipCc8pHv9lJKG0oHUIH1WTnNvy5rQNJBVwNT/NolrN4x4btkEbrRM31YhUKoU0vWIXkpP+UoYKWUayCpgDnuaJK595FZKB+A+hIzssGuT+IP8tlUpJLUil8DceGp0Esgq4FgXaJB4DTQEbOmdCy7DBcAqAW1MK6cy9IiPvm6M184Znr1IEZWqUPKwyT8Xs6aRrELmYlabzE5GDyYj8KACmuWkNlIo5dKRfGu5NMREPq6ZKrImYzh9ZSjXskwiWYUsN7Q0ydyEJQLT7Mqd3NWw3LXwILrTB+bpZYwdnUuhtRdYO66GKRdLTrstbLVM1md2qyzXQ4yv7ETgDefSTm4bYfjFh3G1+Siw29tCqQdaMTtzfwFzo7tlaL6+uOaB7RbillM/neXpEtnY86im8VmkMPQwXvsyIDNIViHTiDTKXLsyLTMB3Uc/NjUFVg1o9/sV6j/QU3Fxyl/pyK0/VtqIoa6sHHwuz7ly0+mKt1XVXLnydtW8ZtXcIqJP0goQK5ukTuFbofCVqSZ60gVtGC3dx64XdLs+u01CazT+ehD6Sik6nWIX4TsfXYCPoF0Utj2tnu67xZ3fJsEtDlpfod0XNleB7w7oTNSZaEvLXvh0ADx460OAVfW6XXfVHBGkjWpngnT5N7WlJiLhWigplbiq+hZ9RdryK75EQqSLV6MypC594uPYEfapbzeNDfoOWWDewFasa6Gsz9+ZvZmVJ2hbAB+8S/Ms+PBg1zE6lT2MylrGFHtdreUJoKtH6x/ebnQPjR+gw0FshcR3kNCTC02F0CPXWU6vJ8cAnqweQT9+5KAJAt0vdYNdEbsCa/oySQM0CaKHwNCmDD2zm7erixPpgmlyV/Nt7ufaAgUDEwn69IPbjjDWKsmtTu94t8ZZx4WMiznSW+5DsKa8x3kGySpk6l6jzKtjeGBFWL2DuriefKpz6IehoOkd3TdCF5mBbfbhVQJkE4c2oR4hbTOJxvcswPyDrna4bHddFWDoqYC2sNWPbdfKyy8iLUYy6/cIuNq4eS5AQ5xxT2xH9CkARfQpuDhiZVMA7UWhg/nuni+h8+D6ir56iL52gF6Y7SZ/CZNHtypgL7UQ53a1X7y0Au3HgDUa30n0olkv4eLwTDflfkvq1dvTWS0Omv/RY2lge+oWtwjiObv9Tg6v3w9yeBb1efiXwPvh9vv7f/54ef9i/3R4+Z4ppUmTBreTw9uxIUx86OFPxn5i2htSrvJ5aZx+Xn3/BHz6YcFPfp/Hvg2v6UHe0E/s4Wv2Lj3ck/h0yhOUfIuQhr5m49devmVP79jtj2IQeqeHu1fxMqHh7jf2y/DNzfub/3fz983/pZ8/6M/33w6/DnfP2A93NBjsfwEbaNgjCmVuZHN0cmVhbQplbmRvYmoKMTIgMCBvYmoKMTEzOTAKZW5kb2JqCjEwIDAgb2JqClsgXQplbmRvYmoKMjMgMCBvYmoKPDwgL0xlbmd0aDEgODM5MiAvTGVuZ3RoIDU3MDAgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnic1Vl7fBXVnf+d+c3Mnfueubk3D/K6SciTR+INAQJBLgjhKQYISFBsLnkQXkkgYMWAoVoSo1hSgSCUYqqovEoDIiQQ
KNRUdBGtVdx2tesrlrINSHfxFcPJ/mbuDYLd7qfb/WM/eydnzmPO+T2+v9/5nd9MgAGAi24ieCdNmJgPg+gCNphGYyYV3DU7+cCQudSfTOXhSbPnjN86b896AOF1en7tznGFk81js38IgKep/8VdszN9S0ZVUV9so/7ckuWBaiHJshdAUqnfVXL/Ki8sjs2l/jWix8urFy1fMez+JQAm6sP+RYGaajDRBUoL9W2Llq0pT9satpH6BwCcxRVlgVJlwa/qASJt9Hx4BQ3YnzZtoP4E6g+sWL7qgVJZOUz9CuovXFZVEkgZlPEE9XV5hy0PPFAt1ssrAKKoC97KwPKy1MtjOqjvJXneqa6qWXXZ9tFSgAG6PMuqV5ZVjzb9hZrRxSRzBehY6Zz1n0AXQhiNuSGS2vozCwyFPBAm5E8vBMeywKpKeiLqk/v6AG609JlsadnKSlD0llFEoqDXCtUfGzOvMTch4YR/+Nf3/j++tp9C39v9dPT2/56i8WOkl0zFBnbSz2VgaKI2uzFDRzaEF0g0V8dHpbs+Lhoj5DKElRl0zK0GJQdAyAa74AC4DRs8GFgZWAhNgZXLK6Fp4crAYmgqCVTW0L2ibCXd16xcBk2LyqqovWhl2VJoqghU0pyKsoU0sjRQGYCmZYEqr34nW/5weWBVBTRVLtVHqhYFlkPTytWVNHNVeeUiulfo9P+GvQ29ly1eFLjF5iIEbc5guFFLpIkbYiAR0shzkHS0QoRRmyGctLeCh+5mmiPQGDNwC+r8IfwJCiDPBqY/GAA+TOyLCbwLQdZ6++Zff59VhNpz/g6zEcLC/X+HfecH5+r1jfZNNG4Zn39Tf+W3bWEkwDebqJ13Y2kaaWkJEgneRQdrIsRAypa2UzcuWOM/Q7ngIgpWGVERBUG8sSL0KyifWAp+CMAa2c3dbIdpOfvkljkYKjEABsdV1GNGX4R5VLvIF5EsEIBFsASWQTXUwP2wxrB5AEphMY1VwkpYrY/1Pdv3s76Wvqf7dvX9tG9n345bZQn9gjGlOtRTDDrBoktfSmUR6DsBiDYQTzA8f1moWKlUhtbrlHQca6josWM1Fd1mGpU1oRLGcqANztF1BvbBTvY89cppfAWNtAiHYQOtaoOX2TnWKAyhsefhKrxNMxvgHO4TgU2FbBoF+L0kUJQqhCNEI5e5Wa5JJpeeIR4RZ4lt4kXxPIwQa8TzYrFYw7LxGWmu9DyVXPw12ec1iIc29gHJeRwvYTZ2iBNEB3yA53EffEpcdL3PwSbYDbUki5tVQZ1QK8yikbPSedhBVxU9P892sbdJuuPsEbgAT6EoTIZd7ALpdQ6+gEewUKgjU2YL5ST/WaJ1ntbvgBoKIReYBbgwiMZIeuK10LjH4hDpgnFdhTriXAi75TbZbUoiLjpiz7OXWbe8GVrgbbwXV+B7bIOYJO4RJ8OmIAJYDJuI9g59jVzO1pDu+lWrUxe+LxazfXBJLDYtJNq/1jUinkeEWaRROXRQ+b6skk6j2QZsJEn1p7Fw3jRVzKT1RMG0jrQGqMIc8oEqen4QDsMQbIZNRMnQVx4hfUErd4ofkc6b2BPCF3AeJ0A6lItXdJ9wAzQDHDPJkogCg8FetVVInlLa6p85z/tqUcKQwd/pelWTtxUKWu1rvG19fQXzxGipqFWKacVkpVVMTvrobz38aMjgaQXzvK3XJ04IUZ1YPIHGZs+jpt6jYRqfOMF4pjNtlZLpb0pxq7ekwvuY+ljSqMfUslFD9P0g6Kci7S59z9X2fSIOITwtkAwd/tSoeGuE2QF7I+R2h+atjz8e057Upm2MsEEERtrNijUeFffEFLW3+/V3un0+LTf3Nsjs7LrWe61bfeWKekXL1XJduVn+yqzYrLis+CxvVkJW4thUf6w/zh/v9/oT/IkFsQVxBfEF3oKEgsSC1OrUDbENcQ3xDd6GhA2JTaktqVdT4/qX9i/qX1AcVxxf7C1OqI6rjq/2Viesj1sfv967PiFyAVvA
EmWPOzzbN3wMG6El5ThYUmJKzrDh2Qk5w1KSEmVTzu0s2xcunPrgwA+qtre3tY3tePTAuevfMOGFbcVHC8tOzf+Pq0J2ee3Cmt8fSZ9+/Qf7ygNnnjl52lX3+NCh+1JTe3XUVhBW82U3RasYGOmPGtAODne7pGx0tLFtGCGCIkzSXNaJsYTONQKHkOnuutbZrXZeyTpaHLc+riUOSU4tOySeoKlAIjFD6qCU+Exb26hDa8/1Qd+5tYeun33hySf37HnyyRfwqHDf1917SgNsAlPomhDgnnMXL56jEpKrjmzohmio9g8EDzPXK49Knr1MarexE5Htrjbbxphoj6B4FJgmuJwTYwwROzUXmU/tutbdpZLt1GtXNN126WNjq2NbYn8TezVWGgtj2VhhrGdstDTYlKlkmgdbqqCKVQlVnqpo84IVpI8nIY6QHT7CQzp5wdAJTEOZjrlY13vYdv7YkrMLS36zlF/jZ1l678fM1CY89+iOdodw3/xTZ4cNO5gxmI1kFhbG7uB/6Nx25OAu/ezIJMC/IqzDoMgfI6nMpuyVWQNsc8gdFiGMclqzpNid1ulutXdaq6VwXjtFbf/IommtDqPdd3pkUV5nb15np8tw0S5fb7d6xUedLHbU7ynwtHiQRCchY1mCJ0FLIptk6+4ifNVacifL5G+1t7YePCm7txdUlGzqzcS3Ns04sV/Hms8V5xPWVjo/p/qTomyxZld9WHi7E9tTktpSO8ztzpMDYlOiQLFNkl0u78R0tbez3x06u4IOwS/oSOeSV2Ssz2jJ0L3CMH/QgSNUISExJTXHgHUMC7mKi1wlIofC/nNbtzz33Jatz7Vx3hM4MHPmrlkvHck9vPaN3t431h7ObRPGvPr++6+eff/9P/OP+aXYuBcHZ5z85T0lC9kohkxkoxaW7NN9+QyBTOe1kQsN8jvkU+Ih6BAkpoiQr6i9ed26vF293Vl+q2r2mwvMxeZqs8QWhJH3atmepDNt9BOLv2mR3ZeIXt97fK5Bz0qn5AR/jFUwgeOUzdQgnYQO2yFVUSX5LjtTbJCvGtS7cg3fIzzILgYYxEjzawVasVatBRm55UEsR98bQYbPvpR/2+LpBteN757eGdgup10iT/lWk8RjsE1gCuSL5N767svy21XJLxVIxVK1dFWSg+KT6LL7627dy45TQlBK1gyD4f4oNAM6mNzg0NpsHRYmKDBDj3j55GLdvms6wTzdeporIjfrSLHnTY+g7+YkLWg6AxYjxoilbWvXbj3Q3j7+xdVnXhF2X79X2PX0rlO7rzeIxQfLSj8L7djVhhdFkBeFye0uaLe1uTZGml3OmejyTIw0Nmhwc6pXsvxJY6NqoVauM9UpdeY6S5211lZnr3PUOevUOq3W1RJ1NUq7KQZSNEn16XsxKVG/CzVbDuzfuvnAgc1XmYtfufoX/hnT8IOLr7128U+vnr20k7/Ku/ll2p65tAvdbCRJeJz8fDdJqMe62/3R/bGuzbGRncSOWIpzk4yIl69HO58vKGtXf7jzm4Px7sM4kS1IvgENySJQWL7Z2VlNe/uoQ7WvU17/eu0hYSRFvBf0suf6QdmyrzTAO/hXdHUE2J/7A17QbjiVpNMgy++WrSbQrNjgaDN3mCyyAkq+S990hgdTlHvndT2sHSkIezpMt5gRmm4yVwROjZ8yeOcLJMfxDWFDY/CISzt36vphMlZ5iSQRt6q+T/AscUuFi/48u01wWGfHxylmwWSZHR8fN95ijYsXPVDPGkV3vacxsl0T25Pp0EyLs1jjo00wK1pxmBR34sQ0Xap3urt0d88NRV6Vf35F/fyK7lN0mJtUx2UtItdk3IsSD0MqYwv8y2MsMdYY21AKvoOtg22jzaMto62jbVYveNlAIc2SZs0Iy3RnejLC0+LS4tO96QkDU+st9dZ6W73dpWfigiBbZCva0I4OdKKKUTgAozFGjDWnZqaPTf9eel36+vSm9Jb0q+mRC4DCuieIkjs8nsUx
j1tOuhGQyLUyCUM9VvrC8fEZe+Y3Ni7cMrbzuS9/N//lZeWvBB7eWLbfv/+pD98oPyKOPZiWVljon5LgyNjeuPNoUtKpnJyimdMKkp0Dtz6860CcHokO0l6YZ+xBN4z2x3y7CzdaWIe7zUZ70G2dQbsx36NvitygXbt8N7Zilee0vhXD6GANOv+NEzaFHdS34s/b2u44tPrMq+xNdlx4/nrg6adP7RZqv2k5UF5yFffo/jSG4kAdvbfJ8I0/FTVREgWNCZJeoSCDzCjdl8cLCL+UZIkSPUkEk/pO6OCB4MHjLpzW6im8Rx/QT5/OCP3c6faF7CpdNqlKqEiXixKZ/0eThSWUgtcJ9cJ64cfCbkHRGZnRTL7kYQNwgJgCKSwd00WvkgM5bBSOErOUfMhnU3CKmC9Nlv3KXJjLirBILFDKoZwtxsXiIqlCLlZWwypWi7XiaulBeQNsYI3YKDZK9XIzNLNtwg58SnxK2ibvkV6QW5XTygdKn3I7mT0s28yyWdKYl9l97L6X+b09YnFvIR74pkVHiOKBjpCTPe6/w6QIZg2cFs1Kr3NOh+YEp12z2UGvHHaL1WLTrFbLeLvVrIJVasCTDmuH6rDbLGYZQXGKTqsaRG9aq2IgZu2HsP/s7uzUIowdQpEllFP+FzAatXQ5wqfjeVUGSZHNaA+3RNhVe5I9xz7Fcpdlhn2+eb5liaXBvt6+2e6id16zbJVsVofVGcE8giqqUoTFbXXbBjgGOFNhIO0or+iV0pU0c7JloHWgLdWe4chwerURZIMcIUvMkkZahluH20bacx25zixtHPiZX/CjX/RLftlv8ivjzRMtk+xTHFOcfq0QZrKZwhwsEAukufIc01zlbvPdljnWObYiR5GzQCtn5UKFZbFjsbNYq1UecDzgbITHzBusG2yN9kZHo3O7eat1q22HY4dzt3W3bb9jv7NVe1P7QOvTyshikoMFj8ixjOnGEzbP2LJ287LphdkJfHTQjBWvPrhjcn2hOKN3Cy4LZVbSPjovB8Jkf1iKkUjZEiLtcYpmS1Dd05PpeO706amTmqfnT523gV8z27W9LmFAA0Ruk+NdHVZnZt4ffT6ed8VHSZUv65ZE6ttkarQ+btIf6NFC2tefWXGTkVwdbC1JTWFf35Jl9Wda29PSKkqCGRfJK04leT0QBbP8WnQ+RCjhTreoKBhukacP+FZenkenkN+l4F5QGxyRp8IPObaZoUNiurRXuJ7iXvFRntEX3RLdFL0+Wo2Wgif4d0QmiRllHOLUoKQ/f7Fdl/zr9nY93+mX8dgvdKHZ4UshTEMyjvA7I/Ipi7LYFEUVXY7p4bp8QfF06Shn3WsWKZHVzB12QReMG1IxA7rvpqPCVAHa+bFvM1I9cZFqv5OTEne5l7inwxCY74/MzI8YpGSo0R5lQIYZ4mVlYJw5MWX60G+B6vTp914Drojo+KS9AzXKrYecyjikwrZw08COqNiEzLy8Lp9P339qt4/+glYOWXPE8BE3oOq3+U1JtETo6Ym0bt67Y1JnPEKmvlM4rMMZsj0SmmT2oJXnxISl6mD2g9uvmtCPraFdCszwh6fl2xU1PNKtqGaIkZWEaHN80vTUmzQzFDPcIDLGuzdBExpsKds8poQO54C4oErX8v5aH3pNvBX+77wRBP30ZluE9Lihw/6b7XLDNvq3KaFoe8CXIH7Pmfc5xCvGB6m3HnJ09ddfvts73VFkfj/0nQr6v8iZlvNYAAf/8t2emY6iv/rKFSmeN74xgaCn84+T/12EWskNK8RuWCG8BZl6W8iFM0Ju33t6LbnguPgprKDx4ziV2oOgisYOih0wRi/SI7TGEiymDkI98xZ+zfAxm86eYr3CQCFAh9YJyibuwSfFceIjklv6Wp4pPyt/bJppajUkjcJCGAQVYCPvVGG7rpnoEcKp1r9DmWC+/gVQNJOiWca3Qb3NIJx6wbYACssPtfGmcfGmtgSRbEaoLYOblcMdUAXVsAZWwmJY
RNxXgZfe0kpoX3jBB1l0ZVNrIc3wwniaswpqqKyEMgjAchhMo1OgkuYPpdY4WEaXF2bdoFVj9MqoLqM199O9lGZa/g6uw29wLSRO9xMv/WtTJc3W5QjQmv8ZxwnUWkLr5sJqmlFCcwMGtTJjRcDQyEtUKuleTXMWEt3FNM9L66uIe8B49l06sw0qNSQRveHDUhrVudbQ3CqDko94Z0POLav61whBJ+l7yPje/Ne/KMOn9f82TIBJMJlwngrT4U64C2aShrNhDlGbB0VwLxPgNJzR31RNqysX54/PygrV2aF6WJuw3t/3DcceN36djF/58Mtm/MKBn3O8xvE/kvHfHfiXZryajJ89Nk76jOOVZrzcjN09+Oce/DeOl0bhn8bjRY5/9OGnXbOlT5uxiyZ2zcZPPs6UPunBjzPxI44fcvzAh//qxj804/sc33Phv6zD35/A33F8l6a/uw4vvDNJurAO35mEb/82Wnqb42+j8S2Ov+H4Jsc3OJ5vxtfPxUmvczwXh//kw9c4vrJBk16JwV+HYyfHlzn+iuMZjqc5/pLjKY4nOXZwPMHxuIbt9clSO8e2YyekNo7Hji6Qjp3AY+vFoy8lS0cX+PvwqF98KRmPcHyxGQ9zPMSxleMvOB4sxZ878MD+ZOlAKe7f55L2J+M+F+4loff24B6OL3B8nuNzLtzN8dlnHNKzPnzGgT8rxRaa0tKMT3Pc9VObtIvjT2248ydR0s5S/MkOVfpJFO5QcbsFn+K4rdkubePYbMettGhrM27Z7JC2pOFmBz7Zgz9uOiH9mGPTpgVS0wlsWi9u+lGytGkBbvKLP0rGJzhufHyotJHj40PxMVLzsXHY+KhVanTjo/QSRwMNpVhPSNUn4wYNf8jxkYc16RGOD2v4A47rOdZx9Pc9tG6d9BDHdetwbSnWFnqk2mR8kOMajg848Ps2vN+Cqzmu6sGaHlzZgyt6sJpjFcdKjssScCnHJdp4aclsXMyxYh0uok45xzKOpRxLOC7kGBiFxT14nw0XcLyH43yORfMsUlEPzrPg3eFR0t0+nMtxDnGeMx4LPTibqdLsSJzlxplTw6SZHAuseBfHGXeq0gyOd6o4neM0ejKN49QpqjQ1DKfE2qUpKk624ySO+c04sRkncLxDGCLd0YPjT+C4aejnOJbj7WNc0u1uHJPnlMa4MG+0Xcrz9zlxtB1HcczlOHKEWxrZgyOGq9IINw7PsUrDVcyx4rA4zLaj7zar5ON4mxWzMq1Slh0zrTh0iFkaquIQMw724aCMZGlQKWaku6SMZEx3YVpqspQ2DlOTMSXZKqU4MdmKAzkmcUx0YgLpmeBCbynG92AcqRBXirF2jCEEYzhG9+CA8RhFnSiOkaUYQUhFcAynReFR6OHo5hjG0UUTXBw10lUbj+o6dJaig6PdFi7ZOdpoti0crRwtKpo5KjRN4Whyo1yKIj0UyQM8SKPI6WVMlYQhyFQEjqyNlW54gg36//CD/2sB/ttf7H8CcILv1gplbmRzdHJlYW0KZW5kb2JqCjIyIDAgb2JqCjw8IC9MZW5ndGggNDggL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnicY2AY8oAFTLIysDGwM3AAWZwMXECSG4h5GHgZ+IA0P4MAkBRkEAKrFAYAD1AAuQplbmRzdHJlYW0KZW5kb2JqCjI1IDAgb2JqCjw8IC9MZW5ndGggMjkxIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nF1SS2uEMBC+51fMcXtYXG2rLYhQthcPfVDb07IHTUYJ1BhiPPjvm2SiCw2Y4XtMGL8kOdevtZIWkk8z8QYt9FIJg/O0GI7Q4SAVSzMQktuIws7HVrPENTfrbHGsVT+xsoTky4mzNSscXsTU4R0DgOTDCDRSDXD4OTdENYvWvziisnBiVQUCe3fcW6vf2xEhCc3HWjhd2vXo2m6O71UjZAGnNBKfBM665WhaNSAr
T25VUPZuVQyV+Kc/UVfX7/Y8dXYqF6rXQD8QXUQ6wkcq+SaS95lgG707JJUT5FHlkcZAF6dIE8z7jQ2mIiN4T6YbJJWmKfKo7pBUGsOXC9WrD2T7dZ+Nv8g9eL4Y4zIPtx3C9jFLhfuD0JP2Xf77A92iocQKZW5kc3RyZWFtCmVuZG9iagoyMCAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvQ0lERm9udFR5cGUyIC9CYXNlRm9udCAvQk1RUURWK0RlamFWdVNhbnMKL0NJRFN5c3RlbUluZm8gPDwgL1JlZ2lzdHJ5IChBZG9iZSkgL09yZGVyaW5nIChJZGVudGl0eSkgL1N1cHBsZW1lbnQgMCA+PgovRm9udERlc2NyaXB0b3IgMTkgMCBSIC9XIDI0IDAgUiAvQ0lEVG9HSURNYXAgMjIgMCBSID4+CmVuZG9iagoyMSAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvVHlwZTAgL0Jhc2VGb250IC9CTVFRRFYrRGVqYVZ1U2FucwovRW5jb2RpbmcgL0lkZW50aXR5LUggL0Rlc2NlbmRhbnRGb250cyBbIDIwIDAgUiBdIC9Ub1VuaWNvZGUgMjUgMCBSID4+CmVuZG9iagoxOSAwIG9iago8PCAvVHlwZSAvRm9udERlc2NyaXB0b3IgL0ZvbnROYW1lIC9CTVFRRFYrRGVqYVZ1U2FucyAvRmxhZ3MgMzIKL0ZvbnRCQm94IFsgLTEwMjEgLTQ2MyAxNzk0IDEyMzMgXSAvQXNjZW50IDkyOSAvRGVzY2VudCAtMjM2IC9DYXBIZWlnaHQgMAovWEhlaWdodCAwIC9JdGFsaWNBbmdsZSAwIC9TdGVtViAwIC9Gb250RmlsZTIgMjMgMCBSIC9NYXhXaWR0aCA2MzUgPj4KZW5kb2JqCjI0IDAgb2JqClsgOTcgWyA2MTMgXSAxMDAgWyA2MzUgNjE1IDM1MiA2MzUgXSAxMDUgWyAyNzggMjc4IF0gMTA4IFsgMjc4IF0gMTEwClsgNjM0IDYxMiA2MzUgXSAxMTQgWyA0MTEgNTIxIF0gMTE3IFsgNjM0IDU5MiBdIDEyMSBbIDU5MiBdIF0KZW5kb2JqCjMgMCBvYmoKPDwgL0YxIDIxIDAgUiA+PgplbmRvYmoKNCAwIG9iago8PCAvQTEgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMCAvY2EgMSA+PgovQTIgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMSAvY2EgMSA+PiA+PgplbmRvYmoKNSAwIG9iago8PCA+PgplbmRvYmoKNiAwIG9iago8PCA+PgplbmRvYmoKNyAwIG9iago8PCAvUDAgMTMgMCBSIC9QMSAxNCAwIFIgL1AyIDE1IDAgUiAvUDMgMTYgMCBSIC9QNCAxNyAwIFIgL1A1IDE4IDAgUiA+PgplbmRvYmoKMTMgMCBvYmoKPDwgL1R5cGUgL1hPYmplY3QgL1N1YnR5cGUgL0Zvcm0KL0JCb3ggWyA2NC4yMzI3MDU2MjU2IDE2MC4zNzM5ODEwNTY1IDcxLjA4MzYxNDcxNjUgMTY4LjI5NDY0MjIxMzUgXQovTGVuZ3RoIDkwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nF3Muw2AMAxF0d5TeALLvzyHFRgEIQT7t4QGEdqjq2t8kPJKVCbaA9bY4BItLcHXxJCwzCo+CSVo3TC0iy+J9EdTPLwUv/jD73p6qETF0m3oTrTRDYqgHTMKZW5kc3RyZWFtCmVuZG9iagoxNCAwIG9iago8PCAvVHlwZSAvWE9iamVjdCAvU3VidHlwZSAvRm9ybQovQkJveCBbIDIyOC40Mzc0MDQ2MjE1IDE2MC4zNzg2OTM4NTI5IDIzNS4yODgzMTM3MTI0IDE2OC4yOTkzNTUwMSBdCi9MZW5n
dGggOTAgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCnicXcy7DYAwDADR3lN4Aiv+xl6BQRBCsH9L0iBC+3Q6xhMabgCiTpKpbMghpJ7phffqQcrFhddgpgyZDUeSVKn7dEky7dZ8zb/83n+bRtozyoYfADs8TS8elQplbmRzdHJlYW0KZW5kb2JqCjE1IDAgb2JqCjw8IC9UeXBlIC9YT2JqZWN0IC9TdWJ0eXBlIC9Gb3JtCi9CQm94IFsgMzkyLjUyODQxNDQyMjUgMTYxLjYxOTgwNjUzOTQgMzk5LjM3OTMyMzUxMzQgMTY5LjU0MDQ2NzY5NjQgXQovTGVuZ3RoIDkzIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nGXMuw2EQAxF0dxVuALLn7FnXgsUghCC/lNGBAh206Ora7yT8kIUgERHeGOrkATQnc+vd8nSUOdjegoyRmE6JJu2Gre7pI9mf/3Ln//Px6QMQ/v0jWilC5wcHyEKZW5kc3RyZWFtCmVuZG9iagoxNiAwIG9iago8PCAvVHlwZSAvWE9iamVjdCAvU3VidHlwZSAvRm9ybQovQkJveCBbIDY0LjQ3MTgxOTA5MjEgLTE5LjE3MjUyOTQyOSA3MS4zMjI3MjgxODMxIC0xMS4yNTE4NjgyNzE5IF0KL0xlbmd0aCA5MyAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeJxdzMENwkAMRNG7q3ADWDtjZ223QCFRhKD/K+GCsrk+fX3oS4Y+RRLmZLL0gTQ0fYZ+Fnajc3joW2ZadTLjZBg31Kwfh0Wi0Lf6wv/3OmlDcmOffIjs8gW5JR15CmVuZHN0cmVhbQplbmRvYmoKMTcgMCBvYmoKPDwgL1R5cGUgL1hPYmplY3QgL1N1YnR5cGUgL0Zvcm0KL0JCb3ggWyAyMjcuNzIzODg2MzA0MyAtMTQuOTg1OTgxOTM2NCAyMzQuNTc0Nzk1Mzk1MiAtNy4wNjUzMjA3Nzk0IF0KL0xlbmd0aCA5MiAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeJxVzMsNgEAIANE7VdCAhK9ACxZijNH+r64n3evLZARPYNwA1JwiPTtwESPmKEm8J29iD68Vr8FC4m0uuCTxGqbysialWo1mqn/8zeeLOHVFlw4/AHZ4ABsIHhkKZW5kc3RyZWFtCmVuZG9iagoxOCAwIG9iago8PCAvVHlwZSAvWE9iamVjdCAvU3VidHlwZSAvRm9ybQovQkJveCBbIDM5MC44MzUwNjE4NjAyIC0xNi42OTg1OTQ1NzI2IDM5Ny42ODU5NzA5NTExIC04Ljc3NzkzMzQxNTYgXQovTGVuZ3RoIDk0IC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nF3Nuw2AMAxF0d5TeAEsO45/KzAIQgj2b4koEKE9erpP8ADGFUAryNMqBBfpFJK9FV6zM4UlV+E5vFNzNnFckiKiVB9mSjX29p9//M3PGXHyGlc2fAfY4AZv1B68CmVuZHN0cmVhbQplbmRvYmoKMiAwIG9iago8PCAvVHlwZSAvUGFnZXMgL0tpZHMgWyAxMSAwIFIgXSAvQ291bnQgMSA+PgplbmRvYmoKMjYgMCBvYmoKPDwgL0NyZWF0b3IgKE1hdHBsb3RsaWIgdjMuNy4xLCBodHRwczovL21hdHBsb3RsaWIub3JnKQovUHJvZHVjZXIgKE1hdHBsb3RsaWIgcGRmIGJhY2tlbmQgdjMuNy4xKSAvQ3JlYXRpb25EYXRlIChEOjIwMjQwNTI5MjAwODI1WikKPj4KZW5kb2JqCnhyZWYKMCAyNwowMDAwMDAwMDAwIDY1NTM1IGYgCjAwMDAwMDAwMTYgMDAwMDAgbiAKMDAwMDAyMDY4NSAwMDAwMCBuIAowMDAwMDE4ODUwIDAwMDAwIG4g
CjAwMDAwMTg4ODIgMDAwMDAgbiAKMDAwMDAxODk4MSAwMDAwMCBuIAowMDAwMDE5MDAyIDAwMDAwIG4gCjAwMDAwMTkwMjMgMDAwMDAgbiAKMDAwMDAwMDA2NSAwMDAwMCBuIAowMDAwMDAwMzQwIDAwMDAwIG4gCjAwMDAwMTE4MjcgMDAwMDAgbiAKMDAwMDAwMDIwOCAwMDAwMCBuIAowMDAwMDExODA1IDAwMDAwIG4gCjAwMDAwMTkxMTAgMDAwMDAgbiAKMDAwMDAxOTM3MCAwMDAwMCBuIAowMDAwMDE5NjMwIDAwMDAwIG4gCjAwMDAwMTk4OTUgMDAwMDAgbiAKMDAwMDAyMDE1NyAwMDAwMCBuIAowMDAwMDIwNDIwIDAwMDAwIG4gCjAwMDAwMTg0NzkgMDAwMDAgbiAKMDAwMDAxODExOSAwMDAwMCBuIAowMDAwMDE4MzMyIDAwMDAwIG4gCjAwMDAwMTc2MzUgMDAwMDAgbiAKMDAwMDAxMTg0NyAwMDAwMCBuIAowMDAwMDE4NzAzIDAwMDAwIG4gCjAwMDAwMTc3NTUgMDAwMDAgbiAKMDAwMDAyMDc0NSAwMDAwMCBuIAp0cmFpbGVyCjw8IC9TaXplIDI3IC9Sb290IDEgMCBSIC9JbmZvIDI2IDAgUiA+PgpzdGFydHhyZWYKMjA4OTYKJSVFT0YK\n"
},
"metadata": {}
}
],
"source": [
"fig, axes = plt.subplots(2, 3, figsize=(7,5))\n",
"axes = axes.flatten()\n",
"cmaps = [\"Greys\", \"Blues\", \"Oranges\", \"Reds\", \"Purples\", \"Greens\"]\n",
"labels = emotions[\"train\"].features[\"label\"].names\n",
"\n",
"for i, (label, cmap) in enumerate(zip(labels, cmaps)):\n",
" df_emb_sub = df_emb.query(f\"label == {i}\")\n",
" axes[i].hexbin(df_emb_sub[\"X\"], df_emb_sub[\"Y\"], cmap=cmap,\n",
" gridsize=20, linewidths=(0,))\n",
" axes[i].set_title(label)\n",
" axes[i].set_xticks([]), axes[i].set_yticks([])\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jGCV0g2VYHZp"
},
"source": [
"\n",
">note: These are only projections onto a lower-dimensional space. Just because some categories overlap does not mean that they are not separable in the original space. Conversely, if they are separable in the projected space they will be separable in the original space.\n",
"\n",
"From this plot we can see some clear patterns: the negative feelings such as `sadness`, `anger`, and `fear` all occupy similar regions with slightly varying distributions. On the other hand, `joy` and `love` are well separated from the negative emotions and also share a similar space. Finally, `surprise` is scattered all over the place. Although we may have hoped for some separation, this is in no way guaranteed since the model was not trained to know the difference between these emotions. It only learned them implicitly by guessing the masked words in texts.\n",
"\n",
"Now that we've gained some insight into the features of our dataset, let's finally train a model on it!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "owRtbr8ZYHZp"
},
"source": [
"#### Training a simple classifier\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XE_N9ZiXYHZp"
},
"source": [
"We've seen that the hidden states are somewhat different between the emotions, although for several of them there is no obvious boundary. Let's use these hidden states to train a logistic regression model with Scikit-Learn. Training such a simple model is fast and does not require a GPU:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"id": "s1m9mOXIYHZp",
"outputId": "785b6df4-c077-4a8d-b550-3e4d55379c14",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 74
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"LogisticRegression(max_iter=3000)"
],
"text/html": [
"
LogisticRegression(max_iter=3000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(max_iter=3000)
"
]
},
"metadata": {},
"execution_count": 49
}
],
"source": [
"#hide_output\n",
"# We increase `max_iter` to guarantee convergence\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"lr_clf = LogisticRegression(max_iter=3000)\n",
"lr_clf.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"id": "DzGTU7lcYHZp",
"outputId": "6654ee56-433d-4e57-9814-bdfc6a814fe5",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.634"
]
},
"metadata": {},
"execution_count": 50
}
],
"source": [
"lr_clf.score(X_valid, y_valid)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B9weol45YHZq"
},
"source": [
"Looking at the accuracy, it might appear that our model is just a bit better than random—but since we are dealing with an unbalanced multiclass dataset, it's actually significantly better. We can examine whether our model is any good by comparing it against a simple baseline. In Scikit-Learn there is a `DummyClassifier` that can be used to build a classifier with simple heuristics such as always choosing the majority class or always drawing a random class. In this case the best-performing heuristic is to always choose the most frequent class, which yields an accuracy of about 35%:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"id": "DUAroofNYHZq",
"outputId": "ff8bb8b5-29ef-4e92-9e9c-ccba644727a4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.352"
]
},
"metadata": {},
"execution_count": 51
}
],
"source": [
"from sklearn.dummy import DummyClassifier\n",
"\n",
"dummy_clf = DummyClassifier(strategy=\"most_frequent\")\n",
"dummy_clf.fit(X_train, y_train)\n",
"dummy_clf.score(X_valid, y_valid)"
]
},
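{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the choice of baseline concrete, the following is a minimal sketch that compares three `DummyClassifier` strategies on synthetic labels. The class proportions, the `*_toy` variable names, and the sample sizes are illustrative assumptions rather than values taken from the emotions dataset, so the scores will differ from the ones above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.dummy import DummyClassifier\n",
"\n",
"rng = np.random.default_rng(0)\n",
"# Hypothetical six-class label distribution, skewed like an unbalanced dataset\n",
"class_probs = [0.30, 0.34, 0.08, 0.14, 0.11, 0.03]\n",
"y_train_toy = rng.choice(6, size=5000, p=class_probs)\n",
"y_valid_toy = rng.choice(6, size=2000, p=class_probs)\n",
"# DummyClassifier ignores the features, so placeholder zeros are enough\n",
"X_train_toy = np.zeros((len(y_train_toy), 1))\n",
"X_valid_toy = np.zeros((len(y_valid_toy), 1))\n",
"\n",
"scores = {}\n",
"for strategy in ['most_frequent', 'stratified', 'uniform']:\n",
"    clf = DummyClassifier(strategy=strategy, random_state=0)\n",
"    clf.fit(X_train_toy, y_train_toy)\n",
"    scores[strategy] = clf.score(X_valid_toy, y_valid_toy)\n",
"    print(f'{strategy:>13}: {scores[strategy]:.3f}')\n",
"```\n",
"\n",
"On labels skewed like this, always predicting the most frequent class should score higher than either random strategy, which is why it serves as the stronger baseline to beat."
]
},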
{
"cell_type": "markdown",
"metadata": {
"id": "vg8K9EZmYHZq"
},
"source": [
"So, our simple classifier with DistilBERT embeddings is significantly better than our baseline. We can further investigate the performance of the model by looking at the confusion matrix of the classifier, which tells us the relationship between the true and predicted labels:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"id": "NOIA8_tKYHZq",
"outputId": "b4a5f931-f828-4650-a3c3-022a2b06fdf4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 551
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/svg+xml": "\n\n\n",
"application/pdf": "JVBERi0xLjQKJazcIKu6CjEgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDIgMCBSID4+CmVuZG9iago4IDAgb2JqCjw8IC9Gb250IDMgMCBSIC9YT2JqZWN0IDcgMCBSIC9FeHRHU3RhdGUgNCAwIFIgL1BhdHRlcm4gNSAwIFIKL1NoYWRpbmcgNiAwIFIgL1Byb2NTZXQgWyAvUERGIC9UZXh0IC9JbWFnZUIgL0ltYWdlQyAvSW1hZ2VJIF0gPj4KZW5kb2JqCjExIDAgb2JqCjw8IC9UeXBlIC9QYWdlIC9QYXJlbnQgMiAwIFIgL1Jlc291cmNlcyA4IDAgUgovTWVkaWFCb3ggWyAwIDAgNDE4LjM5OTM3NSAzOTcuOTMwNjI1IF0gL0NvbnRlbnRzIDkgMCBSIC9Bbm5vdHMgMTAgMCBSID4+CmVuZG9iago5IDAgb2JqCjw8IC9MZW5ndGggMTIgMCBSIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nM1ZTXPbNhDFGb8Cx+SQFRYfBHCMm1bTHDq1o5keOj2oiuzaI9u1lcRtf30XlBQBFkGvkk5qzdAiV+B7D4u3AEhP3iw/XS6WZ9MT9d07OdlfLdYS1RUdF0qrKzoeFKopHRdS09W1dBjBpmSDp8tVeWlTgGR1ZzzFdX35h5TncvKaYNZ021TKEMH7/jaH4HwPjICpCK3KkA0OdKLY/sbPoR78Tj2GtNZA53Zf90v1i7pRk9cma6A+0vGQtai693e7G3J3rbVguwPkxbWa/Ijqza06lafqbgeqKWkZWEPcQlNEou7AhKqrRchSknJX5Qnl/EHe0V+tXmnCiQgJDUZqZDxYk2nlyUxOfkCFRs3O+wGZvZe/qhdiLebivbgRSzpbv1S/qdlb+f1Mnspei8QOIWCtYR9qa0BvIBqehCtxK/4+pDYYAOuRLkJtaqMd+GCdZZGviPyTWA7QBwM+1fT70Ai994A2MunnlPoLSv79oQBrImhbCShCbQEWO9AxV8/T9OdEPR8kjxZc7b0iNELuE1j0Fln0a/FR3Is/6bik82oQqgHVYNJmQBOEMcCfXyrqfTJBR5s/FKLOGeqT3X0otCTHX4qF+NCfKfLAXPxO56vDPBSThuvABspD8KCr0KALsv1zC4Sow9dU4F6BSZrmxUrBNjSkwFsIIfbjEElmNy6iUYMFufXgUk2+CQ2R58T0xW91Ho0vq8A9OUZNpBX5NjTowQ66vgm1dnGcfKT+CnrjoMOafhMa7HuAblMASMbTblxAswL39F0CrI23iQz6zoLpuTvS96TxRqovg73KsFRPuYox60m6xx7BnFEBOl+X2z3xLFtlZtTbzX6hX/rqFfVwqR9Yw+W7ww3A9eAGgFqyNg9Vu+3dTUSd9UPqaMINzmM+p7kfbaB1gFJ4f7HfMbRbnU3lKEY5wsl8Xt1pSiRHjA+yFiCCwNpfegNOxasBY6RCNsllXoeJXJPJs/JWQ1k1PJs2Gz5GrHYIjvZ//riOYN2Rao2g6S4dh6aFaaHRKmbssdp0A80i1WI4VptvoQXK8JEG0M28FW6i3RpNaRywVLtpzPyjpi9/laOmL83C1RnrAXkeri99yk94ywulT/loluHT/wCt8CkfjeNTesrCwClI+//6lKnTie6Z+5TZEducASufMtG0iByfstFYPmWjtXpa+BRDB8E+DWbq+bThJi4aCsdYIrloj5bIb1JDpVm4Ol3d6+dRQ6VP+cPX3EUUPuUPH2M+pS96vOZICxyfstE4WzkuWru8Kzex0b79DqcyC1OnrzcGz8T1pU/ZCW9NWoVPO3qsZO3DW1ilTZlgtun50qVMsHY1liZlgzUfEAonsXPWMWYdJpipc6b7fwhs3mhXg0FP29i/5Lf5VU4qYR24ArhH/knc5rckGMB2uP3IF+JazMVKXIp/tq/yFtTsRpyLj2JN0Xyu
+jYf+rccf5EwuRN2Kv8Fm6zYwgplbmRzdHJlYW0KZW5kb2JqCjEyIDAgb2JqCjExMDMKZW5kb2JqCjEwIDAgb2JqClsgXQplbmRvYmoKMTggMCBvYmoKPDwgL0xlbmd0aDEgMTI0MjAgL0xlbmd0aCA4NTYyIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nNV6eXwUVbb/uXWqqqv36k531k7S2RO2xI4BwiItQljFIGEJipNACDsJBFCITgCHRAQnQSAIIkQEZBMDMpBAQNCMisioA/h7PnUYBLdnRGYejmNIbr9T1QmLzvxm3l+/z68rt+reqruc8z3bPZUCBgBOOongHTJocA4MgDEArBvd9QzJfWBM1I2UZmoPBRCeHTJm7MD1X+86D4Cf0vPr99+bN9Q4IPM3NPgDav/tgTHpvhk/lawAkN6h9rgpcwpLheGW8QCyl+bwTlm0wAszorMBDH5q8+LSaXPm3b1oJoCR2rB3WmFZKRjoAGMLtS3TZi8uXnHuP++i9t8BojOmTy0sUia9UQmQkkrPe06nG9atho3ULqB24vQ5Cx57a1nIy9QmmuDg7JIphauHrSV6U3tT++E5hY+VipvkedReR23v3MI5U1O+699E7UNEz/nSkrIFV7+/HA3Q5Sl6vq50/tTSvoa/ULWrh3iaDhpWFgj+BDoQUuheKnSjuvbMBD2gHwiDckbmgW124YK5EE640i8QALhZ03qyWVPnzwVFq+lFpBm0qwICO6T1ZO8xTvO5Afgpfdx2/XxEP5+j8mngCJVPtVoHRRCwai3+Cqe+PNhfH8u383P8Fbqe1lun4N/6BecNzPrXfW6n4d+f+V/0CXI9qKN1jng9d5OmT7X2zZ46V4HC/w0FnVQEbsMi8Atc+L/LFSNJqeCAUAiDaIgBLyRCEiTTfe0eu9lP05hbPxEkOssdLQPJ3qhrQsJtI+78aeM1LZJoVLC/iVpmulqAZA82sOt0OCEEXAAdOrkF9oFL18klhfMLJ0NN4fw5c6Fm8vzCGVAzpXBuGZ2nT51P58XzZ0PNtKklVJ82f+osqJleOJf6TJ86me7MKpxbCDWzC0u82pl0+zdzChdMh5q5s7Q7JdMK50DN/IVzqeeC4rnT6Dxdm/+f6L+O2uwZ0wrvsAERgjbAoKd+lYgvF3ggXrcxJHSjoKt+jYAuhEYUpNE5gp4KdI/REdbB85/ha8iFfhYwfKYDt5yWLyDgLnRAX3AnsJ1tNr2jPvbfELqN5l30r/vBxGBf7Xqzftscd9yfeFt7/q26QP7rRjXV+90cmkpcmoKTBM+ijdVoGiVlSuQRWUzwiv8HigUnzWCWERVREMSbIzp+ucWDi8BPOrtEdnEX22SYwy7f0Qc7CvlAfcXV1GJ6W4RSuoaTziFJy0veLxtGEe55MA1m0rNFsESXvPYko+NJIcyA2TAfHgsEApcDHwUuBE4FTgaOBw4E6gOvBvYHXvmn2t/pe9d1tMz6zMGi0dejo2gcZlChOKNb2KiOYteY7SgqlbyO4qCieY9pVAgpog+IeiC9AqIUdB6jqGjyWNSx3mNUllBJYlnQAGfoOAV7YDPbSa1iuj+P7tQJB2EFLKQ7b7IzbKXQne7thGtwjnpWwRncIwIbDpl0F+BjSYDrLA8O0RzZzMWyDTKZwyjxkPig2CB+JZ6FXmKZeFYsEMtYJm6Txkk7qWTj70m2pyEWGthFKIOj+A1mYpM4SLTBRTyLe+ALWkXD4wxUw3YoJ1pcrAQqhHLhQbrztnQWNtFRQs/Psi3sHFF3lD0JF+A5FIWhsIVdIL7OwN/gScwTKgjmTKGY6H+b5jpL4zdBGTmxC8wEXOhK94h6Wmuyfo7G7tIF/bgGFbRyHmyXG2SXIYFW0RDbyd5kLfJaqINz+DDOw0/YCjFB3CUOheogAlgA1TT3Jm2MXMwWE+/aUa7NLjwqFrA98I1YYJhMc/9e44jWPCQ8SBwVQxOVR2WVeOrLVuBK
olR7Gg1nDcPFdBpPMxieIK4BSjCLZF1Cz/fDQeiOtVBNM+n8yr2kv9HIzeIl4rmaPSP8Dc7iIPI4xeJVwlpzsLUARwyyJKLAoJtXrReShhXV+0dP8L6TH9e928+aXtXgrYfceutib0MgkDtBjJLy6yVPPSYp9WJSwqV/9vBS924jcid469sHD+qYdXDBILo3ZgJVtRbdpvuDB+nPtEXrpST6G1ZQ750y3fu0+nRCn6fVqX26azYjaDsMsknNgj8hI98PnOqOI/CkwCIgXFTbWu6C9JaMXpnuhE/OneOcRlQFLovVJEUzedcEf4hc54Q6yxrn6nCjxx6DHndUOI26TuPUK9db1KsZLF5wqM5Mn9OhCik+cKiQEK+dhVWbX3iB/l544QYz8h9v3OA/MqOUy8/y96icZZl03M0y63gZr+RVvIw9wxazJewZzRddIrOeSN7bBH6/eyDWiUKdtMwAdUYlVvYgxDKzen5EvT1vQiN19vfOb2kOMuK73nK+JYOwyo9nh+xoF4VJveIcUlZSpiPOHcfZcL6RTX2XDW/bvkcsG9owtPXCHpqA5CoOJ449sMWfEhEZheEehySCQ5LEgeqLjnXWOtcakewbVJPATJ4wFeVotW1EvTtvRH1o3kMj6l15DxElGDjZO7/5fMvJkw5ndgc113VqDKr0nUH6jtV7VEdYNtHm940Vx0njDEvEJdKiqKoIA1l/hBhJauBZAIvkhZFlUQs8y6EyYnnk8qjlnl2wK8oxCSYlERNZPaHXPSzr7uSEeNmQdQ/L9Ilul2yQgVzOqbaRBGNm4f0vV/7q3GNLzk/4mrkGPxTBr+/Zs+dRtqbPnA3DHq0deN97d/m+fuPhHaXR/FvifjPJu4y4T4VSfw9wh5gqjbGV3pA6t7XOuFb21HnXJqyRV7tfSgv1hAC6IjzJXtWDrlijnKaBEJrXyb9R558AuN5CXBICasuV61da1C+vqvpBqGQwv7EopjC20FsUJ8IkFsPcLjEuPjklK4YY6UlcdWVZwcod7OGANS/xD/jXj7w9M++dOSfebtyx//D6LS89N+bE/LLT+V8yy28xKba55rO/JiW9eZevtvo363c+WlpWnph8yOv98ODjezVLKCIpbyedEmjvtMwfzaxoBUTrQECzoU5iuMzILCbwyIposamfjqg3E2NWnTGLxtj5fs0tPocm1yvn+7X4iBddsOJpEu5pTaRdzLRJGQr5FFAehafBEMq6QjLrij3ZKPaA5QHrOFbMFrIluIJZSZRGFoeZDjI7R4IjLgtlLjCexS9cON3+iJTUdhnPtmXu4nWs4E2S0BaSUBFRHg2P+BPESIOjUo2OrDO46tSVVqEOlllXG7bHhHmYCT1gUuUYtY3dLhdVI7/DWlTNWkhEavNVzYA1Cybx8OagdEJIvxwa5uB2wR1i0aTxGUa013Wb0K2VJfLz/PtH3pw+8eSsV95995XRL+ZJF/bwZ+12fvW//sJ/8HrP3JVxePPmw4nJhHY1UV+r+5NEmOBPDJHBWmmBulC5zhO6Q62zrIxf41mdZIk3eiJiQjwYFxuVRA6GlOiK7mKutF25pT5+F8VedlY4i2fFM9IZmfg+GCNMYpNYvOx2hQZpZe4eLCFewE5GEryaO4rzhQrbn9q69SkqzDjy+ZHvnLP3PTjrEpP4tc95O7/KclnUyOex79FtLx479uK2o8LihsRk/lf+/fhJ/Ptvv+T/pTuoyWxHjOahdpE2TSeZyDDFHy45BBTQIZK/kEgeKCETGcgGte29ZodmCem3+QEqpC6agCYcp229nyajJJhk5OjVO9/vnCAwGSOlbGmoNA3roV42kLaQYFgCi9uFJ9s/P8d4e6Z0YVzrMqmrthNaRfiu0vFNgHS4z58UTuimyHUx3euca2JWp7yUEW5J7OJxJ3rsRvLe5MLtcVEZaltzy/XmFh3YTlvVW9lkpLeBmdSDfE1ipi9UczK6uSbEJ2bd3TOkswNphrCqZseOmpqdO/iO
5Wsg8KeLfM2yZ1/iP/74I/9x+9A1Ty5fu3b5k2uE32+qqtr0fGXVpnHeg0tf++CD15Ye9Ma/Vf3x119/XP0WK1ywfPkCKqQxy4ijKuIoXNeYBENsBKuEiDrTDrEOVobG1qlrQlcnGTyeuJAYiI/3WHWFIfI7Y9KX/IdOfQltjngj8mTUSc/J6DdimmMNe5xNzm+cSBrTS9dtZ4iNdAWy7obMoJbEJ7NOtgiDSyM3jyA96XNw9p/5DaZ+zpA5+AH+xcjN7J4OXYolLWFW5hz3MLN/+yUL1cPZVv5QjLChU5M0fTmt79QvUO622B9jcDCBCQ7y2gMNAsKriiQzg+ARexpok20iZtpagmpDjia709F8p5X8+IOxZsYm+TN6Cr0NQ4UhhhlCsWGpYJCZUXazSDmHDZPHswnyVDZDXiyvYE/L62lPtdWs6lrEHBQImSPhNFOF2mZ+rX1ms3ThRqx4qbWreOlGLGFPdIrPEJ0K5ZR9/OFsvQrrjcucqkmhVEKKsA5wgMcouohGX5vm5nXdppjrN9vdse4B7l+5X3VLZI+ODu1IitMcuNiVFmVr+TObNj3De7N3bjDGAzf4u1J6+/vPVlU+u/PyJ5993r4LWKCV1v+G1jfAcL9NFtbDMpH5ya/5JUU9T75AX8+XQfZj0uxH0e1HAaXTfkLAGAsqsRdrUI1+Y6lxq9E4CTW/Rn5WFr9vv3qm/Sp5q9YLmvUIUE661p12mSbK4Jso+seaw4w22B0mN9oc3srYo57GhAbH6jALhGG41aiYY1FxDU4m9t873+LzBaXUfOV6G6ndW7otObI1vZubEZ0RkxGb4c2Iy4gfkOKP9sf4Y/1ef5w/Pjc6NyY3NtebG5cbn5tSmrIiuiqmKrbKWxW3Ir4mpS7lWkpM59DOQZ0DCmIKYgu8BXGlMaWxpd7SuKUxS2OXepfGhd/uAfuzXo6ELE2tk8lOM+Nuj6WhwomL+5aVbGxsaBjQ9NS+M+03mPDyhoLDeVNPTPzva0Jmcfnkso8PpY1sX7anuPDUtuMnnRWrevTYk5LSpmF1lLDaLrvI03igtz8CGy12Y2O4e7W9IWpDBDidQ8ItshKZE60ph++6viG7ovmYt65mHC6IWRpTF4NEZ6dtEalMd9G0byRaUygUZuIXLz/77Mtaaf9tnwPl70Eg8F75gT6NjUL6ma++OkNFeLCokDfxv9PRVFi0i6hhMC9wGb8iGUbAAH8UVLKnRFul9SlTo0NsDCPhRRqcVhjqGhyptl3xdfoIfv2q+sNVTW2j1KilUTVRdVGa2uqho4O6Xm7dNwRjB3416oXc195667XcF0bdv2NSO/+IdWfy2G1i1r6uXS+fPXu5a9c9iYnEkI05WZ8EQouoEicSfWoQrchGsLkaJWW1rYFtwDARFGGIw2keHK3vpn2+m2g134EWGWxQmLTJBhIgu83v4raGhj4HHj8TgMCZxw+0v0247dpF2OFh4ZGfWnYVFbJBTKFjUCF3d8DXQVcFoeWiLLfUnwhuZqxUnpLcu5nUaGHHwhudDZbVnii3oLgVGCE47YM9OonN+q5WAy8YkK8HPWzagOjS6LroD6KvRUsDYAAbIAxwD4iSuhnSlXRjN1MJlLASocRdEmWcNE8DOE7fUujY6h6XFMCgg24QK9oOWs4emfn25CkfzOLX+dssre1zZmgQdjy1qdEmPDLxxNt3372/SzfWm5lYCLuPf9a84dD+LZp3TSf1/DthHQL5fo+kMouyW2ZVsMEmN5mEEAqvRkmx2s0jXdqO1aRthMzaRmhEvU2vaxukfs1t/ZqbnbpBXyHfpl6lPR7FwsN+d667zk0hw01ERrOgK0nIytSMS/h7/ZT7WTr/sLG+fv9x2bUxd/qU6rZ0/LB61LG9GtZ8nDiRsDbT7nq4PyHCEm10VoaENtqxMTmhIaXJ2Gg/HhmdHAGKZYjsdHoHp+lxOagOzVeCCsEvaEhnk1Z0WdqlrsvPbChMFW7t1PqzDlWhfCw0
LCsTt+1Yv27HjnXrdzRw3lq4b/ToLQ/+7lD2wcf/0Nb2h8cPZjcI/d/59NN33v7002/55/yb6JjXunU5/vpDUyazPkzbw/SZPGWPZvmnCOTFhC+SX+5KfvmEeACaBIkpIuQoahuFKQ21Ni0MaF4311hAnpfsKYS0V9vknmqgn1hwo052fUPzBT7h4/T5zGCHQX6PWTCA7YTFUCUdhybLAVVRJfkBK1MskKPqs1/Jdt7ao+hg0EIOvyPXUeAodQQXcskduUNwwZd+l3PXjJH6qqs/Orm5cKOc+g1pyi1O4o/ABoEpkHMzC/ZbVckv5UoFUql0jUKyTj6RLrt+atG0jAK5IZqkGQ8T/cmy0xhuBzna4LZURXuxIaopQjWAw64ocq5Dsed6wskVJuhxso0ipZ4R9ut35bqeRjjDiIOQjMTcxNLEmsQ6Ol5PvJgYSDSSbHU/7SbcgkK+vZLp1h+KaYNPLn/1ROP8hdU7G+c/+szOxsYB9YuX7MWVjy/64fP2h4UtL24+sb29Stiy7fnXX2qvEgv2T5v8eAcHYhFxEAI9yX8bAW1MrrI5GixNJiYoMEqLcDl6cNfddz9N/3RiDxW433cLmj/6B+QUNTz++Pp9jY0DX1t46i1hu0bA1i0aAbTw1KLvO3zOQt0OwsgOQuRGJzRaGrR3Ck77aHS6B//snYI/YUBEOZTLFYYKpcJYYaowl1sqrBW2CnuFWuEod9ZFXItw3Lnrv+PVQ9m6fXvXr923b+015uRXr/2Ff88cePGr06e/+vqdt7/ZzN/hLfw7cjDZ5EdcrLcW28hStxOFmre+xx/V6a0bbKvZcWyKJk89RPfZt0U3Sp86HbbfGPTYf44R2aSkm9B0hLY7Ql5ZY+OtyCb07ox3u9r3y6Y9t8U29m2nyw7KDYcTdQ6gFEg2k56ZscrWYGwymGTa/OQ4Nbeh2yD56fPvaY75UG7I1hBNYsGIdktcYTg8dli3zS8THUdXhPTw4CGn48yJ9oMkrOIpkkSrlVA8fZtWS4Gv/P2sFsFmHhMboxgFg2lMbGzMQJM5JlZ0U5xdKboq3SvDtTibRHE2NcZkjo0ywINRis2guOIHp2pUnW+5ohlsdnZn4P1BC7yaTunbWtt3tIk06Gfa20KKtred4zF5zB5LDwof3czdLH2NfU19zX0tZi94WaKQako1dwlJd6W7u4SmxqTGpnnT4hJTKk2V5kpLpVV7o8sEQTbJZrSgFW1oRxUjMBKj0CNGG1PS0wak/SqtIm1pWk1aXdq1tHDaGM+7Ffdj9XcScsLtyW8605KgnoQdrhq1a+LKlZPXDWje8eN/THxzdvFbhctXT93r3/vcn/9QfEgcsD81NS/PPyzO1mXjys2HExJOZGXljx6Rm2RPXL98yz49c+xFDvCv0hayQdoV2CTFjrvBwZqUKpOZMCYdU502zQb1gOTT41FL0GuQ2zv4qptpVqhFIVdoXy0mJWdp0cjBHmXlfMWIsuPHL2yrqpK28Deq2+tWjtq09Y9CQTW7R/Pi+8kKJ+jW74K+fs8t+19tYk2uBgtZv8s8ivxAjlszx+ygRl3x3XQCJe6TmhMIoU1J0Oxu7k6S2X7NCbzS0HDfgYWn3mHvs6PCzvbCrVtPbBfKb9TtK55yDXdp3PcnD1QhFlDefMOfoqfMAiVDknZBQQaZOQDkgZQUvS7JEgpMEsGgvevTgzYEg7YrT3v3pr3aAP2VU1jwbVv2rRRb6Sh6qu3/7VBhplAuVAiVwlJhjbBdULSFjGgkLaacCSPFZEhmaZgmepUsyGJ9sI+YoeQA5VI4TMyRhsp+ZRyMY/mYL+YqxVDMZuAMcZo0XS5QFsICVo7l4kJpibwCVrCVuFJcKVXKtVDLNgib8DnxOWmDvEt6Wa5XTioXlYByj5aJZRpZJkvo/yZ7hD3yJn+4VSxoy8N9N+oIob6E0GJCyMzu8+dIWpIoOlA0aBdJpLwRHYLAzA7tv5AOo4lpF7PJoBiMDkUxDDQZ
RCYqhJ7QUSNmLUEAtVdAD42oV7WTQ4dP7sRTqwffXjYHczqKuWH/EM9/hO9zJlE0RYpuU7Kpv3iXaaw43jDBVGxaxJaIiwwLTM+Iy00bxa3iBsOzphrTTrZbfFXcYXjJVGfymFCUJKPJHIluyW2MNKdhspRk7GL2WvuwbOwl3W3oacw2Z1iHYY402Djc7Lfma3IQ8nG8NE7ON4xTxhnzzbnWEutjrML6PFtn2Mu2G+qt71svWgPWdO09m5BgZPRHgItFfBbb8zE/yo9+zF7j8z9maSxNLGi/2H6KNfChwnAhlM9j1bqWUjTQtNTOVvnvMyiC0QF2DWYAu81hB7vVYbGCdrFZTWaTxWE2mwZazUYVzFIVHreZm1Sb1WIyygiKXbSb1U4BKDrs5ttgNwdfmuqoq2RqLR0Z5D+EXtHeIof5NMyvySApshGtoaYwq2pNsGZZh5keMI2yTjRONM00VVmXWtdanSYgIsySxWwz28OYW1BFVQozucwuS6Qt0p4CieRPvaJXSlNSjUmmRHOiJcXaxdbF7nX0IjvIEjLEDKm3qae5p6W3NduWbc9w3At+5hf86Bf9kl/2G/zKQONg0xDrMNswu9+RB6PZaGEs5oq5JJ+xJJ/xxvGmseaxlnxbvj3XUcyKhemmGbYZ9gJHufKY7TH7SnjauMK8wrLSutK20r7RuN683rLJtsm+3bzdste2117veN9x0RFwTCVZSjYW3OINYJo8M4W1o9Y9vnb2yLzMON43aErT31myaWhlnjiqbR3O1iQ5gSLnJyRJI7zgj1SC7+nIWAYqu6EJd0sKMhCZbOp84WsJ2kXwjWlIXqeNNPs63uO1/OJFnn+gJIQKycIQYZhBMit2czhGKV0Vr7knZisZZg2twTpa9ynjMV/5lbmAFQjFWCAWSJOVCvNS86vmqI43fNrbfRY3D2e2jxQOtT0hHGqfKhbsavtk7S5M6shypD20d02Eof6QZD2pscSFW2MUhyVOdY1M0mKGT4saaj8tl2m+C/wOo9Wx2ylEVkH4BjnW2WS2p/f70ufj/a76KMHxZdyR1NxKbPTAYtAeaHFP2tOZ5XCDnujsr5+Sksx+uiPj6cx6NqamTp8SzH7Stf+rEL1uyswf9DuiciBMCbW7REXBUJM8MvIWvbwf7af8ToVCoVplCz8ResC2wQhNEtOovcr1978+2vMHKE2voXRd1ZP1X5JMFDPa/YvDg5S+8lqjRvlPjY1a7tFJ45FXNaLZwW86MO2gsZffHpZDGY3Joiiq6LSNDNXoC5KnUUf5426jSEmlw9hkFTTCuE4V06H7eWpI3gQa+ZFb2aGWREjlP8sPaXW5jVZPg+6UU4Sn54R1VbqoUW4lsosRYmUlMcYYnzyyxy2gmn3auU2HKywqNmF3ooPy3O4nuhxQYUOoIbEpIjpOSzN8Ps2XqC0++gtKuUOavXr2uglVp8xvS2glQk9LajXxjvekjHqSRH2/cFCDs0P2SGiS2INSHusJSdHA7AS3kzWhE1udu2QY5Q9NzbEqami4S1GN2j9Z4qKMsQkjU27jTGdMV4Nwj3d3nEOosiRvcBvimuyRMUGWrvf7JT89M38G/8+y86Ce3i6LDj5u8rD3drnclI32VYKQvzHhUqT3V/Z+P0Cson+E8OGvbVc6rz9+1DbSlm/UvhpSbn6zQOMMc3g0gI3/+FHraFv+L75v6Cae1b8ZAEFLrVfp5RMpDKqoXKJSS2UzlSIqW6hUU9lFZRWVZVIbnBbPUbkcaBW/gnLJBUfFYphH13liC8wTPoR0rS5kwykhO/CJdjWcgaOSk/p9ofc7isOp3hVKMAF60f39YhP0p9JXv2bCBOlJmsMULIYmkmL6HfS7YBw8wyxsKftM6Ce8LLTgWHxdDBHHiKukOGmO1CCL8lz5qCHGkGN40vCiIij3KNXKn4xuY7OJme4znTfXWsotl6zZ1lXWd23Ztud0hLpjHnSF6WAhq1Bho4ao6BZC6ap9z2CAidr/
sEUjAZyhf/+h1RmEUitYF0BhOR11vO2+eFtdgnA2qqMug4sVw31QAqWwGObDDJhGqy8AL6TCFLJHL/ggg45Mqk2mHl4YSH0WQBmV+TAVCmEOdKO7w2Au9e9BtXthNh1eePDmXGV6aypdp9KYRXQuop6mf2PVnjdX1b6gWURraV8tzKXeGh2FNOZ/t+Igqs2kceNgIfWYQn0L9dmm6iMKdY68NMtcOpdSn8k07wzq56XxJbR6of7s5/OM0WcpI4pK6JhFd7VVy6hviT6Tj9bOhKw7RnWOEYLKFPi1/s3TL3/ddVvSvoBz6PmKW//aLpwiSSREUY5+F83ci2YeDDkwBIaSHIbDSLgfHoBcGE0YjIGxtN54iv/58BA8DJOYACfhlPZWy7Bw7oycgRkZHdfMjuvdDcJSf+AGx1YX/pSEf/fhj7X4Nxv+wPE6x/9Owr/a8C+1eC0Jv3/6Xul7jldr8btabGnFb1vxvzh+0we/HohfcfzSh19cGSN9UYtXqOOVMXj583Tpcit+no6XOP6Z40Uf/smFn9Xipxw/ceJ/PoEfH8P/4PgRdf/oCbxwfoh04Qk8PwTP/TFKOsfxj1H4IccPOL7P8Q8cz9bie2dipPc4nonBd314muNbKxzSWx78fSg2c3yT4xscT3E8yfF1jic4HufYxPEYx6MObKxMkho5Nhw5JjVwPHJ4knTkGB5ZKh7+XZJ0eJI/gIf94u+S8BDH12rxIMcDHOs5vspxfxG+YsN9e5OkfUW4d49T2puEe5y4m4je3Yq7OL7McSfHHU7czvGlbTbpJR9us+GLRVhHXepqcSvHLS9YKDPFFyy4+fkIaXMRPr9JlZ6PwE0qbjThcxw31FqlDRxrrbieBq2vxXVrbdK6VFxrw2dbcU3NMWkNx5rqSVLNMaxZKlb/NkmqnoTVfvG3SfgMx9WrekirOa7qgU8Tm0/fiyufMksrXfiUGavoRlURVhJSlUm4woG/4fjkcof0JMflDlzGcSnHCo7+wK+feEL6NccnnsDHi7A8zy2VJ+ESjos5PmbDRy24yIQLOS5oxbJWnN+K81qxlGMJx7kcZ8fhLI4zHQOlmWNwBsfpT+A0ahRznMqxiOMUjpM5FvbBglZ8xIKTOD7EcSLH/AkmKb8VJ5hwfGiENN6H4ziOpZXHDsQ8N45hqjQmHB904ejhIdJojrlmfIDjqPtVaRTH+1UcyXEEPRnBcfgwVRoegsOirdIwFYdacQjHnFocXIuDON4ndJfua8WBx/DeEejnOIDjPf2d0j0u7N/PLvV3Yr++VqmfP2DHvlbswzGbY+9eLql3K/bqqUq9XNgzyyz1VDHLjHfHYKYVfXeZJR/Hu8yYkW6WMqyYbsYe3Y1SDxW7G7GbD7t2SZK6FmGXNKfUJQnTnJiakiSl3ospSZicZJaS7ZhkxkSOCRzj7RhHfMY50VuEsa0YQyzEFGG0FT2EoIdjVCtGDsQIakRwDC/CMEIqjGMoDQqNQDdHF8cQjk7q4OSUI3eXHANRfQLtRWjjaLWESlaOFuptCUUzR5OKRo4KdVM4GlwoF6FID0XSADfSXeQoUFvojkxF4MgaWNGKZ1jX/x9+8P+agP/rL/p/ADs9e58KZW5kc3RyZWFtCmVuZG9iagoxNyAwIG9iago8PCAvTGVuZ3RoIDgzIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nKWMRwqAUAwFB+zdb++93P+GZuFWEXwwjwkJgZ/RXna6YGBiYePg4uETEH7+HQnx7erxKiElI6egpJK5ppFu6egZGMUnZhZWNvGdg/MCgMUDCgplbmRzdHJlYW0KZW5kb2JqCjIwIDAgb2JqCjw8IC9MZW5ndGggMzU0IC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nF1SPW+DMBTc+RUe0yEiEDCNhJCqdGHoh0o7VRkIfkRIxViGDPz72j6bSrUEpzu/ez7wi8/1cy2HhcXv
euoaWlg/SKFpnu66I3al2yCjJGVi6BbP3LsbWxXFxtys80JjLfspKksWf5jNedEr2z2J6UoPEWMsftOC9CBvbPd1biA1d6V+aCS5sENUVUxQb9q9tOq1HYnFzryvhdkflnVvbH8Vn6siljqeIFI3CZpV25Fu5Y2i8mBWxcrerCoiKf7tJ9527bf61NYDvoEXJxNk8jJBPqL6eIIcaAJIAUdABsgBHFAAHkMb1zXDYZk/LPOH5eie+2i5j5ajsYVvoJM5QvACcqDIxJGJw8qRifPgQIMTaOsbbBS7naOFD+MpFwB8AO9DjbMUOLvgsASKKAWiFHmogQU/pvAZAj0F9WIvNdyevV87jNvwdHetzdy4iXUDY0dlkLQNtZqUddnnF8/7ylUKZW5kc3RyZWFtCmVuZG9iagoxNSAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvQ0lERm9udFR5cGUyIC9CYXNlRm9udCAvQk1RUURWK0RlamFWdVNhbnMKL0NJRFN5c3RlbUluZm8gPDwgL1JlZ2lzdHJ5IChBZG9iZSkgL09yZGVyaW5nIChJZGVudGl0eSkgL1N1cHBsZW1lbnQgMCA+PgovRm9udERlc2NyaXB0b3IgMTQgMCBSIC9XIDE5IDAgUiAvQ0lEVG9HSURNYXAgMTcgMCBSID4+CmVuZG9iagoxNiAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvVHlwZTAgL0Jhc2VGb250IC9CTVFRRFYrRGVqYVZ1U2FucwovRW5jb2RpbmcgL0lkZW50aXR5LUggL0Rlc2NlbmRhbnRGb250cyBbIDE1IDAgUiBdIC9Ub1VuaWNvZGUgMjAgMCBSID4+CmVuZG9iagoxNCAwIG9iago8PCAvVHlwZSAvRm9udERlc2NyaXB0b3IgL0ZvbnROYW1lIC9CTVFRRFYrRGVqYVZ1U2FucyAvRmxhZ3MgMzIKL0ZvbnRCQm94IFsgLTEwMjEgLTQ2MyAxNzk0IDEyMzMgXSAvQXNjZW50IDkyOSAvRGVzY2VudCAtMjM2IC9DYXBIZWlnaHQgMAovWEhlaWdodCAwIC9JdGFsaWNBbmdsZSAwIC9TdGVtViAwIC9Gb250RmlsZTIgMTggMCBSIC9NYXhXaWR0aCA5NzQgPj4KZW5kb2JqCjE5IDAgb2JqClsgMzIgWyAzMTggXSA0NiBbIDMxOCBdIDQ4IFsgNjM2IDYzNiA2MzYgNjM2IDYzNiA2MzYgNjM2IDYzNiA2MzYgNjM2IF0gNzgKWyA3NDggXSA4MCBbIDYwMyBdIDg0IFsgNjExIF0gOTcgWyA2MTMgNjM1IDU1MCA2MzUgNjE1IDM1MiA2MzUgXSAxMDUKWyAyNzggMjc4IF0gMTA4IFsgMjc4IDk3NCA2MzQgNjEyIDYzNSBdIDExNCBbIDQxMSA1MjEgMzkyIDYzNCA1OTIgXSAxMjAKWyA1OTIgNTkyIDUyNSBdIF0KZW5kb2JqCjMgMCBvYmoKPDwgL0YxIDE2IDAgUiA+PgplbmRvYmoKNCAwIG9iago8PCAvQTEgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMCAvY2EgMSA+PgovQTIgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMSAvY2EgMSA+PiA+PgplbmRvYmoKNSAwIG9iago8PCA+PgplbmRvYmoKNiAwIG9iago8PCA+PgplbmRvYmoKNyAwIG9iago8PCAvSTEgMTMgMCBSID4+CmVuZG9iagoxMyAwIG9iago8PCAvVHlwZSAvWE9iamVjdCAvU3VidHlwZSAvSW1hZ2UgL1dpZHRoIDQ2MiAvSGVpZ2h0IDQ2MwovQ29sb3JTcGFjZSBbL0luZGV4ZWQgL0RldmljZVJHQiAyOSAo9/v/9vr+9fn+9Pn+8vf97/b87/X87PT77PP76/P74u344e343+v33er23On22uj1
1+b01eX01OTzz+HyzN/xqs/losvinsrherbZWqPPVJ7NMIC9CEyWCDBrKV0KL0JpdHNQZXJDb21wb25lbnQgOCAvRmlsdGVyIC9GbGF0ZURlY29kZQovRGVjb2RlUGFybXMgPDwgL1ByZWRpY3RvciAxMCAvQ29sb3JzIDEgL0NvbHVtbnMgNDYyID4+IC9MZW5ndGggMjEgMCBSID4+CnN0cmVhbQp4nO3RyVECUQBAwZFFGRZBUETB/NM0Bqpe/VN3Cj09SsfSqnQozaWX0qRTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1KlTp06dOnXq1Dmic1f6K21K69JraVHSqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06deoc0nkq/Za+S9tS2jmldOrUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06R3ReS+fSqvRTOpQWJZ06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06n+38LL2X1qV96V56K+nUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06dQ7p/CjdSsvSXLqUvko6derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp06derUqVOnTp0jOv8B5Zr9pQplbmRzdHJlYW0KZW5kb2JqCjIxIDAgb2JqCjEwMzQKZW5kb2JqCjIgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9LaWRzIFsgMTEgMCBSIF0gL0NvdW50IDEgPj4KZW5kb2JqCjIyIDAgb2JqCjw8IC9DcmVhdG9yIChNYXRwbG90bGliIHYzLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZykKL1Byb2R1Y2VyIChNYXRwbG90bGliIHBkZiBiYWNrZW5kIHYzLjcuMSkgL0NyZWF0aW9uRGF0ZSAoRDoyMDI0MDUyOTIwMTExNVopCj4+CmVuZG9iagp4cmVmCjAgMjMKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMDE2IDAwMDAwIG4g
CjAwMDAwMTMyMjUgMDAwMDAgbiAKMDAwMDAxMTYyOSAwMDAwMCBuIAowMDAwMDExNjYxIDAwMDAwIG4gCjAwMDAwMTE3NjAgMDAwMDAgbiAKMDAwMDAxMTc4MSAwMDAwMCBuIAowMDAwMDExODAyIDAwMDAwIG4gCjAwMDAwMDAwNjUgMDAwMDAgbiAKMDAwMDAwMDM0NCAwMDAwMCBuIAowMDAwMDAxNTQzIDAwMDAwIG4gCjAwMDAwMDAyMDggMDAwMDAgbiAKMDAwMDAwMTUyMiAwMDAwMCBuIAowMDAwMDExODM0IDAwMDAwIG4gCjAwMDAwMTExNTYgMDAwMDAgbiAKMDAwMDAxMDc5NiAwMDAwMCBuIAowMDAwMDExMDA5IDAwMDAwIG4gCjAwMDAwMTAyMTQgMDAwMDAgbiAKMDAwMDAwMTU2MyAwMDAwMCBuIAowMDAwMDExMzgwIDAwMDAwIG4gCjAwMDAwMTAzNjkgMDAwMDAgbiAKMDAwMDAxMzIwNCAwMDAwMCBuIAowMDAwMDEzMjg1IDAwMDAwIG4gCnRyYWlsZXIKPDwgL1NpemUgMjMgL1Jvb3QgMSAwIFIgL0luZm8gMjIgMCBSID4+CnN0YXJ0eHJlZgoxMzQzNgolJUVPRgo=\n"
},
"metadata": {}
}
],
"source": [
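"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix\n",
"\n",
"def plot_confusion_matrix(y_preds, y_true, labels):\n",
"    # normalize=\"true\" divides each row by the number of true examples in that class\n",
"    cm = confusion_matrix(y_true, y_preds, normalize=\"true\")\n",
"    fig, ax = plt.subplots(figsize=(6, 6))\n",
"    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)\n",
"    # Show each cell with two decimals and hide the colorbar to keep the plot compact\n",
"    disp.plot(cmap=\"Blues\", values_format=\".2f\", ax=ax, colorbar=False)\n",
"    plt.title(\"Normalized confusion matrix\")\n",
"    plt.show()\n",
"\n",
"# Predict on the validation features and compare against the true labels\n",
"y_preds = lr_clf.predict(X_valid)\n",
"plot_confusion_matrix(y_preds, y_valid, labels)"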
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kKKoY7wsYHZq"
},
"source": [
"We can see that `anger` and `fear` are most often confused with `sadness`, which agrees with the observation we made when visualizing the embeddings. Also, `love` and `surprise` are frequently mistaken for `joy`."
]
},
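{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the normalization in the plot concrete, here is a minimal pure-Python sketch of what `normalize=\"true\"` computes: every row (true class) is divided by its total, so each non-empty row sums to 1 and the diagonal holds the per-class recall. The helper `normalized_confusion` and the toy labels are hypothetical and serve only as an illustration; they are not part of Scikit-Learn.\n",
"\n",
"```python\n",
"def normalized_confusion(y_true, y_pred, num_classes):\n",
"    # counts[i][j] = number of examples with true class i predicted as class j\n",
"    counts = [[0] * num_classes for _ in range(num_classes)]\n",
"    for t, p in zip(y_true, y_pred):\n",
"        counts[t][p] += 1\n",
"    # Divide each row by its total so that every non-empty row sums to 1\n",
"    rows = []\n",
"    for row in counts:\n",
"        total = sum(row)\n",
"        rows.append([c / total if total else 0.0 for c in row])\n",
"    return rows\n",
"\n",
"# Toy example with three classes (made-up labels)\n",
"y_true = [0, 0, 0, 0, 1, 1, 2, 2]\n",
"y_pred = [0, 0, 0, 1, 1, 0, 2, 2]\n",
"cm = normalized_confusion(y_true, y_pred, num_classes=3)\n",
"```\n",
"\n",
"In this toy example the first row is `[0.75, 0.25, 0.0]`, meaning that a quarter of the class-0 examples were mistaken for class 1."
]
},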
{
"cell_type": "markdown",
"metadata": {
"id": "Va06uCGWYHZq"
},
"source": [
"In the next section we will explore the fine-tuning approach, which leads to superior classification performance. It is, however, important to note that doing this requires more computational resources, such as GPUs, that might not be available in your organization. In cases like these, a feature-based approach can be a good compromise between doing traditional machine learning and deep learning."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9po3YY14YHZq"
},
"source": [
"### Fine-Tuning Transformers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BRMmvwPpYHZq"
},
"source": [
"\n",
"Let's now explore what it takes to fine-tune a transformer end-to-end. With the fine-tuning approach we do not use the hidden states as fixed features, but instead train them as shown in <>. This requires the classification head to be differentiable, which is why this method usually uses a neural network for classification.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6D83BrpcYHZr"
},
"source": [
"Training the hidden states that serve as inputs to the classification model will help us avoid the problem of working with data that may not be well suited for the classification task. Instead, the initial hidden states adapt during training to decrease the model loss and thus increase its performance.\n",
"\n",
"We'll be using the `Trainer` API from Hugging Face Transformers to simplify the training loop. Let's look at the ingredients we need to set one up!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "22o5RmRGYHZr"
},
"source": [
"#### Loading a pretrained model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TG2KaelrYHZr"
},
"source": [
"The first thing we need is a pretrained DistilBERT model like the one we used in the feature-based approach. The only slight modification is that we use the `AutoModelForSequenceClassification` model instead of `AutoModel`. The difference is that the `AutoModelForSequenceClassification` model has a classification head on top of the pretrained model outputs, which can be easily trained with the base model. We just need to specify how many labels the model has to predict (six in our case), since this dictates the number of outputs the classification head has:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"id": "U5AecdTvYHZr"
},
"outputs": [],
"source": [
"# hide_output\n",
"from transformers import AutoModelForSequenceClassification\n",
"\n",
"num_labels = 6\n",
"model = (AutoModelForSequenceClassification\n",
" .from_pretrained(model_ckpt, num_labels=num_labels)\n",
" .to(device))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tMfv5H4MYHZr"
},
"source": [
"You will see a warning that some parts of the model are randomly initialized. This is normal since the classification head has not yet been trained. The next step is to define the metrics that we'll use to evaluate our model's performance during fine-tuning."
]
},
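{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the number of labels fixes the size of the head, and why a freshly created head triggers the random-initialization warning, consider the simplified pure-Python sketch below. It assumes the head is a single linear layer, which is a deliberate simplification: the head used by the library may contain additional layers. The names `init_linear_head` and `apply_head` are hypothetical.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"def init_linear_head(hidden_size, num_labels, seed=0):\n",
"    # Randomly initialized weights and zero biases, as for an untrained head\n",
"    rng = random.Random(seed)\n",
"    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]\n",
"               for _ in range(num_labels)]\n",
"    biases = [0.0] * num_labels\n",
"    return weights, biases\n",
"\n",
"def apply_head(hidden_state, weights, biases):\n",
"    # One logit per label: dot product of the hidden state with each weight row\n",
"    return [sum(w * h for w, h in zip(row, hidden_state)) + b\n",
"            for row, b in zip(weights, biases)]\n",
"\n",
"hidden_size, num_labels = 8, 6\n",
"weights, biases = init_linear_head(hidden_size, num_labels)\n",
"hidden_state = [0.5] * hidden_size  # stand-in for a sentence representation\n",
"logits = apply_head(hidden_state, weights, biases)\n",
"```\n",
"\n",
"The head produces exactly `num_labels` logits, and until it is trained those logits carry no information about the emotion classes."
]
},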
{
"cell_type": "markdown",
"metadata": {
"id": "GMZfOX4-YHZr"
},
"source": [
"#### Defining the performance metrics"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i1DTcYEIYHZr"
},
"source": [
"\n",
"To monitor metrics during training, we need to define a `compute_metrics()` function for the `Trainer`. This function receives an `EvalPrediction` object (which exposes the model outputs and the ground-truth labels through its `predictions` and `label_ids` attributes) and needs to return a dictionary that maps each metric's name to its value. For our application, we'll compute the $F_1$-score and the accuracy of the model as follows:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"id": "_hqulHJcYHZr"
},
"outputs": [],
"source": [
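"from sklearn.metrics import accuracy_score, f1_score\n",
"\n",
"def compute_metrics(pred):\n",
"    # Ground-truth class IDs for the evaluation set\n",
"    labels = pred.label_ids\n",
"    # Greedy decoding: pick the class with the highest logit for each example\n",
"    preds = pred.predictions.argmax(-1)\n",
"    # Weighted F1 averages the per-class F1 scores by class support\n",
"    f1 = f1_score(labels, preds, average=\"weighted\")\n",
"    acc = accuracy_score(labels, preds)\n",
"    return {\"accuracy\": acc, \"f1\": f1}"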
]
},
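{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what `average=\"weighted\"` means, the sketch below computes an $F_1$-score per class and averages the results with weights proportional to the number of true examples in each class (its support). The function `weighted_f1` is a hypothetical pure-Python helper that only mirrors the definition; it is not part of Scikit-Learn, and the toy labels are made up.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def weighted_f1(labels, preds):\n",
"    support = Counter(labels)  # number of true examples per class\n",
"    total = len(labels)\n",
"    score = 0.0\n",
"    for cls, n in support.items():\n",
"        # True positives and false positives for this class\n",
"        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)\n",
"        fp = sum(1 for l, p in zip(labels, preds) if l != cls and p == cls)\n",
"        fn = n - tp  # true examples of this class that were missed\n",
"        denom = 2 * tp + fp + fn\n",
"        f1 = 2 * tp / denom if denom else 0.0\n",
"        # Weight the per-class F1 by the share of examples in this class\n",
"        score += (n / total) * f1\n",
"    return score\n",
"\n",
"labels = [0, 0, 0, 1, 1, 2]\n",
"preds = [0, 0, 1, 1, 1, 2]\n",
"score = weighted_f1(labels, preds)\n",
"```\n",
"\n",
"Because frequent classes receive larger weights, the weighted average is a reasonable summary when the class distribution is imbalanced, as it is in the emotion dataset."
]
},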
{
"cell_type": "markdown",
"metadata": {
"id": "MTdFE1SdYHZr"
},
"source": [
"With the dataset and metrics ready, we just have two final things to take care of before we instantiate the `Trainer`:\n",
"\n",
"1. Log in to our account on the Hugging Face Hub. This will allow us to push our fine-tuned model to our account on the Hub and share it with the community.\n",
"2. Define all the hyperparameters for the training run.\n",
"\n",
"We'll tackle these steps in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AGjz7aAyYHZs"
},
"source": [
"#### Training the model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uPneJkQ1YHZs"
},
"source": [
"If you're running this code in a Jupyter notebook, you can log in to the Hub with the following helper function:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"id": "nYBCECLFYHZs",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 162,
"referenced_widgets": [
"642417ea01e24915b65da150124d01da",
"1c4b233289a4422e9d758fd1f6e30031",
"fdd33ad1a8fc48588fbfd18eac7e3e4f",
"6bb03feac2d348748a1b5d878e7fc68a",
"5a0680163c7d4cd39dde8f4ec58da9cc",
"86479ee5cea7476880e595572aa2040a",
"464143e721f8449ca9308a5d61ad5a1d",
"0d2c23f33edd41208a4042ec9dbe7d4f",
"fd47746280514956941afc6002d8de1b",
"68828b8f0c1e4b34b82159ed66b96b71",
"2d81a8f563844079b3a8af53a9a5000e",
"a006e55ce254455fae1b73c42cf90aa7",
"20d3b4d0faf34fe7b9d9e911b35a869e",
"400a01a35a334474afffef66bdb6d323",
"359d985a530043bbb57f82c0112f53b8",
"6a8fa56aed9a445aa85c546351d75b2d",
"9451f7610cca46879351926a62bf4c34",
"dd803abc08be4f54b2fe47bef68ce566",
"0b3cf559f10043319e1153a21aa599ca",
"820c8396fc8348dc82386c5bfd6287d2",
"3203ee75d70842208df43f9ef633d8ea",
"562b7ce7964f477eba0bf9e0bb394169",
"103530bb23774dc9822a9c7eb35a3987",
"01addce02fea41d3900521870cc5f590",
"c6fb89466b3a4bc880ff07ef2e08c25a",
"29715012e05f41dcaa6d8dae9c011800",
"5abfb1d770304119aab8da024a5da3e5",
"e65677e8b4fc422bb3cd2cdc9f0e3020",
"b928f790a6f545beb3e4773a469950fb",
"12307eb92a5741ef87c046201b170bc8",
"8ed20866b8364a79933287fc200d614a",
"6159f819197e438ab0d5db80b815c3d5",
"5509b23a87564c8996dc2f9f5fb51039",
"5e637addbfa0463bac41215205bb7da9",
"af1d2ee1869c44e9bd6d13a9d7bcd460",
"74c6796966f74e15bde7a39e7fe707fd",
"a875372184584f07a0c83679d61d2709",
"342003f6a8b646ab8ca4a9476262f7f9",
"e26de86da09e4f99ba6e33dde6dc668e",
"cfe76db1392149c2bbf90166e484496d",
"7c461d8934d44ce18d32b3f18c3e5be9",
"41fd3f461df54f9380be1225b08543c5",
"bdb2c999cd87447894f7c4a004be1031",
"3fb820ff1df74c97890e106a50198358",
"e4a1bbdb599e444da824089b1a1d7966",
"7dfc741abbcb4eacb73e13dc6020a625",
"44e9346f2dd841f5a32dc080d4b07947"
]
},
"outputId": "e2a5205f-27a4-4156-8a65-dbe0e06db39c"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"VBox(children=(HTML(value='
"
]
},
"metadata": {}
}
],
"source": [
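"from huggingface_hub import notebook_login\n",
"from transformers import Trainer\n",
"\n",
"# Log in to the Hugging Face Hub; in a notebook this displays a login widget\n",
"notebook_login()\n",
"\n",
"# `training_args` is assumed to be a `TrainingArguments` instance holding the\n",
"# hyperparameters for this run; it must be defined before this point\n",
"trainer = Trainer(model=model, args=training_args,\n",
"                  compute_metrics=compute_metrics,\n",
"                  train_dataset=emotions_encoded[\"train\"],\n",
"                  eval_dataset=emotions_encoded[\"validation\"],\n",
"                  tokenizer=tokenizer)\n",
"trainer.train();"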
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FVTp7EMlYHZt"
},
"source": [
"Looking at the logs, we can see that our model has an $F_1$-score on the validation set of around 93%, a significant improvement over the feature-based approach!\n",
"\n",
"We can take a more detailed look at the model's errors by calculating the confusion matrix. To visualize it, we first need to get the predictions on the validation set. The `predict()` method of the `Trainer` class returns several useful objects we can use for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"id": "80C4Ag8BYHZt",
"outputId": "93f1a77d-3a4c-4063-f1a7-263190973690",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
""
],
"text/html": []
},
"metadata": {}
}
],
"source": [
"# hide_output\n",
"preds_output = trainer.predict(emotions_encoded[\"validation\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zLyiD7mcYHZt"
},
"source": [
"The output of the `predict()` method is a `PredictionOutput` object that contains arrays of `predictions` and `label_ids`, along with the metrics we passed to the trainer. The metric names are prefixed with `test_` by default, even though here they are computed on the validation set. For example, the metrics on the validation set can be accessed as follows:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"id": "2iYMm8hDYHZt",
"outputId": "6022829e-b65f-4686-d26c-416339174a91",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'test_loss': 0.22519247233867645,\n",
" 'test_accuracy': 0.928,\n",
" 'test_f1': 0.9279227556172475,\n",
" 'test_runtime': 4.206,\n",
" 'test_samples_per_second': 475.509,\n",
" 'test_steps_per_second': 7.608}"
]
},
"metadata": {},
"execution_count": 59
}
],
"source": [
"preds_output.metrics"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Uyr0k7NVYHZt"
},
"source": [
"It also contains the raw predictions (logits) for each class. We can decode the predictions greedily using `np.argmax()`, which picks the highest-scoring class for each example. This yields the predicted labels in the same format as the labels returned by the Scikit-Learn models in the feature-based approach:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"id": "I5HTrSmqYHZt"
},
"outputs": [],
"source": [
"y_preds = np.argmax(preds_output.predictions, axis=1)"
]
},
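{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the greedy decoding step does, the following pure-Python sketch takes the index of the largest raw score in each row, and also shows that applying a softmax first would not change the predicted label. The toy `logits` values and the helper names `argmax` and `softmax` are made up for illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def argmax(row):\n",
"    # Index of the largest value in the row\n",
"    return max(range(len(row)), key=lambda i: row[i])\n",
"\n",
"def softmax(row):\n",
"    # Subtract the maximum for numerical stability before exponentiating\n",
"    m = max(row)\n",
"    exps = [math.exp(x - m) for x in row]\n",
"    total = sum(exps)\n",
"    return [e / total for e in exps]\n",
"\n",
"# One row of raw scores per example, one column per class\n",
"logits = [\n",
"    [2.1, -0.3, 0.4],\n",
"    [-1.0, 0.2, 3.5],\n",
"    [0.1, 1.7, -0.8],\n",
"]\n",
"y_preds = [argmax(row) for row in logits]\n",
"y_preds_from_probs = [argmax(softmax(row)) for row in logits]\n",
"```\n",
"\n",
"Since softmax preserves the ordering of the scores, taking the argmax of the raw logits is enough when only the predicted label is needed."
]
},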
{
"cell_type": "markdown",
"metadata": {
"id": "woGFjc1DYHZu"
},
"source": [
"With the predictions, we can plot the confusion matrix again:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"id": "YkWeB5huYHZu",
"outputId": "55da1e5b-79e0-45e2-a2e3-a1d0b784ff0d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 551
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"
"
],
"image/svg+xml": "\n\n\n",
"application/pdf": "JVBERi0xLjQKJazcIKu6CjEgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDIgMCBSID4+CmVuZG9iago4IDAgb2JqCjw8IC9Gb250IDMgMCBSIC9YT2JqZWN0IDcgMCBSIC9FeHRHU3RhdGUgNCAwIFIgL1BhdHRlcm4gNSAwIFIKL1NoYWRpbmcgNiAwIFIgL1Byb2NTZXQgWyAvUERGIC9UZXh0IC9JbWFnZUIgL0ltYWdlQyAvSW1hZ2VJIF0gPj4KZW5kb2JqCjExIDAgb2JqCjw8IC9UeXBlIC9QYWdlIC9QYXJlbnQgMiAwIFIgL1Jlc291cmNlcyA4IDAgUgovTWVkaWFCb3ggWyAwIDAgNDE4LjM5OTM3NSAzOTcuOTMwNjI1IF0gL0NvbnRlbnRzIDkgMCBSIC9Bbm5vdHMgMTAgMCBSID4+CmVuZG9iago5IDAgb2JqCjw8IC9MZW5ndGggMTIgMCBSIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nM1ZTW8bNxDlmb+Cx+SQEYffPMZNIjSHoHYE9FD0oCqya0O2aymJ0/76DldSRFrLDWUFiQWsvTvivvfIeUNyV6NX88+Xs/nZ+ET88p6PdlezFUdxRceFkOKKjnuBYkzHBZd0dc0NBtAxam/pcpFf6ughaumUpbgsL//m/JyPXhLMim4bc+4DWNvdZhCM7YARMGahRR7S3oCMFNvd+DXUgd+Jh5BaK3Bm+285F7+LGzF6qZIG6iMd90mLKHt/t70hdVdrDdrtIc+uxehXFK9uxSk/FXdbUEmDloAlhA00RThKB8oXXc1CmgYpdZWf0Jjf8zv6K8ULSTgBIaLCQI2UBa0SLT+Z8NEbFKjE5LxLyOQD/0M8Yys2ZR/YDZvT2eq5+FNM3vLXE37KOy0cHYLHUsMuVNeAVkFQbRKu2C37d59aoQcsM52F6tRKGrBeG91EviDyz2zeQ+8V2FjS70ID9NYC6tBIP6Whv6DBX+4L0CqA1IWALFQXoNGBDKl6vk1/TtTTXvKgwZTey0ID5DaCRquxiX7FPrEl+4eOSzovklAkVIKK64RG8EOAvz0X1PuovAw6fShEnVPUJ739UGhOjr9kM/axOxPkgSn7i84X++OQTRrGgfY0Dt6CLEK9Lkj2Ty0QgvTHVOBOgYqS5sVCwSbUp8Bq8D50eQgk0w2LqNRgRq4tmFiSr0N95GlguuLXMmXjcRW4I8cgibQg34R6PejAdU2otQnD5AP1l9ErAw5L+nWot+8e3LoAkIwnzbCAagXu6F0ELI23jvT6ToPquB3p+6bxBqovgb1IsFRPqYox6Ymywx7AnFABGluW25J45rUyU+Lter/QLX3lirq/1Pes4fz9/gbguncDQC2bNg9Fu83dVUSZ9EN0NOF6YzGd09yP2tM6QEO4vNjtGOqtzsZ8ECPPcFRfV3eaEskRw0mWDFhkrvSXXINT8UrAEKiQVTSJ12Ak1yTypLzWkBcNz8bVhg8Rix2Cof2fPawjkmF1jaDpLh6KJmtotIop/b20aaRa9IeiqRqapxE+0AD1nmZuot0aTWkHd3TY/IOmz7/lg6bPzdKqMzLz9Fyf+7R9wE2DT9vRal7IffpIM1R8ery23Kf0lIX+mBIq3NSIhqW0H+L6wiyNOgPzT9D1uU+b09cyn7ajVV2f+/RotMyn6B143QKmG3zajtayRLaj/QTXZ2Zp1fkk5/rcp8enL/fpI9NX8alCerz+bj5tRmvyaTNay1auHc3+cNcXZmnUGVh4gq7PfXq0tTKfOnqsPGofntu0Gcw3uPRoZblJm8FatnGNYD9jc5P7pFFmeCCz+6lh/a68SDM9x2P384FOL4liDmvAZMAd8jt2m96/oAftcPPhz9g1m7IFu2T/bV4SzqjZDTtnn9iKoulcdG0+du9PvpAwvhV2yv8Hs6LqvQplbmRzdHJlYW0KZW5k
b2JqCjEyIDAgb2JqCjEwNzAKZW5kb2JqCjEwIDAgb2JqClsgXQplbmRvYmoKMTggMCBvYmoKPDwgL0xlbmd0aDEgMTI0MjAgL0xlbmd0aCA4NTYyIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nNV6eXwUVbb/uXWqqqv36k531k7S2RO2xI4BwiItQljFIGEJipNACDsJBFCITgCHRAQnQSAIIkQEZBMDMpBAQNCMisioA/h7PnUYBLdnRGYejmNIbr9T1QmLzvxm3l+/z68rt+reqruc8z3bPZUCBgBOOongHTJocA4MgDEArBvd9QzJfWBM1I2UZmoPBRCeHTJm7MD1X+86D4Cf0vPr99+bN9Q4IPM3NPgDav/tgTHpvhk/lawAkN6h9rgpcwpLheGW8QCyl+bwTlm0wAszorMBDH5q8+LSaXPm3b1oJoCR2rB3WmFZKRjoAGMLtS3TZi8uXnHuP++i9t8BojOmTy0sUia9UQmQkkrPe06nG9atho3ULqB24vQ5Cx57a1nIy9QmmuDg7JIphauHrSV6U3tT++E5hY+VipvkedReR23v3MI5U1O+699E7UNEz/nSkrIFV7+/HA3Q5Sl6vq50/tTSvoa/ULWrh3iaDhpWFgj+BDoQUuheKnSjuvbMBD2gHwiDckbmgW124YK5EE640i8QALhZ03qyWVPnzwVFq+lFpBm0qwICO6T1ZO8xTvO5Afgpfdx2/XxEP5+j8mngCJVPtVoHRRCwai3+Cqe+PNhfH8u383P8Fbqe1lun4N/6BecNzPrXfW6n4d+f+V/0CXI9qKN1jng9d5OmT7X2zZ46V4HC/w0FnVQEbsMi8Atc+L/LFSNJqeCAUAiDaIgBLyRCEiTTfe0eu9lP05hbPxEkOssdLQPJ3qhrQsJtI+78aeM1LZJoVLC/iVpmulqAZA82sOt0OCEEXAAdOrkF9oFL18klhfMLJ0NN4fw5c6Fm8vzCGVAzpXBuGZ2nT51P58XzZ0PNtKklVJ82f+osqJleOJf6TJ86me7MKpxbCDWzC0u82pl0+zdzChdMh5q5s7Q7JdMK50DN/IVzqeeC4rnT6Dxdm/+f6L+O2uwZ0wrvsAERgjbAoKd+lYgvF3ggXrcxJHSjoKt+jYAuhEYUpNE5gp4KdI/REdbB85/ha8iFfhYwfKYDt5yWLyDgLnRAX3AnsJ1tNr2jPvbfELqN5l30r/vBxGBf7Xqzftscd9yfeFt7/q26QP7rRjXV+90cmkpcmoKTBM+ijdVoGiVlSuQRWUzwiv8HigUnzWCWERVREMSbIzp+ucWDi8BPOrtEdnEX22SYwy7f0Qc7CvlAfcXV1GJ6W4RSuoaTziFJy0veLxtGEe55MA1m0rNFsESXvPYko+NJIcyA2TAfHgsEApcDHwUuBE4FTgaOBw4E6gOvBvYHXvmn2t/pe9d1tMz6zMGi0dejo2gcZlChOKNb2KiOYteY7SgqlbyO4qCieY9pVAgpog+IeiC9AqIUdB6jqGjyWNSx3mNUllBJYlnQAGfoOAV7YDPbSa1iuj+P7tQJB2EFLKQ7b7IzbKXQne7thGtwjnpWwRncIwIbDpl0F+BjSYDrLA8O0RzZzMWyDTKZwyjxkPig2CB+JZ6FXmKZeFYsEMtYJm6Txkk7qWTj70m2pyEWGthFKIOj+A1mYpM4SLTBRTyLe+ALWkXD4wxUw3YoJ1pcrAQqhHLhQbrztnQWNtFRQs/Psi3sHFF3lD0JF+A5FIWhsIVdIL7OwN/gScwTKgjmTKGY6H+b5jpL4zdBGTmxC8wEXOhK94h6Wmuyfo7G7tIF/bgGFbRyHmyXG2SXIYFW0RDbyd5kLfJaqINz+DDOw0/YCjFB3CUOheogAlgA1TT3Jm2MXMwWE+/aUa7NLjwqFrA98I1YYJhMc/9e44jWPCQ8SBwVQxOVR2WVeOrLVuBKolR7Gg1nDcPFdBpPMxieIK4BSjCLZF1Cz/fDQeiOtVBN
M+n8yr2kv9HIzeIl4rmaPSP8Dc7iIPI4xeJVwlpzsLUARwyyJKLAoJtXrReShhXV+0dP8L6TH9e928+aXtXgrYfceutib0MgkDtBjJLy6yVPPSYp9WJSwqV/9vBS924jcid469sHD+qYdXDBILo3ZgJVtRbdpvuDB+nPtEXrpST6G1ZQ750y3fu0+nRCn6fVqX26azYjaDsMsknNgj8hI98PnOqOI/CkwCIgXFTbWu6C9JaMXpnuhE/OneOcRlQFLovVJEUzedcEf4hc54Q6yxrn6nCjxx6DHndUOI26TuPUK9db1KsZLF5wqM5Mn9OhCik+cKiQEK+dhVWbX3iB/l544QYz8h9v3OA/MqOUy8/y96icZZl03M0y63gZr+RVvIw9wxazJewZzRddIrOeSN7bBH6/eyDWiUKdtMwAdUYlVvYgxDKzen5EvT1vQiN19vfOb2kOMuK73nK+JYOwyo9nh+xoF4VJveIcUlZSpiPOHcfZcL6RTX2XDW/bvkcsG9owtPXCHpqA5CoOJ449sMWfEhEZheEehySCQ5LEgeqLjnXWOtcakewbVJPATJ4wFeVotW1EvTtvRH1o3kMj6l15DxElGDjZO7/5fMvJkw5ndgc113VqDKr0nUH6jtV7VEdYNtHm940Vx0njDEvEJdKiqKoIA1l/hBhJauBZAIvkhZFlUQs8y6EyYnnk8qjlnl2wK8oxCSYlERNZPaHXPSzr7uSEeNmQdQ/L9Ilul2yQgVzOqbaRBGNm4f0vV/7q3GNLzk/4mrkGPxTBr+/Zs+dRtqbPnA3DHq0deN97d/m+fuPhHaXR/FvifjPJu4y4T4VSfw9wh5gqjbGV3pA6t7XOuFb21HnXJqyRV7tfSgv1hAC6IjzJXtWDrlijnKaBEJrXyb9R558AuN5CXBICasuV61da1C+vqvpBqGQwv7EopjC20FsUJ8IkFsPcLjEuPjklK4YY6UlcdWVZwcod7OGANS/xD/jXj7w9M++dOSfebtyx//D6LS89N+bE/LLT+V8yy28xKba55rO/JiW9eZevtvo363c+WlpWnph8yOv98ODjezVLKCIpbyedEmjvtMwfzaxoBUTrQECzoU5iuMzILCbwyIposamfjqg3E2NWnTGLxtj5fs0tPocm1yvn+7X4iBddsOJpEu5pTaRdzLRJGQr5FFAehafBEMq6QjLrij3ZKPaA5QHrOFbMFrIluIJZSZRGFoeZDjI7R4IjLgtlLjCexS9cON3+iJTUdhnPtmXu4nWs4E2S0BaSUBFRHg2P+BPESIOjUo2OrDO46tSVVqEOlllXG7bHhHmYCT1gUuUYtY3dLhdVI7/DWlTNWkhEavNVzYA1Cybx8OagdEJIvxwa5uB2wR1i0aTxGUa013Wb0K2VJfLz/PtH3pw+8eSsV95995XRL+ZJF/bwZ+12fvW//sJ/8HrP3JVxePPmw4nJhHY1UV+r+5NEmOBPDJHBWmmBulC5zhO6Q62zrIxf41mdZIk3eiJiQjwYFxuVRA6GlOiK7mKutF25pT5+F8VedlY4i2fFM9IZmfg+GCNMYpNYvOx2hQZpZe4eLCFewE5GEryaO4rzhQrbn9q69SkqzDjy+ZHvnLP3PTjrEpP4tc95O7/KclnUyOex79FtLx479uK2o8LihsRk/lf+/fhJ/Ptvv+T/pTuoyWxHjOahdpE2TSeZyDDFHy45BBTQIZK/kEgeKCETGcgGte29ZodmCem3+QEqpC6agCYcp229nyajJJhk5OjVO9/vnCAwGSOlbGmoNA3roV42kLaQYFgCi9uFJ9s/P8d4e6Z0YVzrMqmrthNaRfiu0vFNgHS4z58UTuimyHUx3euca2JWp7yUEW5J7OJxJ3rsRvLe5MLtcVEZaltzy/XmFh3YTlvVW9lkpLeBmdSDfE1ipi9UczK6uSbEJ2bd3TOkswNphrCqZseOmpqdO/iO5Wsg8KeLfM2yZ1/iP/74I/9x+9A1Ty5fu3b5k2uE32+q
qtr0fGXVpnHeg0tf++CD15Ye9Ma/Vf3x119/XP0WK1ywfPkCKqQxy4ijKuIoXNeYBENsBKuEiDrTDrEOVobG1qlrQlcnGTyeuJAYiI/3WHWFIfI7Y9KX/IdOfQltjngj8mTUSc/J6DdimmMNe5xNzm+cSBrTS9dtZ4iNdAWy7obMoJbEJ7NOtgiDSyM3jyA96XNw9p/5DaZ+zpA5+AH+xcjN7J4OXYolLWFW5hz3MLN/+yUL1cPZVv5QjLChU5M0fTmt79QvUO622B9jcDCBCQ7y2gMNAsKriiQzg+ARexpok20iZtpagmpDjia709F8p5X8+IOxZsYm+TN6Cr0NQ4UhhhlCsWGpYJCZUXazSDmHDZPHswnyVDZDXiyvYE/L62lPtdWs6lrEHBQImSPhNFOF2mZ+rX1ms3ThRqx4qbWreOlGLGFPdIrPEJ0K5ZR9/OFsvQrrjcucqkmhVEKKsA5wgMcouohGX5vm5nXdppjrN9vdse4B7l+5X3VLZI+ODu1IitMcuNiVFmVr+TObNj3De7N3bjDGAzf4u1J6+/vPVlU+u/PyJ5993r4LWKCV1v+G1jfAcL9NFtbDMpH5ya/5JUU9T75AX8+XQfZj0uxH0e1HAaXTfkLAGAsqsRdrUI1+Y6lxq9E4CTW/Rn5WFr9vv3qm/Sp5q9YLmvUIUE661p12mSbK4Jso+seaw4w22B0mN9oc3srYo57GhAbH6jALhGG41aiYY1FxDU4m9t873+LzBaXUfOV6G6ndW7otObI1vZubEZ0RkxGb4c2Iy4gfkOKP9sf4Y/1ef5w/Pjc6NyY3NtebG5cbn5tSmrIiuiqmKrbKWxW3Ir4mpS7lWkpM59DOQZ0DCmIKYgu8BXGlMaWxpd7SuKUxS2OXepfGhd/uAfuzXo6ELE2tk8lOM+Nuj6WhwomL+5aVbGxsaBjQ9NS+M+03mPDyhoLDeVNPTPzva0Jmcfnkso8PpY1sX7anuPDUtuMnnRWrevTYk5LSpmF1lLDaLrvI03igtz8CGy12Y2O4e7W9IWpDBDidQ8ItshKZE60ph++6viG7ovmYt65mHC6IWRpTF4NEZ6dtEalMd9G0byRaUygUZuIXLz/77Mtaaf9tnwPl70Eg8F75gT6NjUL6ma++OkNFeLCokDfxv9PRVFi0i6hhMC9wGb8iGUbAAH8UVLKnRFul9SlTo0NsDCPhRRqcVhjqGhyptl3xdfoIfv2q+sNVTW2j1KilUTVRdVGa2uqho4O6Xm7dNwRjB3416oXc195667XcF0bdv2NSO/+IdWfy2G1i1r6uXS+fPXu5a9c9iYnEkI05WZ8EQouoEicSfWoQrchGsLkaJWW1rYFtwDARFGGIw2keHK3vpn2+m2g134EWGWxQmLTJBhIgu83v4raGhj4HHj8TgMCZxw+0v0247dpF2OFh4ZGfWnYVFbJBTKFjUCF3d8DXQVcFoeWiLLfUnwhuZqxUnpLcu5nUaGHHwhudDZbVnii3oLgVGCE47YM9OonN+q5WAy8YkK8HPWzagOjS6LroD6KvRUsDYAAbIAxwD4iSuhnSlXRjN1MJlLASocRdEmWcNE8DOE7fUujY6h6XFMCgg24QK9oOWs4emfn25CkfzOLX+dssre1zZmgQdjy1qdEmPDLxxNt3372/SzfWm5lYCLuPf9a84dD+LZp3TSf1/DthHQL5fo+kMouyW2ZVsMEmN5mEEAqvRkmx2s0jXdqO1aRthMzaRmhEvU2vaxukfs1t/ZqbnbpBXyHfpl6lPR7FwsN+d667zk0hw01ERrOgK0nIytSMS/h7/ZT7WTr/sLG+fv9x2bUxd/qU6rZ0/LB61LG9GtZ8nDiRsDbT7nq4PyHCEm10VoaENtqxMTmhIaXJ2Gg/HhmdHAGKZYjsdHoHp+lxOagOzVeCCsEvaEhnk1Z0WdqlrsvPbChMFW7t1PqzDlWhfCw0LCsTt+1Yv27HjnXrdzRw3lq4b/ToLQ/+7lD2wcf/0Nb2
h8cPZjcI/d/59NN33v7002/55/yb6JjXunU5/vpDUyazPkzbw/SZPGWPZvmnCOTFhC+SX+5KfvmEeACaBIkpIuQoahuFKQ21Ni0MaF4311hAnpfsKYS0V9vknmqgn1hwo052fUPzBT7h4/T5zGCHQX6PWTCA7YTFUCUdhybLAVVRJfkBK1MskKPqs1/Jdt7ao+hg0EIOvyPXUeAodQQXcskduUNwwZd+l3PXjJH6qqs/Orm5cKOc+g1pyi1O4o/ABoEpkHMzC/ZbVckv5UoFUql0jUKyTj6RLrt+atG0jAK5IZqkGQ8T/cmy0xhuBzna4LZURXuxIaopQjWAw64ocq5Dsed6wskVJuhxso0ipZ4R9ut35bqeRjjDiIOQjMTcxNLEmsQ6Ol5PvJgYSDSSbHU/7SbcgkK+vZLp1h+KaYNPLn/1ROP8hdU7G+c/+szOxsYB9YuX7MWVjy/64fP2h4UtL24+sb29Stiy7fnXX2qvEgv2T5v8eAcHYhFxEAI9yX8bAW1MrrI5GixNJiYoMEqLcDl6cNfddz9N/3RiDxW433cLmj/6B+QUNTz++Pp9jY0DX1t46i1hu0bA1i0aAbTw1KLvO3zOQt0OwsgOQuRGJzRaGrR3Ck77aHS6B//snYI/YUBEOZTLFYYKpcJYYaowl1sqrBW2CnuFWuEod9ZFXItw3Lnrv+PVQ9m6fXvXr923b+015uRXr/2Ff88cePGr06e/+vqdt7/ZzN/hLfw7cjDZ5EdcrLcW28hStxOFmre+xx/V6a0bbKvZcWyKJk89RPfZt0U3Sp86HbbfGPTYf44R2aSkm9B0hLY7Ql5ZY+OtyCb07ox3u9r3y6Y9t8U29m2nyw7KDYcTdQ6gFEg2k56ZscrWYGwymGTa/OQ4Nbeh2yD56fPvaY75UG7I1hBNYsGIdktcYTg8dli3zS8THUdXhPTw4CGn48yJ9oMkrOIpkkSrlVA8fZtWS4Gv/P2sFsFmHhMboxgFg2lMbGzMQJM5JlZ0U5xdKboq3SvDtTibRHE2NcZkjo0ywINRis2guOIHp2pUnW+5ohlsdnZn4P1BC7yaTunbWtt3tIk06Gfa20KKtred4zF5zB5LDwof3czdLH2NfU19zX0tZi94WaKQako1dwlJd6W7u4SmxqTGpnnT4hJTKk2V5kpLpVV7o8sEQTbJZrSgFW1oRxUjMBKj0CNGG1PS0wak/SqtIm1pWk1aXdq1tHDaGM+7Ffdj9XcScsLtyW8605KgnoQdrhq1a+LKlZPXDWje8eN/THxzdvFbhctXT93r3/vcn/9QfEgcsD81NS/PPyzO1mXjys2HExJOZGXljx6Rm2RPXL98yz49c+xFDvCv0hayQdoV2CTFjrvBwZqUKpOZMCYdU502zQb1gOTT41FL0GuQ2zv4qptpVqhFIVdoXy0mJWdp0cjBHmXlfMWIsuPHL2yrqpK28Deq2+tWjtq09Y9CQTW7R/Pi+8kKJ+jW74K+fs8t+19tYk2uBgtZv8s8ivxAjlszx+ygRl3x3XQCJe6TmhMIoU1J0Oxu7k6S2X7NCbzS0HDfgYWn3mHvs6PCzvbCrVtPbBfKb9TtK55yDXdp3PcnD1QhFlDefMOfoqfMAiVDknZBQQaZOQDkgZQUvS7JEgpMEsGgvevTgzYEg7YrT3v3pr3aAP2VU1jwbVv2rRRb6Sh6qu3/7VBhplAuVAiVwlJhjbBdULSFjGgkLaacCSPFZEhmaZgmepUsyGJ9sI+YoeQA5VI4TMyRhsp+ZRyMY/mYL+YqxVDMZuAMcZo0XS5QFsICVo7l4kJpibwCVrCVuFJcKVXKtVDLNgib8DnxOWmDvEt6Wa5XTioXlYByj5aJZRpZJkvo/yZ7hD3yJn+4VSxoy8N9N+oIob6E0GJCyMzu8+dIWpIoOlA0aBdJpLwRHYLAzA7tv5AOo4lpF7PJoBiMDkUxDDQZRCYqhJ7QUSNmLUEAtVdAD42oV7WTQ4dP7sRTqwffXjYH
czqKuWH/EM9/hO9zJlE0RYpuU7Kpv3iXaaw43jDBVGxaxJaIiwwLTM+Iy00bxa3iBsOzphrTTrZbfFXcYXjJVGfymFCUJKPJHIluyW2MNKdhspRk7GL2WvuwbOwl3W3oacw2Z1iHYY402Djc7Lfma3IQ8nG8NE7ON4xTxhnzzbnWEutjrML6PFtn2Mu2G+qt71svWgPWdO09m5BgZPRHgItFfBbb8zE/yo9+zF7j8z9maSxNLGi/2H6KNfChwnAhlM9j1bqWUjTQtNTOVvnvMyiC0QF2DWYAu81hB7vVYbGCdrFZTWaTxWE2mwZazUYVzFIVHreZm1Sb1WIyygiKXbSb1U4BKDrs5ttgNwdfmuqoq2RqLR0Z5D+EXtHeIof5NMyvySApshGtoaYwq2pNsGZZh5keMI2yTjRONM00VVmXWtdanSYgIsySxWwz28OYW1BFVQozucwuS6Qt0p4CieRPvaJXSlNSjUmmRHOiJcXaxdbF7nX0IjvIEjLEDKm3qae5p6W3NduWbc9w3At+5hf86Bf9kl/2G/zKQONg0xDrMNswu9+RB6PZaGEs5oq5JJ+xJJ/xxvGmseaxlnxbvj3XUcyKhemmGbYZ9gJHufKY7TH7SnjauMK8wrLSutK20r7RuN683rLJtsm+3bzdste2117veN9x0RFwTCVZSjYW3OINYJo8M4W1o9Y9vnb2yLzMON43aErT31myaWhlnjiqbR3O1iQ5gSLnJyRJI7zgj1SC7+nIWAYqu6EJd0sKMhCZbOp84WsJ2kXwjWlIXqeNNPs63uO1/OJFnn+gJIQKycIQYZhBMit2czhGKV0Vr7knZisZZg2twTpa9ynjMV/5lbmAFQjFWCAWSJOVCvNS86vmqI43fNrbfRY3D2e2jxQOtT0hHGqfKhbsavtk7S5M6shypD20d02Eof6QZD2pscSFW2MUhyVOdY1M0mKGT4saaj8tl2m+C/wOo9Wx2ylEVkH4BjnW2WS2p/f70ufj/a76KMHxZdyR1NxKbPTAYtAeaHFP2tOZ5XCDnujsr5+Sksx+uiPj6cx6NqamTp8SzH7Stf+rEL1uyswf9DuiciBMCbW7REXBUJM8MvIWvbwf7af8ToVCoVplCz8ResC2wQhNEtOovcr1978+2vMHKE2voXRd1ZP1X5JMFDPa/YvDg5S+8lqjRvlPjY1a7tFJ45FXNaLZwW86MO2gsZffHpZDGY3Joiiq6LSNDNXoC5KnUUf5426jSEmlw9hkFTTCuE4V06H7eWpI3gQa+ZFb2aGWREjlP8sPaXW5jVZPg+6UU4Sn54R1VbqoUW4lsosRYmUlMcYYnzyyxy2gmn3auU2HKywqNmF3ooPy3O4nuhxQYUOoIbEpIjpOSzN8Ps2XqC0++gtKuUOavXr2uglVp8xvS2glQk9LajXxjvekjHqSRH2/cFCDs0P2SGiS2INSHusJSdHA7AS3kzWhE1udu2QY5Q9NzbEqami4S1GN2j9Z4qKMsQkjU27jTGdMV4Nwj3d3nEOosiRvcBvimuyRMUGWrvf7JT89M38G/8+y86Ce3i6LDj5u8rD3drnclI32VYKQvzHhUqT3V/Z+P0Cson+E8OGvbVc6rz9+1DbSlm/UvhpSbn6zQOMMc3g0gI3/+FHraFv+L75v6Cae1b8ZAEFLrVfp5RMpDKqoXKJSS2UzlSIqW6hUU9lFZRWVZVIbnBbPUbkcaBW/gnLJBUfFYphH13liC8wTPoR0rS5kwykhO/CJdjWcgaOSk/p9ofc7isOp3hVKMAF60f39YhP0p9JXv2bCBOlJmsMULIYmkmL6HfS7YBw8wyxsKftM6Ce8LLTgWHxdDBHHiKukOGmO1CCL8lz5qCHGkGN40vCiIij3KNXKn4xuY7OJme4znTfXWsotl6zZ1lXWd23Ztud0hLpjHnSF6WAhq1Bho4ao6BZC6ap9z2CAidr/sEUjAZyhf/+h1RmEUitYF0BhOR11vO2+eFtdgnA2qqMu
g4sVw31QAqWwGObDDJhGqy8AL6TCFLJHL/ggg45Mqk2mHl4YSH0WQBmV+TAVCmEOdKO7w2Au9e9BtXthNh1eePDmXGV6aypdp9KYRXQuop6mf2PVnjdX1b6gWURraV8tzKXeGh2FNOZ/t+Igqs2kceNgIfWYQn0L9dmm6iMKdY68NMtcOpdSn8k07wzq56XxJbR6of7s5/OM0WcpI4pK6JhFd7VVy6hviT6Tj9bOhKw7RnWOEYLKFPi1/s3TL3/ddVvSvoBz6PmKW//aLpwiSSREUY5+F83ci2YeDDkwBIaSHIbDSLgfHoBcGE0YjIGxtN54iv/58BA8DJOYACfhlPZWy7Bw7oycgRkZHdfMjuvdDcJSf+AGx1YX/pSEf/fhj7X4Nxv+wPE6x/9Owr/a8C+1eC0Jv3/6Xul7jldr8btabGnFb1vxvzh+0we/HohfcfzSh19cGSN9UYtXqOOVMXj583Tpcit+no6XOP6Z40Uf/smFn9Xipxw/ceJ/PoEfH8P/4PgRdf/oCbxwfoh04Qk8PwTP/TFKOsfxj1H4IccPOL7P8Q8cz9bie2dipPc4nonBd314muNbKxzSWx78fSg2c3yT4xscT3E8yfF1jic4HufYxPEYx6MObKxMkho5Nhw5JjVwPHJ4knTkGB5ZKh7+XZJ0eJI/gIf94u+S8BDH12rxIMcDHOs5vspxfxG+YsN9e5OkfUW4d49T2puEe5y4m4je3Yq7OL7McSfHHU7czvGlbTbpJR9us+GLRVhHXepqcSvHLS9YKDPFFyy4+fkIaXMRPr9JlZ6PwE0qbjThcxw31FqlDRxrrbieBq2vxXVrbdK6VFxrw2dbcU3NMWkNx5rqSVLNMaxZKlb/NkmqnoTVfvG3SfgMx9WrekirOa7qgU8Tm0/fiyufMksrXfiUGavoRlURVhJSlUm4woG/4fjkcof0JMflDlzGcSnHCo7+wK+feEL6NccnnsDHi7A8zy2VJ+ESjos5PmbDRy24yIQLOS5oxbJWnN+K81qxlGMJx7kcZ8fhLI4zHQOlmWNwBsfpT+A0ahRznMqxiOMUjpM5FvbBglZ8xIKTOD7EcSLH/AkmKb8VJ5hwfGiENN6H4ziOpZXHDsQ8N45hqjQmHB904ejhIdJojrlmfIDjqPtVaRTH+1UcyXEEPRnBcfgwVRoegsOirdIwFYdacQjHnFocXIuDON4ndJfua8WBx/DeEejnOIDjPf2d0j0u7N/PLvV3Yr++VqmfP2DHvlbswzGbY+9eLql3K/bqqUq9XNgzyyz1VDHLjHfHYKYVfXeZJR/Hu8yYkW6WMqyYbsYe3Y1SDxW7G7GbD7t2SZK6FmGXNKfUJQnTnJiakiSl3ospSZicZJaS7ZhkxkSOCRzj7RhHfMY50VuEsa0YQyzEFGG0FT2EoIdjVCtGDsQIakRwDC/CMEIqjGMoDQqNQDdHF8cQjk7q4OSUI3eXHANRfQLtRWjjaLWESlaOFuptCUUzR5OKRo4KdVM4GlwoF6FID0XSADfSXeQoUFvojkxF4MgaWNGKZ1jX/x9+8P+agP/rL/p/ADs9e58KZW5kc3RyZWFtCmVuZG9iagoxNyAwIG9iago8PCAvTGVuZ3RoIDgzIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nKWMRwqAUAwFB+zdb++93P+GZuFWEXwwjwkJgZ/RXna6YGBiYePg4uETEH7+HQnx7erxKiElI6egpJK5ppFu6egZGMUnZhZWNvGdg/MCgMUDCgplbmRzdHJlYW0KZW5kb2JqCjIwIDAgb2JqCjw8IC9MZW5ndGggMzU0IC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nF1SPW+DMBTc+RUe0yEiEDCNhJCqdGHoh0o7VRkIfkRIxViGDPz72j6bSrUEpzu/ez7wi8/1cy2HhcXveuoaWlg/SKFpnu66I3al2yCjJGVi6BbP3LsbWxXFxtys
80JjLfspKksWf5jNedEr2z2J6UoPEWMsftOC9CBvbPd1biA1d6V+aCS5sENUVUxQb9q9tOq1HYnFzryvhdkflnVvbH8Vn6siljqeIFI3CZpV25Fu5Y2i8mBWxcrerCoiKf7tJ9527bf61NYDvoEXJxNk8jJBPqL6eIIcaAJIAUdABsgBHFAAHkMb1zXDYZk/LPOH5eie+2i5j5ajsYVvoJM5QvACcqDIxJGJw8qRifPgQIMTaOsbbBS7naOFD+MpFwB8AO9DjbMUOLvgsASKKAWiFHmogQU/pvAZAj0F9WIvNdyevV87jNvwdHetzdy4iXUDY0dlkLQNtZqUddnnF8/7ylUKZW5kc3RyZWFtCmVuZG9iagoxNSAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvQ0lERm9udFR5cGUyIC9CYXNlRm9udCAvQk1RUURWK0RlamFWdVNhbnMKL0NJRFN5c3RlbUluZm8gPDwgL1JlZ2lzdHJ5IChBZG9iZSkgL09yZGVyaW5nIChJZGVudGl0eSkgL1N1cHBsZW1lbnQgMCA+PgovRm9udERlc2NyaXB0b3IgMTQgMCBSIC9XIDE5IDAgUiAvQ0lEVG9HSURNYXAgMTcgMCBSID4+CmVuZG9iagoxNiAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvVHlwZTAgL0Jhc2VGb250IC9CTVFRRFYrRGVqYVZ1U2FucwovRW5jb2RpbmcgL0lkZW50aXR5LUggL0Rlc2NlbmRhbnRGb250cyBbIDE1IDAgUiBdIC9Ub1VuaWNvZGUgMjAgMCBSID4+CmVuZG9iagoxNCAwIG9iago8PCAvVHlwZSAvRm9udERlc2NyaXB0b3IgL0ZvbnROYW1lIC9CTVFRRFYrRGVqYVZ1U2FucyAvRmxhZ3MgMzIKL0ZvbnRCQm94IFsgLTEwMjEgLTQ2MyAxNzk0IDEyMzMgXSAvQXNjZW50IDkyOSAvRGVzY2VudCAtMjM2IC9DYXBIZWlnaHQgMAovWEhlaWdodCAwIC9JdGFsaWNBbmdsZSAwIC9TdGVtViAwIC9Gb250RmlsZTIgMTggMCBSIC9NYXhXaWR0aCA5NzQgPj4KZW5kb2JqCjE5IDAgb2JqClsgMzIgWyAzMTggXSA0NiBbIDMxOCBdIDQ4IFsgNjM2IDYzNiA2MzYgNjM2IDYzNiA2MzYgNjM2IDYzNiA2MzYgNjM2IF0gNzgKWyA3NDggXSA4MCBbIDYwMyBdIDg0IFsgNjExIF0gOTcgWyA2MTMgNjM1IDU1MCA2MzUgNjE1IDM1MiA2MzUgXSAxMDUKWyAyNzggMjc4IF0gMTA4IFsgMjc4IDk3NCA2MzQgNjEyIDYzNSBdIDExNCBbIDQxMSA1MjEgMzkyIDYzNCA1OTIgXSAxMjAKWyA1OTIgNTkyIDUyNSBdIF0KZW5kb2JqCjMgMCBvYmoKPDwgL0YxIDE2IDAgUiA+PgplbmRvYmoKNCAwIG9iago8PCAvQTEgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMCAvY2EgMSA+PgovQTIgPDwgL1R5cGUgL0V4dEdTdGF0ZSAvQ0EgMSAvY2EgMSA+PiA+PgplbmRvYmoKNSAwIG9iago8PCA+PgplbmRvYmoKNiAwIG9iago8PCA+PgplbmRvYmoKNyAwIG9iago8PCAvSTEgMTMgMCBSID4+CmVuZG9iagoxMyAwIG9iago8PCAvVHlwZSAvWE9iamVjdCAvU3VidHlwZSAvSW1hZ2UgL1dpZHRoIDQ2MiAvSGVpZ2h0IDQ2MwovQ29sb3JTcGFjZSBbL0luZGV4ZWQgL0RldmljZVJHQiAxNSAo9/v/9vr+9fn+9Pn+8/j98ff98Pb87/b87fT76PH64u34D1ujCEiPCEaMCDZ0CDBrKV0KL0JpdHNQZXJDb21wb25lbnQgOCAvRmlsdGVyIC9G
bGF0ZURlY29kZQovRGVjb2RlUGFybXMgPDwgL1ByZWRpY3RvciAxMCAvQ29sb3JzIDEgL0NvbHVtbnMgNDYyID4+IC9MZW5ndGggMjEgMCBSID4+CnN0cmVhbQp4nO3S220VAAxEwRsSXgmE/rulAj7WWiUrMacC2+PHn2ZPzR7NvjR7aVZd84ETJ06cOHG2wokTJ06cOGvhxIkTJ06crXDixIkTJ85aOHHixIkTZyucOHHixImzFk6cOHHixNkKJ06cOHHirIUTJ06cOHG2wokTJ06cOGvhxIkTJ06crXDixIkTJ85aOHHixIkTZyucOHHixImzFk6cOHHixNkKJ06cOHHirIUTJ06cOHG2wokTJ06cOGvhxIkTJ06crXDixIkTJ85aOHHixIkTZyucOHHi/FzO52bvzb43q96s+hvVyXAewokTJ844nHk4ceLEGYczDydOnDjjcObhxIkTZxzOPJw4ceKMw5mHEydOnHE483DixIkzDmceTpw4ccbhzMOJEyfOOJx5OHHixBmHMw8nTpw443Dm4cSJE2cczjycOHHijMOZhxMnTpxxOPNw4sSJMw5nHk6cOHHG4czDiRMnzjiceThx4sQZhzMPJ06cOONw5uHEiRNnHM48nDhx4ozDmYcTJ06ccTjzcOLEiTMOZx5OnDhxxuHMw4kTJ844nHnDnC/NXpv9ajYsUA0nTpyr4cSJczWcOHGuhhMnztVw4sS5Gk6cOFfDiRPnajhx4lwNJ06cq+HEiXM1nDhxroYTJ87VcOLEuRpOnDhXw4kT52o4ceJcDSdOnKvhxIlzNZw4ca6GEyfO1XDixLkaTpw4V8OJE+dqOHHiXA0nTpyr4cSJczWcOHGuhhMnztVw4sS5Gk6cOFfDiRPnajhx4lwNJ06cq+HEiXM1nDhxroYTJ87VcOLEuRpOnDhX+9Zs92jvzZ6bVdfEiRMnTpw4a+HEiRMnTpy1cOLEiRMnzlY4ceLEiRNnLZw4ceLEibMVTpw4ceLEWQsnTpw4ceJshRMnTpw4cdbCiRMnTpw4W+HEiRMnTpy1cOLEiRMnzlY4ceLEiRNnLZw4ceLEibMVTpw4ceLEWQsnTpw4ceJshRMnTpw4cdbCiRMnTpw4W+HEiRMnTpy1cOLEiRMnzlY4ceLEiRNnLZw4ceLEibMVTpw4ceLEWavKubvnj2a/m31thhMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLEiRMnTpw4ceLE+SGcVYGfzaqTVX/jtdlbM5w4/xlOnDhxpuE8hBMnTpxpOA/hxIkTZxrOQzhx4sSZhvMQTpw4cabhPIQTJ06caTgP4cSJE2cazkM4ceLEmYbzEE6cOHGm4TyEEydOnGk4D+HEiRNnGs5DOHHixJmG8xBOnDhxpuE8hBMnTpxpOA/hxIkTZxrOQzhx4sSZhvMQTpw4cabhPIQTJ06caTgP4cSJE2cazkM4ceLEmYbzEE6cOHGm4TyEEydOnGk4D+HEiRNnGs5DOHHixJmG8xBOnDhxpv0nnH8B/mKX8wplbmRzdHJlYW0KZW5kb2JqCjIxIDAgb2JqCjExNzgKZW5kb2JqCjIgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9LaWRzIFsgMTEgMCBSIF0gL0NvdW50IDEgPj4KZW5kb2JqCjIyIDAgb2JqCjw8IC9DcmVhdG9yIChNYXRwbG90bGliIHYzLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZykKL1Byb2R1Y2VyIChNYXRwbG90bGliIHBkZiBiYWNrZW5kIHYzLjcuMSkgL0NyZWF0aW9uRGF0ZSAoRDoyMDI0MDUy
OTIwMTYwNlopCj4+CmVuZG9iagp4cmVmCjAgMjMKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMDE2IDAwMDAwIG4gCjAwMDAwMTMyOTQgMDAwMDAgbiAKMDAwMDAxMTU5NiAwMDAwMCBuIAowMDAwMDExNjI4IDAwMDAwIG4gCjAwMDAwMTE3MjcgMDAwMDAgbiAKMDAwMDAxMTc0OCAwMDAwMCBuIAowMDAwMDExNzY5IDAwMDAwIG4gCjAwMDAwMDAwNjUgMDAwMDAgbiAKMDAwMDAwMDM0NCAwMDAwMCBuIAowMDAwMDAxNTEwIDAwMDAwIG4gCjAwMDAwMDAyMDggMDAwMDAgbiAKMDAwMDAwMTQ4OSAwMDAwMCBuIAowMDAwMDExODAxIDAwMDAwIG4gCjAwMDAwMTExMjMgMDAwMDAgbiAKMDAwMDAxMDc2MyAwMDAwMCBuIAowMDAwMDEwOTc2IDAwMDAwIG4gCjAwMDAwMTAxODEgMDAwMDAgbiAKMDAwMDAwMTUzMCAwMDAwMCBuIAowMDAwMDExMzQ3IDAwMDAwIG4gCjAwMDAwMTAzMzYgMDAwMDAgbiAKMDAwMDAxMzI3MyAwMDAwMCBuIAowMDAwMDEzMzU0IDAwMDAwIG4gCnRyYWlsZXIKPDwgL1NpemUgMjMgL1Jvb3QgMSAwIFIgL0luZm8gMjIgMCBSID4+CnN0YXJ0eHJlZgoxMzUwNQolJUVPRgo=\n"
},
"metadata": {}
}
],
"source": [
"plot_confusion_matrix(y_preds, y_valid, labels)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JZraZ1WUYHZu"
},
"source": [
"This is much closer to the ideal diagonal confusion matrix. The `love` category is still often confused with `joy`, which seems natural. `surprise` is also frequently mistaken for `joy`, or confused with `fear`. Overall the performance of the model seems quite good, but before we call it a day, let's dive a little deeper into the types of errors our model is likely to make."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qvjzbgd6YHZv"
},
"source": [
"#### Error analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Jgyp_KBYHZv"
},
"source": [
"Before moving on, we should investigate our model's predictions a little bit further. A simple yet powerful technique is to sort the validation samples by the model loss. When we pass the label during the forward pass, the loss is automatically calculated and returned. Here's a function that returns the loss along with the predicted label:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"id": "xJqOAUS9YHZv"
},
"outputs": [],
"source": [
"from torch.nn.functional import cross_entropy\n",
"\n",
"def forward_pass_with_label(batch):\n",
"    # Place all input tensors on the same device as the model\n",
"    inputs = {k: v.to(device) for k, v in batch.items()\n",
"              if k in tokenizer.model_input_names}\n",
"\n",
"    with torch.no_grad():\n",
"        output = model(**inputs)\n",
"        pred_label = torch.argmax(output.logits, dim=-1)\n",
"        # reduction=\"none\" keeps one loss value per sample instead of averaging\n",
"        loss = cross_entropy(output.logits, batch[\"label\"].to(device),\n",
"                             reduction=\"none\")\n",
"\n",
"    # Place outputs on CPU for compatibility with other dataset columns\n",
"    return {\"loss\": loss.cpu().numpy(),\n",
"            \"predicted_label\": pred_label.cpu().numpy()}"
]
},
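{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build some intuition for the per-sample loss, here is a minimal sketch in plain Python that is not part of the original notebook. It assumes the standard definition of cross-entropy for a single example: the negative log of the softmax probability assigned to the true class, which is what `reduction=\"none\"` leaves unaveraged. The helper `per_sample_cross_entropy` and the logits are hypothetical and chosen only for illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def per_sample_cross_entropy(logits, label):\n",
"    # Hypothetical helper: softmax over the logits, then -log p(true label)\n",
"    max_logit = max(logits)  # subtract the max for numerical stability\n",
"    exps = [math.exp(x - max_logit) for x in logits]\n",
"    prob_true = exps[label] / sum(exps)\n",
"    return -math.log(prob_true)\n",
"\n",
"# Made-up logits for a six-class problem; class 0 dominates\n",
"logits = [4.0, 0.5, -1.0, 0.0, -0.5, -2.0]\n",
"loss_confident = per_sample_cross_entropy(logits, label=0)  # true class is the favored one\n",
"loss_wrong = per_sample_cross_entropy(logits, label=5)  # true class is the least likely one\n",
"\n",
"# exp(-loss) recovers the probability the model gave to the true class\n",
"prob_confident = math.exp(-loss_confident)\n",
"prob_wrong = math.exp(-loss_wrong)\n",
"```\n",
"\n",
"A high loss therefore means the model put very little probability on the labeled class, which is why sorting by loss surfaces both model mistakes and questionable labels."
]
},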
{
"cell_type": "markdown",
"metadata": {
"id": "glNWB28pYHZv"
},
"source": [
"Using the `map()` method once more, we can apply this function to get the losses for all the samples:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"id": "Aer9Mt7nYHZw",
"outputId": "e445cafa-495d-4e92-95a5-da5994a04b04",
"colab": {
"referenced_widgets": [
"2ad9558ed9ae40b78e80f251bf430e6d",
"e3a5ee899f0f42d6b6bdde1ef1dde276",
"41ec41d8c0af48a38044620f4533189b",
"250060986caf43a7a9b883af202e8526",
"c54e038f8c164e738ff75105729e8524",
"a0c48bf3053c4363bd84fd6d753c48f6",
"f5c0a3de1bc74f57be7e5e4d3f27abce",
"b4fb55956b4e4d44a26ca801b0e022ce",
"8736622c878649dca98a1f71c33c4513",
"488e1a74089949b7b0382642e47f32a2",
"c4f059c5fe124ec7a1141ae6390fbe71"
],
"base_uri": "https://localhost:8080/",
"height": 49
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"Map: 0%| | 0/2000 [00:00, ? examples/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2ad9558ed9ae40b78e80f251bf430e6d"
}
},
"metadata": {}
}
],
"source": [
"#hide_output\n",
"# Convert our dataset back to PyTorch tensors\n",
"emotions_encoded.set_format(\"torch\",\n",
" columns=[\"input_ids\", \"attention_mask\", \"label\"])\n",
"# Compute loss values\n",
"emotions_encoded[\"validation\"] = emotions_encoded[\"validation\"].map(\n",
" forward_pass_with_label, batched=True, batch_size=16)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AvolYiFLYHZw"
},
"source": [
"Finally, we create a `DataFrame` with the texts, losses, and predicted/true labels:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"id": "5x1aGi-qYHZw"
},
"outputs": [],
"source": [
"emotions_encoded.set_format(\"pandas\")\n",
"cols = [\"text\", \"label\", \"predicted_label\", \"loss\"]\n",
"df_test = emotions_encoded[\"validation\"][:][cols]\n",
"df_test[\"label\"] = df_test[\"label\"].apply(label_int2str)\n",
"df_test[\"predicted_label\"] = (df_test[\"predicted_label\"]\n",
" .apply(label_int2str))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fGLpjMX_YHZw"
},
"source": [
"We can now easily sort `df_test` by the losses in either ascending or descending order. The goal of this exercise is to detect one of the following:\n",
"\n",
"- _Wrong labels_: Every process that adds labels to data can be flawed. Annotators can make mistakes or disagree, while labels that are inferred from other features can be wrong. If it were easy to annotate data automatically, we would not need a model to do it. Thus, it is normal that there are some wrongly labeled examples. With this approach, we can quickly find and correct them.\n",
"\n",
"- _Quirks of the dataset_: Datasets in the real world are always a bit messy. When working with text, special characters or strings in the inputs can have a big impact on the model's predictions. Inspecting the model's weakest predictions can help identify such features, and cleaning the data or injecting similar examples can make the model more robust.\n",
"\n",
"Let's first have a look at the data samples with the highest losses:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"id": "Fz4mrCWYYHZw",
"outputId": "a3fbbc0b-93bc-4a32-bfb7-3bf3a3fc4363",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" text label \\\n",
"882 i feel badly about reneging on my commitment t... love \n",
"1950 i as representative of everything thats wrong ... surprise \n",
"1274 i am going to several holiday parties and i ca... joy \n",
"765 i feel super awkward and out of place right now joy \n",
"1500 i guess we would naturally feel a sense of lon... anger \n",
"465 i would eventually go in to these stores but i... joy \n",
"1963 i called myself pro life and voted for perry w... joy \n",
"1870 i guess i feel betrayed because i admired him ... joy \n",
"1801 i feel that he was being overshadowed by the s... love \n",
"177 im sure much of the advantage is psychological... sadness \n",
"\n",
" predicted_label loss \n",
"882 sadness 5.572596 \n",
"1950 sadness 5.505824 \n",
"1274 sadness 5.326068 \n",
"765 sadness 5.296478 \n",
"1500 sadness 5.103446 \n",
"465 fear 5.033174 \n",
"1963 sadness 4.995793 \n",
"1870 sadness 4.902756 \n",
"1801 sadness 4.854362 \n",
"177 joy 4.770090 "
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_test\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"i feel that he was being overshadowed by the supporting characters\",\n \"i as representative of everything thats wrong with corporate america and feel that sending him to washington is a ludicrous idea\",\n \"i would eventually go in to these stores but i had to work up a lot of courage and i would still feel super uncomfortable once inside which we all know is not normal for me\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"surprise\",\n \"sadness\",\n \"joy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"sadness\",\n \"fear\",\n \"joy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"loss\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 10,\n \"samples\": [\n 4.8543620109558105,\n 5.505823612213135,\n 5.033173561096191\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 65
}
],
"source": [
"#hide_output\n",
"df_test.sort_values(\"loss\", ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "APgmhdPhYHZw"
},
"source": [
"We can clearly see that the model predicted some of the labels incorrectly. On the other hand, it seems that there are quite a few examples with no clear class, which might be either mislabeled or require a new class altogether. In particular, `joy` seems to be mislabeled several times. With this information we can refine the dataset, which can often lead to a performance gain as big as (or bigger than) the gain from more data or larger models!"
]
},
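{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to make the impression from the table concrete is to tally the labels among the highest-loss rows. The following is a small sketch using only the standard library; the pairs are copied by hand from the ten rows displayed above, and the variable names are illustrative rather than part of the notebook.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# (true label, predicted label) pairs from the ten highest-loss rows shown above\n",
"top_loss_pairs = [\n",
"    ('love', 'sadness'), ('surprise', 'sadness'), ('joy', 'sadness'),\n",
"    ('joy', 'sadness'), ('anger', 'sadness'), ('joy', 'fear'),\n",
"    ('joy', 'sadness'), ('joy', 'sadness'), ('love', 'sadness'),\n",
"    ('sadness', 'joy'),\n",
"]\n",
"\n",
"# Count how often each class appears as the true label and as the prediction\n",
"true_counts = Counter(true for true, _ in top_loss_pairs)\n",
"pred_counts = Counter(pred for _, pred in top_loss_pairs)\n",
"```\n",
"\n",
"In this sample, `joy` is the true label in five of the ten rows and `sadness` is the prediction in eight of them, which is consistent with the suspicion that some `joy` labels deserve a second look."
]
},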
{
"cell_type": "markdown",
"metadata": {
"id": "4ByyDxi1YHZw"
},
"source": [
"When looking at the samples with the lowest losses, we observe that the model seems to be most confident when predicting the `sadness` class. Deep learning models are exceptionally good at finding and exploiting shortcuts to get to a prediction. For this reason, it is also worth investing time into looking at the examples that the model is most confident about, so that we can be confident that the model does not improperly exploit certain features of the text. So, let's also look at the predictions with the smallest loss:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"id": "5xDbm5LmYHZw",
"outputId": "c55d2bd9-0f2c-4f68-fe55-1769390558c4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" text label \\\n",
"1452 i always feel guilty and come to one conclusio... sadness \n",
"1466 i feel so ungrateful to be wishing this pregna... sadness \n",
"69 i have no extra money im worried all of the ti... sadness \n",
"1182 i feel broke inside but i won t admit sadness \n",
"1152 i feel pathetic because i shouldn t complain a... sadness \n",
"1310 i feel like an ungrateful asshole sadness \n",
"394 i feel shamed that i hoped for one last christ... sadness \n",
"1861 im tired of feeling lethargic hating to work o... sadness \n",
"1303 i feel pathetic and uninspired sadness \n",
"375 i mention that i feel really unwelcome sadness \n",
"\n",
" predicted_label loss \n",
"1452 sadness 0.016432 \n",
"1466 sadness 0.017092 \n",
"69 sadness 0.017138 \n",
"1182 sadness 0.017273 \n",
"1152 sadness 0.017505 \n",
"1310 sadness 0.017604 \n",
"394 sadness 0.017729 \n",
"1861 sadness 0.017791 \n",
"1303 sadness 0.017827 \n",
"375 sadness 0.017835 "
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_test\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"i feel pathetic and uninspired\",\n \"i feel so ungrateful to be wishing this pregnancy over now\",\n \"i feel like an ungrateful asshole\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"sadness\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"sadness\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"loss\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 10,\n \"samples\": [\n 0.017827395349740982\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 66
}
],
"source": [
"#hide_output\n",
"df_test.sort_values(\"loss\", ascending=True).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xcaRCEBkYHZx"
},
"source": [
"We now know that `joy` is sometimes mislabeled and that the model is most confident when predicting `sadness`. With this information we can make targeted improvements to our dataset, and also keep an eye on the class the model seems to be very confident about.\n",
"\n",
"The last step before serving the trained model is to save it for later use. 🤗 Transformers allows us to do this in a few steps, which we'll show you in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vy6o5jWFYHZx"
},
"source": [
"#### Saving and sharing the model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "glNUBTRyYHZx"
},
"source": [
"\n",
"The NLP community benefits greatly from sharing pretrained and fine-tuned models, and everybody can share their models with others via the Hugging Face Hub. Any community-generated model can be downloaded from the Hub just like we downloaded the DistilBERT model. With the `Trainer` API, saving and sharing a model is simple:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"id": "LYKoiSDZYHZx",
"outputId": "7abb8a73-bb45-44ae-9068-75b734f758b2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 124,
"referenced_widgets": [
"2eb00c32e3a742b68c4a37e2541ca4a7",
"82096e6baf324cbdbe045fad2b415b26",
"a8982178002440ce9510fd83cd207330",
"e6303067dda247699bda47aabd3c7165",
"ed1bc789d4054b5d9e098d4be60a7b19",
"c233f76c380a44b5af869be8ea28dcb7",
"9ce54d513b384fd0b4cae450599399c7",
"4b54335d6cbf486cb764cd34c10eec61",
"ea9f0f850d7b4286a96e3a75a057af71",
"01fc4e40b6ff4286bcb613575cc4283e",
"f78f6c4c7c98488a913ed7ded00e7c76"
]
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"events.out.tfevents.1717013477.75eecaf684d2.2485.0: 0%| | 0.00/6.54k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2eb00c32e3a742b68c4a37e2541ca4a7"
}
},
"metadata": {}
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"CommitInfo(commit_url='https://huggingface.co/osscar0131/distilbert-base-uncased-finetuned-emotion/commit/2396b0076d4c48266ef3674f16bedfc0dd2c1c21', commit_message='Training completed!', commit_description='', oid='2396b0076d4c48266ef3674f16bedfc0dd2c1c21', pr_url=None, pr_revision=None, pr_num=None)"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 67
}
],
"source": [
"#hide_output\n",
"trainer.push_to_hub(commit_message=\"Training completed!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YNYIHvRCYHZx"
},
"source": [
"We can also use the fine-tuned model to make predictions on new tweets. Since we've pushed our model to the Hub, we can now use it with the `pipeline()` function, just like we did in <>. First, let's load the pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"id": "e5gEijHAYHZx",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 209,
"referenced_widgets": [
"a63ceaeefacd46e5a38b624d0e0425a2",
"aa161c3421fe46eb94b922354cf95db1",
"0eef33af060a4907bd6ebfee33d2717d",
"e70819db5e79454788a525b996af66ff",
"641574c76da0400485db59c93027bd0c",
"daf7e80c2a5b468caa8ca481b61f9de2",
"142013b6c9fd476fbcccdafd559fa77e",
"2f24a91abfbc417db91cbaf8f17cfdfd",
"bc143a2768a2440fa43a7fbed6075315",
"5437f878021743d2893e4f98c47c96ae",
"8346490808ae4681a87261ef0478e5fe",
"104bf86244f24f89bfdbec815b4dd026",
"8481371b1eb743f4bc7ee41a1cd4fa71",
"407e540c4443431da309b41b33c96cd7",
"06a93bf890d44533b520f9b3473d4e23",
"1f33cd3b840f4f18bbdd7ab463fa1082",
"d7f78d6afbf445288b69bf82a7beeaa9",
"b6ba77da68a546de9c056cdeba29f748",
"b39a3db720fb4b80ae77d25c66731a07",
"22954e03a1034d80ba50998a89e7a4f5",
"d8018ada6d4748cebc24c55a596e4a68",
"45df133817194e15866ac8a4daa018e1",
"25a31b0f22a3487592ecdb18aceb4e56",
"2eee442568af45db860db13f8fa365e8",
"5e682347ac1144019b1d391df6dcdcda",
"034c355debba468abdb763251125b6ca",
"91d04c4bb8544ed4a15f64ce14bd1c40",
"6202ca0e9a9b45ca9979fb378907a9b8",
"d950f132b9f14ab79964432b7d5f8068",
"f055b72a9a814ccdbbcc60792c0335d5",
"3f92ee9687da4295ae520326efebb410",
"948f69c154f24906bacea3db182d43e4",
"b5a615298e0541c4bfa10158aec3ca79",
"6d9985bfc2fc427a9ce2e744c4f20f9e",
"c05808fdd92a43169c121fb6f29c28fb",
"747f2338056343778e6ba6be1d8cab8e",
"bd32695b923f47cba7039a9cd09f5094",
"13722354d057478d886009996aa47a9e",
"f7e29388b46c469a9c2b6b255d7cf624",
"5d43160187ae41f1bce32c5a6d2a304f",
"6c2e195784f04f6aa9d8958d86b33487",
"c7a9cd3163a043ed98efe7a7770eaa77",
"7cc4455b15394799aae8ea6e850696eb",
"16ba07134b06403cbdcf973c06aa9303",
"fcacaf40e10c42abb56f245f1191bbe7",
"8b733b3aeb6145c490d1c9fe107e82b4",
"553c0e65f77a4c6ebb8bc508697544f9",
"94207ab5d7524500a05984c3775986a3",
"1cb01d102768453ea245c9835b4a685e",
"f4302ea897584435a86e557e4ccf0ab9",
"901f584c7f4f42c99b70a104edbf3220",
"3ca3beba8f0848b3b7ef901f3cec9842",
"400473b97cfb474da69ff8364f65e75a",
"a4721599c5d841fcac7ae4fed8548fd1",
"08a0268b55ae43cf9dace159dc234f0f",
"bc02a36040cc47bc9dd4f366a346e8ad",
"47d592dc0ae04f4faddbe69151f4369a",
"538fcfd9eb4d4bf39cd46cd39eb572e3",
"63732b652a9b45879d7ce2a8cba6bf83",
"f9e08385306d46e298607f2e1753bbfd",
"d25917c2d7ec48b1b2ef3eb285112fc7",
"c237cbd4500a489fadb0682de35c8a8c",
"a8f50a0440754de791631d0cedf6c30e",
"368d80192215471897ab473c870b239d",
"563ce812353f4ae59041c95a2dc33f78",
"5a49f68ab8b34b59b8704a75baf8aa9c"
]
},
"outputId": "4519c108-78a1-46de-f282-8647d9272249"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"config.json: 0%| | 0.00/872 [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "a63ceaeefacd46e5a38b624d0e0425a2"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"pytorch_model.bin: 0%| | 0.00/268M [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "104bf86244f24f89bfdbec815b4dd026"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/333 [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "25a31b0f22a3487592ecdb18aceb4e56"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"vocab.txt: 0%| | 0.00/232k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "6d9985bfc2fc427a9ce2e744c4f20f9e"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer.json: 0%| | 0.00/466k [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "fcacaf40e10c42abb56f245f1191bbe7"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/112 [00:00, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "bc02a36040cc47bc9dd4f366a346e8ad"
}
},
"metadata": {}
}
],
"source": [
"#hide_output\n",
"from transformers import pipeline\n",
"\n",
"# Change `transformersbook` to your Hub username\n",
"model_id = \"transformersbook/distilbert-base-uncased-finetuned-emotion\"\n",
"classifier = pipeline(\"text-classification\", model=model_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6aIlu2EdYHZx"
},
"source": [
"Then let's test the pipeline with a sample tweet:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"id": "jzh8kfqFYHZx",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "db36f3f6-05b0-46af-c6d2-beefbcdb8d92"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated, if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.\n",
" warnings.warn(\n"
]
}
],
"source": [
"custom_tweet = \"What the hell!\"\n",
"preds = classifier(custom_tweet, return_all_scores=True)"
]
},
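{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before plotting, it can help to see how such a result might be read programmatically. The sketch below assumes the nested layout of one inner list per input, holding one `label`/`score` dictionary per class; the values in `mock_preds` are made up for illustration and are not output from the model.\n",
"\n",
"```python\n",
"# Made-up scores in the assumed nested list-of-dicts layout; not real model output\n",
"mock_preds = [[\n",
"    {'label': 'LABEL_0', 'score': 0.05},\n",
"    {'label': 'LABEL_1', 'score': 0.10},\n",
"    {'label': 'LABEL_2', 'score': 0.05},\n",
"    {'label': 'LABEL_3', 'score': 0.70},\n",
"    {'label': 'LABEL_4', 'score': 0.07},\n",
"    {'label': 'LABEL_5', 'score': 0.03},\n",
"]]\n",
"\n",
"scores = mock_preds[0]  # scores for the first (and only) input\n",
"best = max(scores, key=lambda d: d['score'])  # entry with the highest score\n",
"total = sum(d['score'] for d in scores)  # class probabilities should sum to about 1\n",
"```\n",
"\n",
"The same indexing (`preds[0]`) gives the per-class scores for our single tweet."
]
},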
{
"cell_type": "markdown",
"metadata": {
"id": "Ju2dYxiAYHZy"
},
"source": [
"Finally, we can plot the probability for each class in a bar plot, which shows at a glance which class the model considers most likely for the tweet:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"id": "JFuVj6ZtYHZy",
"outputId": "3c3e0d90-302e-49a0-d71a-076184597c62",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 382
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"