DONGJOONSHIN commited on
Commit
9e4703e
โ€ข
1 Parent(s): 3e6d04f
config.json ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "ehdwns1516/klue-roberta-base_sae",
3
+ "architectures": [
4
+ "RobertaForSequenceClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "eos_token_id": 2,
9
+ "gradient_checkpointing": false,
10
+ "hidden_act": "gelu",
11
+ "hidden_dropout_prob": 0.1,
12
+ "hidden_size": 768,
13
+ "id2label": {
14
+ "0": "yes/no",
15
+ "1": "alternative",
16
+ "2": "wh- questions",
17
+ "3": "prohibitions",
18
+ "4": "requirements",
19
+ "5": "strong requirements"
20
+ },
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 3072,
23
+ "label2id": {
24
+ "yes/no": 0,
25
+ "alternative": 1,
26
+ "wh- questions": 2,
27
+ "prohibitions": 3,
28
+ "requirements": 4,
29
+ "strong requirements": 5
30
+ },
31
+ "layer_norm_eps": 1e-05,
32
+ "max_position_embeddings": 512,
33
+ "model_type": "roberta",
34
+ "num_attention_heads": 12,
35
+ "num_hidden_layers": 12,
36
+ "pad_token_id": 1,
37
+ "position_embedding_type": "absolute",
38
+ "problem_type": "single_label_classification",
39
+ "tokenizer_class": "BertTokenizer",
40
+ "torch_dtype": "float32",
41
+ "transformers_version": "4.9.2",
42
+ "type_vocab_size": 1,
43
+ "use_cache": true,
44
+ "vocab_size": 32000
45
+ }
klue-roberta-base-kornli.ipynb ADDED
@@ -0,0 +1,854 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "4ef9f047",
6
+ "metadata": {},
7
+ "source": [
8
+ "# HuggingFace Transformers๋ฅผ ์ด์šฉํ•œ NLI ๋ชจ๋ธ ํ•™์Šต\n",
9
+ "\n",
10
+ "* ๋ณธ ๋…ธํŠธ๋ถ์—์„œ๋Š” klue/roberta-base ๋ชจ๋ธ์„ kakaobrain์—์„œ ์ œ๊ณตํ•˜๋Š” KorNLI ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค. \n",
11
+ "\n",
12
+ "* KorNLI ๋ฐ์ดํ„ฐ์…‹์€ [kakaobrain](https://github.com/kakaobrain/KorNLUDatasets?fbclid=IwAR0LX_jem7qb6HUikflO-F6lPpfoefK9Yc0jSQIdSKdkX4s8SW1UvoVGc7I)์—์„œ ์ œ๊ณตํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์ด๋ฉฐ, KLUE์˜ NLI ๋ฐ์ดํ„ฐ์…‹ ๋ณด๋‹ค ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๊ฐ€ ๋” ์ปค์„œ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค. \n",
13
+ "\n",
14
+ "* ๋ชจ๋“  ์†Œ์Šค ์ฝ”๋“œ๋Š” [huggingface-notebook](https://github.com/huggingface/notebooks)์„ ์ฐธ๊ณ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.\n",
15
+ "\n",
16
+ "* ๋ณธ ๋…ธํŠธ๋ถ์˜ ํ•™์Šต ๋‚ด์šฉ ๋Œ€๋ถ€๋ถ„์€ Huffon๋‹˜์˜ [klue-transformers-tutorial](https://github.com/Huffon/klue-transformers-tutorial)์„ ์ฐธ๊ณ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.\n",
17
+ "\n",
18
+ "* ํ•™์Šต์„ ํ†ตํ•ด ์–ป์–ด์งˆ klue-roberta-base-kornli ๋ชจ๋ธ์€ ์ž…๋ ฅ๋œ ๋‘ ๋ฌธ์žฅ์˜ ์ถ”๋ก  ๊ด€๊ณ„๋ฅผ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.\n",
19
+ "\n",
20
+ "* ๋…ธํŠธ๋ถ์˜ ํ™˜๊ฒฝ์€ ainize workspace ์ž…๋‹ˆ๋‹ค. [ainze](https://ainize.ai/) ํ™ˆํŽ˜์ด์ง€ ์˜ค๋ฅธ์ชฝ ์ƒ๋‹จ์˜ github๋กœ ์•„์ด๋”” ์ƒ์„ฑ ํ›„ My space์—์„œ workspace๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ œ๊ฐ€ ์‚ฌ์šฉํ•œ workspace์˜ GPU ํ™˜๊ฒฝ์€ tesla V100 32GB์ž…๋‹ˆ๋‹ค.\n",
21
+ "\n",
22
+ "\n",
23
+ "## ๋…ธํŠธ๋ถ ํ™˜๊ฒฝ ์„ค์ •\n",
24
+ "\n",
25
+ "๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๊ธฐ ์ „์— ์ž์‹ ์˜ ๋…ธํŠธ๋ถ ํ™˜๊ฒฝ์„ ํ™•์ธํ•˜๊ณ  ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค."
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "code",
30
+ "execution_count": 1,
31
+ "id": "9ddee489",
32
+ "metadata": {},
33
+ "outputs": [
34
+ {
35
+ "name": "stdout",
36
+ "output_type": "stream",
37
+ "text": [
38
+ "True\n"
39
+ ]
40
+ }
41
+ ],
42
+ "source": [
43
+ "import torch\n",
44
+ "use_cuda = torch.cuda.is_available()\n",
45
+ "print(use_cuda)"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "markdown",
50
+ "id": "68700205",
51
+ "metadata": {},
52
+ "source": [
53
+ "๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ GPU๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•™์Šต์„ ๋น ๋ฅด๊ฒŒ ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ํ•™์Šต์‹œํ‚ค๊ธฐ ์ „ ์ž์‹ ์˜ ๋…ธํŠธ๋ถ ํ™˜๊ฒฝ์—์„œ GPU๊ฐ€ ์ž˜ ๋™์ž‘ ํ•˜๋Š”์ง€ ํ™•์ธ์„ ํ•ฉ๋‹ˆ๋‹ค. True๊ฐ€ ์ถœ๋ ฅ์ด ๋œ๋‹ค๋ฉด ํ•™์Šตํ•  ๋•Œ GPU๊ฐ€ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ์ด๊ณ  False๊ฐ€ ๋‚˜์˜ค๋ฉด ๋™์ž‘ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. False๊ฐ€ ์ถœ๋ ฅ์ด ๋˜์‹ ๋‹ค๋ฉด ๊ฒ€์ƒ‰์„ ํ†ตํ•ด GPU๋ฅผ ํ™œ์„ฑํ™” ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์„ ์ฐพ์•„ ํ™œ์„ฑํ™” ์‹œํ‚ค์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค."
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": 2,
59
+ "id": "49766327",
60
+ "metadata": {},
61
+ "outputs": [
62
+ {
63
+ "name": "stdout",
64
+ "output_type": "stream",
65
+ "text": [
66
+ "[name: \"/device:CPU:0\"\n",
67
+ "device_type: \"CPU\"\n",
68
+ "memory_limit: 268435456\n",
69
+ "locality {\n",
70
+ "}\n",
71
+ "incarnation: 4073418392972710111\n",
72
+ ", name: \"/device:GPU:0\"\n",
73
+ "device_type: \"GPU\"\n",
74
+ "memory_limit: 19146211328\n",
75
+ "locality {\n",
76
+ " bus_id: 1\n",
77
+ " links {\n",
78
+ " }\n",
79
+ "}\n",
80
+ "incarnation: 16025480191014330435\n",
81
+ "physical_device_desc: \"device: 0, name: Tesla V100-DGXS-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0\"\n",
82
+ "]\n"
83
+ ]
84
+ }
85
+ ],
86
+ "source": [
87
+ "from tensorflow.python.client import device_lib\n",
88
+ "print(device_lib.list_local_devices())"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "markdown",
93
+ "id": "cbad9661",
94
+ "metadata": {},
95
+ "source": [
96
+ "๋ณธ์ธ์ด ์ž‘์„ฑํ•˜๊ณ  ์žˆ๋Š” ๋…ธํŠธ๋ถ GPU ํ™˜๊ฒฝ์„ ์•Œ ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์ €๋Š” ainize workspace์—์„œ Tesla V100 ํ™˜๊ฒฝ์—์„œ ๋…ธํŠธ๋ถ์„ ํ…Œ์ŠคํŠธํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— Tesla V100-DGXS-32GB์ด๋ผ๊ณ  ์ž˜ ๋‚˜์˜ค๋„ค์š”."
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "code",
101
+ "execution_count": 3,
102
+ "id": "4a699d02",
103
+ "metadata": {},
104
+ "outputs": [],
105
+ "source": [
106
+ "#!pip install -U transformers datasets scipy scikit-learn"
107
+ ]
108
+ },
109
+ {
110
+ "cell_type": "markdown",
111
+ "id": "8c957b4c",
112
+ "metadata": {},
113
+ "source": [
114
+ "๋ชจ๋ธ ํ›ˆ๋ จ์„ ์œ„ํ•œ transformers์™€ ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ๋ฅผ ์œ„ํ•ด datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ์™ธ ๋ชจ๋ธ ์„ฑ๋Šฅ ๊ฒ€์ฆ์„ ์œ„ํ•ด scipy, scikit-learn ๋˜ํ•œ ์ถ”๊ฐ€๋กœ ์„ค์น˜ํ•ด์ค๋‹ˆ๋‹ค.\n",
115
+ "\n",
116
+ "\n",
117
+ "## ๋ฌธ์žฅ ๋ถ„๋ฅ˜ ๋ชจ๋ธ ํ•™์Šต"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "execution_count": 4,
123
+ "id": "401dacc7",
124
+ "metadata": {},
125
+ "outputs": [],
126
+ "source": [
127
+ "import random\n",
128
+ "import logging\n",
129
+ "import torch\n",
130
+ "from IPython.display import display, HTML\n",
131
+ "import numpy as np\n",
132
+ "import pandas as pd\n",
133
+ "import datasets\n",
134
+ "from datasets import load_dataset, load_metric, ClassLabel, Sequence, list_datasets\n",
135
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n"
136
+ ]
137
+ },
138
+ {
139
+ "cell_type": "markdown",
140
+ "id": "0ae8dc56",
141
+ "metadata": {},
142
+ "source": [
143
+ "๏ฟฝ๏ฟฝ๏ฟฝํŠธ๋ถ์„ ์‹คํ–‰ํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋ชจ๋‘ ์ž„ํฌํŠธ ํ•ด์ค๋‹ˆ๋‹ค."
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "code",
148
+ "execution_count": 5,
149
+ "id": "1d99beee",
150
+ "metadata": {},
151
+ "outputs": [],
152
+ "source": [
153
+ "model_checkpoint = \"klue/roberta-base\"\n",
154
+ "batch_size = 32\n",
155
+ "task = \"nli\""
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "markdown",
160
+ "id": "048072e8",
161
+ "metadata": {},
162
+ "source": [
163
+ "ํ•™์Šต์— ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ๋ณ€์ˆ˜๋กœ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค.\n",
164
+ "\n",
165
+ "๋ณธ ๋…ธํŠธ๋ถ์• ์„œ๋Š” klue-roberta-base ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์ง€๋งŒ, https://huggingface.co/klue ํŽ˜์ด์ง€์—์„œ ๋” ๋‹ค์–‘ํ•œ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด ๋ชจ๋ธ์„ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.\n",
166
+ "\n",
167
+ "ํ•™์Šต ํƒœ์Šคํฌ๋กœ๋Š” nli๋ฅผ, ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ๋Š” 32๋กœ ์ง€์ •ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค."
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": 6,
173
+ "id": "5e5c4151",
174
+ "metadata": {},
175
+ "outputs": [
176
+ {
177
+ "name": "stderr",
178
+ "output_type": "stream",
179
+ "text": [
180
+ "Reusing dataset kor_nlu (/workspace/.cache/huggingface/datasets/kor_nlu/nli/1.0.0/4facbba77df60b0658056ced2052633e681a50187b9428bd5752ebd59d332ba8)\n"
181
+ ]
182
+ }
183
+ ],
184
+ "source": [
185
+ "datasets = load_dataset(\"kor_nlu\", task)"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "markdown",
190
+ "id": "06d9dc6d",
191
+ "metadata": {},
192
+ "source": [
193
+ "์ €๋Š” ํ•™์Šต์— ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ์…‹์„ kakaobrain์—์„œ ์ œ๊ณตํ•˜๋Š” KorNLI ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋ฏ€๋กœ HuggingFace datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ๋“ฑ๋ก๋œ kor_nlu์˜ nli ๋ฐ์ดํ„ฐ์„ ๋‹ค์šด๋กœ๋“œ ํ•ฉ๋‹ˆ๋‹ค."
194
+ ]
195
+ },
196
+ {
197
+ "cell_type": "code",
198
+ "execution_count": 7,
199
+ "id": "01457377",
200
+ "metadata": {},
201
+ "outputs": [
202
+ {
203
+ "data": {
204
+ "text/plain": [
205
+ "DatasetDict({\n",
206
+ " train: Dataset({\n",
207
+ " features: ['premise', 'hypothesis', 'label'],\n",
208
+ " num_rows: 550146\n",
209
+ " })\n",
210
+ " validation: Dataset({\n",
211
+ " features: ['premise', 'hypothesis', 'label'],\n",
212
+ " num_rows: 1570\n",
213
+ " })\n",
214
+ " test: Dataset({\n",
215
+ " features: ['premise', 'hypothesis', 'label'],\n",
216
+ " num_rows: 4954\n",
217
+ " })\n",
218
+ "})"
219
+ ]
220
+ },
221
+ "execution_count": 7,
222
+ "metadata": {},
223
+ "output_type": "execute_result"
224
+ }
225
+ ],
226
+ "source": [
227
+ "datasets"
228
+ ]
229
+ },
230
+ {
231
+ "cell_type": "markdown",
232
+ "id": "a8291fbf",
233
+ "metadata": {},
234
+ "source": [
235
+ "๋‹ค์šด๋กœ๋“œ ํ›„ ์–ป์–ด์ง„ datasets์˜ ๊ฐ์ฒด๋ฅผ ๋ณด๋ฉด KorNLI ๋ฐ์ดํ„ฐ์—๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ, ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค."
236
+ ]
237
+ },
238
+ {
239
+ "cell_type": "code",
240
+ "execution_count": 8,
241
+ "id": "07659e8c",
242
+ "metadata": {},
243
+ "outputs": [
244
+ {
245
+ "data": {
246
+ "text/plain": [
247
+ "{'premise': '๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.',\n",
248
+ " 'hypothesis': 'ํ•œ ์‚ฌ๋žŒ์ด ๊ฒฝ์Ÿ์„ ์œ„ํ•ด ๋ง์„ ํ›ˆ๋ จ์‹œํ‚ค๊ณ  ์žˆ๋‹ค.',\n",
249
+ " 'label': 1}"
250
+ ]
251
+ },
252
+ "execution_count": 8,
253
+ "metadata": {},
254
+ "output_type": "execute_result"
255
+ }
256
+ ],
257
+ "source": [
258
+ "datasets[\"train\"][0]"
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "markdown",
263
+ "id": "29d24878",
264
+ "metadata": {},
265
+ "source": [
266
+ "๊ฐ ๋ฐ์ดํ„ฐ๋Š” ์œ„์™€ ๊ฐ™์ด ๋‘๊ฐœ์˜ ๋ฌธ์žฅ๊ณผ ๋‘ ๋ฌธ์žฅ์˜ ์ถ”๋ก  ๊ด€๊ณ„๋ฅผ ๋ผ๋ฒจ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค."
267
+ ]
268
+ },
269
+ {
270
+ "cell_type": "code",
271
+ "execution_count": 9,
272
+ "id": "33ce978b",
273
+ "metadata": {},
274
+ "outputs": [],
275
+ "source": [
276
+ "def show_random_elements(dataset, num_examples=10):\n",
277
+ " assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n",
278
+ "\n",
279
+ " picks = []\n",
280
+ " \n",
281
+ " for _ in range(num_examples):\n",
282
+ " pick = random.randint(0, len(dataset)-1)\n",
283
+ "\n",
284
+ " # ์ด๋ฏธ ๋“ฑ๋ก๋œ ์˜ˆ์ œ๊ฐ€ ๋ฝ‘ํžŒ ๊ฒฝ์šฐ, ๋‹ค์‹œ ์ถ”์ถœ\n",
285
+ " while pick in picks:\n",
286
+ " pick = random.randint(0, len(dataset)-1)\n",
287
+ "\n",
288
+ " picks.append(pick)\n",
289
+ "\n",
290
+ " # ์ž„์˜๋กœ ์ถ”์ถœ๋œ ์ธ๋ฑ์Šค๋“ค๋กœ ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์„ ์–ธ\n",
291
+ " df = pd.DataFrame(dataset[picks])\n",
292
+ "\n",
293
+ " for column, typ in dataset.features.items():\n",
294
+ " # ๋ผ๋ฒจ ํด๋ž˜์Šค๋ฅผ ์ŠคํŠธ๋ง์œผ๋กœ ๋ณ€ํ™˜\n",
295
+ " if isinstance(typ, ClassLabel):\n",
296
+ " df[column] = df[column].transform(lambda i: typ.names[i])\n",
297
+ "\n",
298
+ " display(HTML(df.to_html()))"
299
+ ]
300
+ },
301
+ {
302
+ "cell_type": "markdown",
303
+ "id": "58c8a3aa",
304
+ "metadata": {},
305
+ "source": [
306
+ "๋ฐ์ดํ„ฐ์…‹์„ ์ „๋ฐ˜์ ์œผ๋กœ ์‚ดํŽด๋ณด๊ธฐ ์œ„ํ•ด ์‹œ๊ฐํ™” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค."
307
+ ]
308
+ },
309
+ {
310
+ "cell_type": "code",
311
+ "execution_count": 10,
312
+ "id": "d4500f78",
313
+ "metadata": {},
314
+ "outputs": [
315
+ {
316
+ "data": {
317
+ "text/html": [
318
+ "<table border=\"1\" class=\"dataframe\">\n",
319
+ " <thead>\n",
320
+ " <tr style=\"text-align: right;\">\n",
321
+ " <th></th>\n",
322
+ " <th>premise</th>\n",
323
+ " <th>hypothesis</th>\n",
324
+ " <th>label</th>\n",
325
+ " </tr>\n",
326
+ " </thead>\n",
327
+ " <tbody>\n",
328
+ " <tr>\n",
329
+ " <th>0</th>\n",
330
+ " <td>์ Š๊ณ  ๋Š™์€ ์—ฌ์„ฑ๋“ค์€ ํฐ ์ž”๋””๋ฐญ์—์„œ ์‚ฌํšŒ, ์ •์น˜, ๊ตํšŒ ๋ชจ์ž„์ด ๋  ์ˆ˜ ์žˆ๋Š” ๊ณณ์— ๊ทธ๋ ค์ ธ ์žˆ๋‹ค.</td>\n",
331
+ " <td>ํ•œ ๋ฌด๋ฆฌ์˜ ์—ฌ์„ฑ์ด ๊ตํšŒ ์ง€ํ•˜์‹ค์—์„œ ์ •์น˜์  ํšŒ์˜๋ฅผ ํ•˜๊ณ  ์žˆ๋‹ค.</td>\n",
332
+ " <td>contradiction</td>\n",
333
+ " </tr>\n",
334
+ " <tr>\n",
335
+ " <th>1</th>\n",
336
+ " <td>ํ•œ ์‚ฌ๋žŒ์ด ์ฐจ ์˜† ๋ฐ”์œ„๊ฐ€ ๋งŽ์€ ๊ณณ์—์„œ ๋ฌผ๊ฑด์„ ์ค๋Š”๋‹ค.</td>\n",
337
+ " <td>์‚ฌ๋žŒ์ด ๋ฐ–์— ์žˆ๋‹ค.</td>\n",
338
+ " <td>entailment</td>\n",
339
+ " </tr>\n",
340
+ " <tr>\n",
341
+ " <th>2</th>\n",
342
+ " <td>ํ•˜์–€ ์…”์ธ ๋ฅผ ์ž…์€ ๋‚จ์ž๊ฐ€ ์•ผ์™ธ ํ…Œ์ด๋ธ”์—์„œ ์—ฌ์ž ๋งž์€ํŽธ์— ์•‰์•„ ์žˆ๋‹ค.</td>\n",
343
+ " <td>์ปค๋‹ค๋ž€ ์ธ๊ฐ„์ด ์•‰์•„ ์žˆ์—ˆ๋‹ค.</td>\n",
344
+ " <td>neutral</td>\n",
345
+ " </tr>\n",
346
+ " <tr>\n",
347
+ " <th>3</th>\n",
348
+ " <td>์˜ค๋ Œ์ง€์ƒ‰ ์กฐ๋ผ๋ฅผ ์ž…์€ ๋‘ ๋ช…์˜ ์ง์›์ด ์„œ๋ฅ˜ ์ž‘์—…์„ ํ•˜๊ณ  ์žˆ๋‹ค.</td>\n",
349
+ " <td>์‚ฌ๋žŒ๋“ค์ด ์ž๊ณ  ์žˆ๋‹ค.</td>\n",
350
+ " <td>contradiction</td>\n",
351
+ " </tr>\n",
352
+ " <tr>\n",
353
+ " <th>4</th>\n",
354
+ " <td>๋นจ๊ฐ„ ์ €์ง€๋ฅผ ์ž…์€ ์ถ•๊ตฌ์„ ์ˆ˜๊ฐ€ ๊ณต์„ ๋˜์ง€๋ ค๊ณ  ํ•˜๋Š” ํ•˜์–€ ์ €์ง€๋ฅผ ์ž…์€ ์ถ•๊ตฌ์„ ์ˆ˜์™€ ์‹ธ์šฐ๊ณ  ์žˆ๋‹ค.</td>\n",
355
+ " <td>์ˆ˜๋…€๊ฐ€ ์นดํŽ˜์—์„œ ์ปคํ”ผ๋ฅผ ์ฃผ๋ฌธํ•œ๋‹ค.</td>\n",
356
+ " <td>contradiction</td>\n",
357
+ " </tr>\n",
358
+ " <tr>\n",
359
+ " <th>5</th>\n",
360
+ " <td>๋…ธ๋ž€ ์›ํ”ผ์Šค์™€ ๋ถ„ํ™์ƒ‰ ์ƒŒ๋“ค์„ ์‹ ์€ ์•„์ด๊ฐ€ ์ปค๋‹ค๋ž€ ๋ฌผ์—์„œ ๊ฑธ์–ด๊ฐ„๋‹ค.</td>\n",
361
+ " <td>ํ•œ ์•„์ด๊ฐ€ ๊ทธ๋…€์˜ ๊ฐœ๋ฅผ ํ˜ธ์ˆ˜์—์„œ ์ซ“์•„๋‚ธ๋‹ค.</td>\n",
362
+ " <td>neutral</td>\n",
363
+ " </tr>\n",
364
+ " <tr>\n",
365
+ " <th>6</th>\n",
366
+ " <td>์ดˆ๋ผํ•œ ๋“œ๋ ˆ์Šค๋ฅผ ์ž…์€ ์—ฌ์ž.</td>\n",
367
+ " <td>๊ทธ ๋“œ๋ ˆ์Šค๋Š” ์•ฝ๊ฐ„ ์Œ€์Œ€ํ•˜๋‹ค.</td>\n",
368
+ " <td>neutral</td>\n",
369
+ " </tr>\n",
370
+ " <tr>\n",
371
+ " <th>7</th>\n",
372
+ " <td>๋‘ ๋งˆ๋ฆฌ์˜ ๋ง์ด ํ•œ ์—ฌ์ž๋ฅผ ์ˆ˜๋ ˆ์— ํƒœ์šฐ๊ณ  ์žˆ๋‹ค.</td>\n",
373
+ " <td>์—ฌ์ž๊ฐ€ ๋งˆ์ฐจ๋ฅผ ํƒ€๊ณ  ๋ง์„ ํƒ„๋‹ค.</td>\n",
374
+ " <td>entailment</td>\n",
375
+ " </tr>\n",
376
+ " <tr>\n",
377
+ " <th>8</th>\n",
378
+ " <td>๋‚™์„œ๋กœ ๋’ค๋ฎ์ธ ๋ฒฝ๋Œ๋‹ด์„ ์ง€๋‚˜๊ฐ€๋Š” ๋‚จ์ž์™€ ์—ฌ์ž.</td>\n",
379
+ " <td>๋‘ ์‚ฌ๋žŒ์ด ๋ฒฝ์— ๊ฑธ๋ฆฐ ๋‚™์„œ๋ฅผ ์ง€๋‚˜๊ฐ„๋‹ค.</td>\n",
380
+ " <td>entailment</td>\n",
381
+ " </tr>\n",
382
+ " <tr>\n",
383
+ " <th>9</th>\n",
384
+ " <td>์ธ๋„ ์—ฌ์„ฑ๋“ค์ด ์ „ํ†ต๋ฌด์šฉ์— ์†์„ ์žก๊ณ  ์žˆ๋‹ค.</td>\n",
385
+ " <td>์ธ๋„ ์—ฌ์„ฑ๋“ค์€ ๊ฒฐํ˜ผ์‹์—์„œ ์ „ํ†ต ์ถค์„ ์ถ˜๋‹ค.</td>\n",
386
+ " <td>neutral</td>\n",
387
+ " </tr>\n",
388
+ " </tbody>\n",
389
+ "</table>"
390
+ ],
391
+ "text/plain": [
392
+ "<IPython.core.display.HTML object>"
393
+ ]
394
+ },
395
+ "metadata": {},
396
+ "output_type": "display_data"
397
+ }
398
+ ],
399
+ "source": [
400
+ "show_random_elements(datasets[\"train\"])"
401
+ ]
402
+ },
403
+ {
404
+ "cell_type": "markdown",
405
+ "id": "5dc9ddf3",
406
+ "metadata": {},
407
+ "source": [
408
+ "์•ž์„œ ์ •์˜ํ•œ ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ์ž„์˜์˜ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํŽด๋ณด๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.\n",
409
+ "\n",
410
+ "KorNLI์—๋Š” ๊ด€๊ณ„ ์ถ”๋ก ์„ ํ•  ๋•Œ entailment, neutral ๊ทธ๋ฆฌ๊ณ  contradiction ์„ธ ๊ฐœ์˜ ๋ผ๋ฒจ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.\n",
411
+ "\n",
412
+ "์ด๋ ‡๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ๊ฐ ๋ผ๋ฒจ์— ์–ด๋– ํ•œ ๋ฌธ์žฅ๋“ค์ด ํ•ด๋‹น๋˜๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ์–ด ๊ฐ์„ ์ตํž ์ˆ˜ ์žˆ๋Š”๋ฐ์— ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค."
413
+ ]
414
+ },
415
+ {
416
+ "cell_type": "code",
417
+ "execution_count": 11,
418
+ "id": "e1dbaa2b",
419
+ "metadata": {},
420
+ "outputs": [],
421
+ "source": [
422
+ "metric = load_metric(\"glue\", \"qnli\")"
423
+ ]
424
+ },
425
+ {
426
+ "cell_type": "markdown",
427
+ "id": "d112985e",
428
+ "metadata": {},
429
+ "source": [
430
+ "ํ›ˆ๋ จ ๊ณผ์ • ์ค‘ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ํŒŒ์•…์„ ์œ„ํ•ด ๋ฉ”ํŠธ๋ฆญ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.\n",
431
+ "\n",
432
+ "datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—๋Š” ์ด๋ฏธ ๊ตฌํ˜„๋œ ๋ฉ”ํŠธ๋ฆญ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” load_metric ํ•จ์ˆ˜๊ฐ€ ์žˆ์–ด์„œ ์‰ฝ๊ฒŒ ์„ค์ •์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.\n",
433
+ "\n",
434
+ "๊ทธ ์ค‘ GLUE ๋ฐ์ดํ„ฐ์…‹์˜ qnli ํƒœ์Šคํฌ์˜ ๋ฉ”ํŠธ๋ฆญ์„ ์‚ฌ์šฉํ•˜๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค."
435
+ ]
436
+ },
437
+ {
438
+ "cell_type": "code",
439
+ "execution_count": 12,
440
+ "id": "7be908b6",
441
+ "metadata": {},
442
+ "outputs": [
443
+ {
444
+ "data": {
445
+ "text/plain": [
446
+ "(array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,\n",
447
+ " 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,\n",
448
+ " 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]),\n",
449
+ " array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,\n",
450
+ " 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,\n",
451
+ " 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]))"
452
+ ]
453
+ },
454
+ "execution_count": 12,
455
+ "metadata": {},
456
+ "output_type": "execute_result"
457
+ }
458
+ ],
459
+ "source": [
460
+ "fake_preds = np.random.randint(0, 2, size=(64,))\n",
461
+ "fake_labels = np.random.randint(0, 2, size=(64,))\n",
462
+ "fake_preds, fake_labels"
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "markdown",
467
+ "id": "7b7de386",
468
+ "metadata": {},
469
+ "source": [
470
+ "๋ฉ”ํŠธ๋ฆญ์ด ์ •์ƒ์ ์œผ๋กœ ์ž‘๋™ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด ๋žœ๋คํ•œ ์˜ˆ์ธก ๊ฐ’๊ณผ ๋ผ๋ฒจ ๊ฐ’์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค."
471
+ ]
472
+ },
473
+ {
474
+ "cell_type": "code",
475
+ "execution_count": 13,
476
+ "id": "b3cce09f",
477
+ "metadata": {},
478
+ "outputs": [
479
+ {
480
+ "data": {
481
+ "text/plain": [
482
+ "{'accuracy': 0.5}"
483
+ ]
484
+ },
485
+ "execution_count": 13,
486
+ "metadata": {},
487
+ "output_type": "execute_result"
488
+ }
489
+ ],
490
+ "source": [
491
+ "metric.compute(predictions=fake_preds, references=fake_labels)"
492
+ ]
493
+ },
494
+ {
495
+ "cell_type": "markdown",
496
+ "id": "a040c7e9",
497
+ "metadata": {},
498
+ "source": [
499
+ "์ƒ์„ฑํ•œ ๋žœ๋ค ๊ฐ’๋“ค์„ compute() ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ์ž˜ ๋™์ž‘ํ•˜๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค."
500
+ ]
501
+ },
502
+ {
503
+ "cell_type": "code",
504
+ "execution_count": 14,
505
+ "id": "6b7eb994",
506
+ "metadata": {},
507
+ "outputs": [],
508
+ "source": [
509
+ "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "id": "439ef8aa",
515
+ "metadata": {},
516
+ "source": [
517
+ "์ด์ œ ํ•™์Šต์— ํ™œ์šฉํ•  ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋กœ๋“œํ•ด์˜ต๋‹ˆ๋‹ค."
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "code",
522
+ "execution_count": 15,
523
+ "id": "88f32f23",
524
+ "metadata": {},
525
+ "outputs": [
526
+ {
527
+ "data": {
528
+ "text/plain": [
529
+ "{'input_ids': [0, 1891, 3611, 2052, 3855, 2069, 3627, 1041, 2069, 4484, 2067, 2089, 2088, 1513, 2062, 18, 2, 1041, 2069, 1763, 3611, 2052, 8514, 2336, 7046, 7587, 5603, 17290, 2062, 18, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}"
530
+ ]
531
+ },
532
+ "execution_count": 15,
533
+ "metadata": {},
534
+ "output_type": "execute_result"
535
+ }
536
+ ],
537
+ "source": [
538
+ "tokenizer(\"ํ•œ ์‚ฌ๋žŒ์ด ๊ฒฝ์Ÿ์„ ์œ„ํ•ด ๋ง์„ ํ›ˆ๋ จ์‹œํ‚ค๊ณ  ์žˆ๋‹ค.\", \"๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.\")"
539
+ ]
540
+ },
541
+ {
542
+ "cell_type": "markdown",
543
+ "id": "fcde8cd6",
544
+ "metadata": {},
545
+ "source": [
546
+ "๋กœ๋“œ๋œ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ๋‘ ๋ฌธ์žฅ์„ ํ† ํฐํ™”ํ•˜๋Š” ๋ฐฉ์‹์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋‘ ๋ฌธ์žฅ์„ ์ž…๋ ฅ ๊ฐ’์œผ๋กœ ๋„ฃ์–ด์„œ ํ™•์ธํ•ด ๋ด…๋‹ˆ๋‹ค."
547
+ ]
548
+ },
549
+ {
550
+ "cell_type": "code",
551
+ "execution_count": 16,
552
+ "id": "6a927431",
553
+ "metadata": {},
554
+ "outputs": [
555
+ {
556
+ "name": "stdout",
557
+ "output_type": "stream",
558
+ "text": [
559
+ "Sentence 1: ๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.\n",
560
+ "Sentence 2: ํ•œ ์‚ฌ๋žŒ์ด ๊ฒฝ์Ÿ์„ ์œ„ํ•ด ๋ง์„ ํ›ˆ๋ จ์‹œํ‚ค๊ณ  ์žˆ๋‹ค.\n"
561
+ ]
562
+ }
563
+ ],
564
+ "source": [
565
+ "sentence1_key, sentence2_key = (\"premise\", \"hypothesis\")\n",
566
+ "print(f\"Sentence 1: {datasets['train'][0][sentence1_key]}\")\n",
567
+ "print(f\"Sentence 2: {datasets['train'][0][sentence2_key]}\")"
568
+ ]
569
+ },
570
+ {
571
+ "cell_type": "markdown",
572
+ "id": "0fb0ba8b",
573
+ "metadata": {},
574
+ "source": [
575
+ "์ด์ œ ์•ž์„œ ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ฐ ๋ฌธ์žฅ์— ํ•ด๋‹นํ•˜๋Š” value๋ฅผ ๋ฝ‘์•„์ฃผ๊ธฐ ์œ„ํ•œ key๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.\n",
576
+ "\n",
577
+ "KorNLI ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋‘ ๋ฌธ์žฅ์˜ label์ด ๊ฐ๊ฐ premise์™€ hypothesis๋ผ๋Š” ์ด๋ฆ„์œผ๋กœ ์ •์˜๋œ ๊ฒƒ์„ ํ™•์ธํ•˜์˜€์œผ๋ฏ€๋กœ, ๋‘ ๋ฌธ์žฅ์˜ key ๋Š” ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐ๊ฐ premise, hypothesis๊ฐ€ ๋ฉ๋‹ˆ๋‹ค."
578
+ ]
579
+ },
580
+ {
581
+ "cell_type": "code",
582
+ "execution_count": 20,
583
+ "id": "4e020138",
584
+ "metadata": {},
585
+ "outputs": [],
586
+ "source": [
587
+ "def preprocess_function(examples):\n",
588
+ " print(examples)\n",
589
+ " return tokenizer(\n",
590
+ " examples[sentence1_key],\n",
591
+ " examples[sentence2_key],\n",
592
+ " truncation=True,\n",
593
+ " return_token_type_ids=False,\n",
594
+ " )"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "markdown",
599
+ "id": "42c8f5a8",
600
+ "metadata": {},
601
+ "source": [
602
+ "์ด์ œ key ๋„ ํ™•์ธ์ด ๋˜์—ˆ์œผ๋‹ˆ, ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ฐ ์˜ˆ์ œ๋“ค์„ ๋ฝ‘์•„์™€ ํ† ํฐํ™” ํ•  ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜๋ฅผ ์œ„์™€ ๊ฐ™์ด ์ •์˜ํ•ด์ค๋‹ˆ๋‹ค.\n",
603
+ "\n",
604
+ "ํ•ด๋‹น ํ•จ์ˆ˜๋Š” ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ธฐ ์•ž์„œ ๋ฐ์ดํ„ฐ์…‹์„ ๋ฏธ๋ฆฌ ํ† ํฐํ™” ์‹œ์ผœ๋†“๋Š” ์ž‘์—…์„ ์œ„ํ•œ ์ฝœ๋ฐฑ ํ•จ์ˆ˜๋กœ ์‚ฌ์šฉ๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.\n",
605
+ "\n",
606
+ "์ธ์ž๋กœ ๋„ฃ์–ด์ฃผ๋Š” truncation๋Š” ๋ชจ๋ธ์ด ์ž…๋ ฅ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ๊ธธ์ด ์ด์ƒ์˜ ํ† ํฐ ์‹œํ€€์Šค๊ฐ€ ๋“ค์–ด์˜ค๊ฒŒ ๋  ๊ฒฝ์šฐ, ์ตœ๋Œ€ ๊ธธ์ด ๊ธฐ์ค€์œผ๋กœ ์‹œํ€€์Šค๋ฅผ ์ž๋ฅด๋ผ๋Š” ์˜๋ฏธ๋ฅผ ์ง€๋‹™๋‹ˆ๋‹ค.\n",
607
+ "\n",
608
+ "( * return_token_type_ids๋Š” ํ† ํฌ๋‚˜์ด์ €๊ฐ€ token_type_ids๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋„๋ก ํ•  ๊ฒƒ์ธ์ง€๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ์ธ์ž์ž…๋‹ˆ๋‹ค. transformers==4.7.0 ๊ธฐ์ค€์œผ๋กœ token_type_ids๊ฐ€ ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ฐ˜ํ™˜๋˜๋ฏ€๋กœ token_type_ids ์ž์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” RoBERTa ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํ•ด๋‹น ์ธ์ž๋ฅผ False๋กœ ์„ค์ •ํ•ด์ฃผ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.)"
609
+ ]
610
+ },
611
+ {
612
+ "cell_type": "code",
613
+ "execution_count": 21,
614
+ "id": "ac8d69bd",
615
+ "metadata": {},
616
+ "outputs": [
617
+ {
618
+ "name": "stdout",
619
+ "output_type": "stream",
620
+ "text": [
621
+ "{'premise': ['๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.', '๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.', '๋ง์„ ํƒ„ ์‚ฌ๋žŒ์ด ๊ณ ์žฅ๋‚œ ๋น„ํ–‰๊ธฐ ์œ„๋กœ ๋›ฐ์–ด์˜ค๋ฅธ๋‹ค.', '์นด๋ฉ”๋ผ์— ์›ƒ๊ณ  ์†์„ ํ”๋“œ๋Š” ์•„์ด๋“ค', '์นด๋ฉ”๋ผ์— ์›ƒ๊ณ  ์†์„ ํ”๋“œ๋Š” ์•„์ด๋“ค', '์นด๋ฉ”๋ผ์— ์›ƒ๊ณ  ์†์„ ํ”๋“œ๋Š” ์•„์ด๋“ค', 'ํ•œ ์†Œ๋…„์ด ๋นจ๊ฐ„ ๋‹ค๋ฆฌ ํ•œ๊ฐ€์šด๋ฐ ์Šค์ผ€์ดํŠธ๋ณด๋“œ์— ๋›ฐ์–ด์˜ค๋ฅด๊ณ  ์žˆ๋‹ค.', 'ํ•œ ์†Œ๋…„์ด ๋นจ๊ฐ„ ๋‹ค๋ฆฌ ํ•œ๊ฐ€์šด๋ฐ ์Šค์ผ€์ดํŠธ๋ณด๋“œ์— ๋›ฐ์–ด์˜ค๋ฅด๊ณ  ์žˆ๋‹ค.', 'ํ•œ ์†Œ๋…„์ด ๋นจ๊ฐ„ ๋‹ค๋ฆฌ ํ•œ๊ฐ€์šด๋ฐ ์Šค์ผ€์ดํŠธ๋ณด๋“œ์— ๋›ฐ์–ด์˜ค๋ฅด๊ณ  ์žˆ๋‹ค.', '๋‚˜์ด ๋“  ๋‚จ์ž๊ฐ€ ์ปคํ”ผ์ˆ์˜ ์ž‘์€ ํ…Œ์ด๋ธ”์— ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋“ค๊ณ  ์•‰์•„ ์žˆ๊ณ  ๋ฐ์€ ์ƒ‰ ์…”์ธ ๋ฅผ ์ž…์€ ์ง์›๋“ค์€ ๋’ค์—์„œ ๋ฏธ์†Œ๋ฅผ ์ง“๊ณ  ์žˆ๋‹ค.', '๋‚˜์ด ๋“  ๋‚จ์ž๊ฐ€ ์ปคํ”ผ์ˆ์˜ ์ž‘์€ ํ…Œ์ด๋ธ”์— ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋“ค๊ณ  ์•‰์•„ ์žˆ๊ณ  ๋ฐ์€ ์ƒ‰ ์…”์ธ ๋ฅผ ์ž…์€ ์ง์›๋“ค์€ ๋’ค์—์„œ ๋ฏธ์†Œ๋ฅผ ์ง“๊ณ  ์žˆ๋‹ค.', '๋‚˜์ด ๋“  ๋‚จ์ž๊ฐ€ ์ปคํ”ผ์ˆ์˜ ์ž‘์€ ํ…Œ์ด๋ธ”์— ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋“ค๊ณ  ์•‰์•„ ์žˆ๊ณ  ๋ฐ์€ ์ƒ‰ ์…”์ธ ๋ฅผ ์ž…์€ ์ง์›๋“ค์€ ๋’ค์—์„œ ๋ฏธ์†Œ๋ฅผ ์ง“๊ณ  ์žˆ๋‹ค.', '๊ธˆ๋ฐœ ์—ฌ์ž ๋‘˜์ด ์„œ๋กœ ๊ปด์•ˆ๊ณ  ์žˆ๋‹ค.', '๊ธˆ๋ฐœ ์—ฌ์ž ๋‘˜์ด ์„œ๋กœ ๊ปด์•ˆ๊ณ  ์žˆ๋‹ค.', '๊ธˆ๋ฐœ ์—ฌ์ž ๋‘˜์ด ์„œ๋กœ ๊ปด์•ˆ๊ณ  ์žˆ๋‹ค.', '์‹๋‹น์— ์žˆ๋Š” ๋ช‡๋ช‡ ์‚ฌ๋žŒ๋“ค, ๊ทธ๋“ค ์ค‘ ํ•œ ๋ช…์€ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', '์‹๋‹น์— ์žˆ๋Š” ๋ช‡๋ช‡ ์‚ฌ๋žŒ๋“ค, ๊ทธ๋“ค ์ค‘ ํ•œ ๋ช…์€ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', '์‹๋‹น์— ์žˆ๋Š” ๋ช‡๋ช‡ ์‚ฌ๋žŒ๋“ค, ๊ทธ๋“ค ์ค‘ ํ•œ ๋ช…์€ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', 'ํ•œ ๋‚˜์ด๋“  ๋‚จ์ž๊ฐ€ ์‹๋‹น์—์„œ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', 'ํ•œ ๋‚˜์ด๋“  ๋‚จ์ž๊ฐ€ ์‹๋‹น์—์„œ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', 'ํ•œ ๋‚˜์ด๋“  ๋‚จ์ž๊ฐ€ ์‹๋‹น์—์„œ ์˜ค๋ Œ์ง€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', '๊ธˆ๋ฐœ์˜ ๋‚จ์ž, ๊ทธ๋ฆฌ๊ณ  ๊ณต๊ณต ๋ถ„์ˆ˜๋Œ€์—์„œ ์ˆ ์„ ๋งˆ์‹œ๋Š” ๊ฐˆ์ƒ‰ ์…”์ธ .', '๊ธˆ๋ฐœ์˜ ๋‚จ์ž, ๊ทธ๋ฆฌ๊ณ  ๊ณต๊ณต ๋ถ„์ˆ˜๋Œ€์—์„œ ์ˆ ์„ ๋งˆ์‹œ๋Š” ๊ฐˆ์ƒ‰ ์…”์ธ .', '๊ธˆ๋ฐœ์˜ ๋‚จ์ž, ๊ทธ๋ฆฌ๊ณ  ๊ณต๊ณต ๋ถ„์ˆ˜๋Œ€์—์„œ ์ˆ ์„ ๋งˆ์‹œ๋Š” ๊ฐˆ์ƒ‰ ์…”์ธ .', '๋ฐฉ๊ธˆ ์ ์‹ฌ์„ ๋จน๊ณ  ์ž‘๋ณ„์ธ์‚ฌ๋ฅผ ํ•œ ๋‘ ์—ฌ์ž.', '๋ฐฉ๊ธˆ ์ ์‹ฌ์„ ๋จน๊ณ  ์ž‘๋ณ„์ธ์‚ฌ๋ฅผ ํ•œ ๋‘ ์—ฌ์ž.', '๋ฐฉ๊ธˆ ์ ์‹ฌ์„ ๋จน๊ณ  ์ž‘๋ณ„์ธ์‚ฌ๋ฅผ ํ•œ ๋‘ ์—ฌ์ž.', '๋‘ ์—ฌ์ž๊ฐ€ ์Œ์‹ ์šด๋ฐ˜ ์šฉ๊ธฐ๋ฅผ ๋“ค๊ณ  ํฌ์˜น์„ ํ–ˆ๋‹ค.', '๋‘ ์—ฌ์ž๊ฐ€ ์Œ์‹ ์šด๋ฐ˜ ์šฉ๊ธฐ๋ฅผ ๋“ค๊ณ  ํฌ์˜น์„ ํ–ˆ๋‹ค.', '๋‘ ์—ฌ์ž๊ฐ€ ์Œ์‹ ์šด๋ฐ˜ ์šฉ๊ธฐ๋ฅผ ๋“ค๊ณ  ํฌ์˜น์„ ํ–ˆ๋‹ค.', '๋ฆฌํ‹€๋ฆฌ๊ทธ ํŒ€์ด ์˜คํ›„ ๊ฒฝ๊ธฐ์—์„œ ๋ฒ ์ด์Šค๋กœ ๋ฏธ๋„๋Ÿฌ์ง€๋Š” ์ฃผ์ž๋ฅผ ์žก์œผ๋ ค๊ณ  ํ•œ๋‹ค.', '๋ฆฌํ‹€๋ฆฌ๊ทธ ํŒ€์ด ์˜คํ›„ ๊ฒฝ๊ธฐ์—์„œ ๋ฒ ์ด์Šค๋กœ ๋ฏธ๋„๋Ÿฌ์ง€๋Š” ์ฃผ์ž๋ฅผ ์žก์œผ๋ ค๊ณ  ํ•œ๋‹ค.', '๋ฆฌํ‹€๋ฆฌ๊ทธ ํŒ€์ด ์˜คํ›„ ๊ฒฝ๊ธฐ์—์„œ ๋ฒ ์ด์Šค๋กœ ๋ฏธ๋„๋Ÿฌ์ง€๋Š” ์ฃผ์ž๋ฅผ ์žก์œผ๋ ค๊ณ  ํ•œ๋‹ค.', '์ด ํ•™๊ต๋Š” ๋‹ค๋ฅธ ๋ฌธํ™”๋“ค์ด ํŒŒํ‹ฐ์—์„œ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ์–ด์ง€๋Š”์ง€์— ๋Œ€ํ•œ ๋ฏธ๊ตญ ๋ฌธํ™”๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด ํŠน๋ณ„ํ•œ ํ–‰์‚ฌ๋ฅผ ์—ด๊ณ  ์žˆ๋‹ค.', '์ด ํ•™๊ต๋Š” ๋‹ค๋ฅธ ๋ฌธํ™”๋“ค์ด ํŒŒํ‹ฐ์—์„œ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ์–ด์ง€๋Š”์ง€์— ๋Œ€ํ•œ ๋ฏธ๊ตญ ๋ฌธํ™”๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด ํŠน๋ณ„ํ•œ ํ–‰์‚ฌ๋ฅผ ์—ด๊ณ  ์žˆ๋‹ค.', '์ด ํ•™๊ต๋Š” ๋‹ค๋ฅธ ๋ฌธํ™”๋“ค์ด ํŒŒํ‹ฐ์—์„œ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ์–ด์ง€๋Š”์ง€์— ๋Œ€ํ•œ ๋ฏธ๊ตญ ๋ฌธํ™”๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด ํŠน๋ณ„ํ•œ ํ–‰์‚ฌ๋ฅผ ์—ด๊ณ  ์žˆ๋‹ค.', '๊ณ ๊ธ‰ ํŒจ์…˜ ์•„๊ฐ€์”จ๋“ค์ด ๋„์‹œ์˜ ์‚ฌ๋žŒ๋“ค ์˜†์—์„œ ์ „์ฐจ ๋ฐ–์—์„œ ๊ธฐ๋‹ค๋ฆฌ๊ณ  ์žˆ๋‹ค.', '๊ณ ๊ธ‰ ํŒจ์…˜ ์•„๊ฐ€์”จ๋“ค์ด ๋„์‹œ์˜ ์‚ฌ๋žŒ๋“ค ์˜†์—์„œ ์ „์ฐจ ๋ฐ–์—์„œ ๊ธฐ๋‹ค๋ฆฌ๊ณ  ์žˆ๋‹ค.', '๊ณ ๊ธ‰ ํŒจ์…˜ ์•„๊ฐ€์”จ๋“ค์ด ๋„์‹œ์˜ ์‚ฌ๋žŒ๋“ค ์˜†์—์„œ ์ „์ฐจ ๋ฐ–์—์„œ ๊ธฐ๋‹ค๋ฆฌ๊ณ  ์žˆ๋‹ค.', 'ํ•ด๋ณ€์—์„œ ์ฆ๊ธฐ๋Š” ๋‚จ์ž, ์—ฌ์ž, ์•„์ด.'], 'hypothesis': ['ํ•œ ์‚ฌ๋žŒ์ด ๊ฒฝ์Ÿ์„ ์œ„ํ•ด ๋ง์„ ํ›ˆ๋ จ์‹œํ‚ค๊ณ  ์žˆ๋‹ค.', 'ํ•œ ์‚ฌ๋žŒ์ด ์‹๋‹น์—์„œ ์˜ค๋ฏˆ๋ ›์„ ์ฃผ๋ฌธํ•˜๊ณ  ์žˆ๋‹ค.', '์‚ฌ๋žŒ์€ ์•ผ์™ธ์—์„œ ๋ง์„ ํƒ€๊ณ  ์žˆ๋‹ค.', '๊ทธ๋“ค์€ ๋ถ€๋ชจ๋‹˜์„ ๋ณด๊ณ  ์›ƒ๊ณ  ์žˆ๋‹ค', '์•„์ด๋“ค์ด ์žˆ๋‹ค', '์•„์ด๋“ค์ด ์–ผ๊ตด์„ ์ฐŒํ‘ธ๋ฆฌ๊ณ  ์žˆ๋‹ค', '์†Œ๋…„์€ ์ธ๋„๋ฅผ ๋”ฐ๋ผ ์Šค์ผ€์ดํŠธ๋ฅผ ํƒ„๋‹ค.', '๊ทธ ์†Œ๋…„์€ ์Šค์ผ€์ดํŠธ๋ณด๋“œ๋ฅผ ํƒ€๋Š” ๋ฌ˜๊ธฐ๋ฅผ ๋ถ€๋ฆฐ๋‹ค.', '์†Œ๋…„์ด ์•ˆ์ „ ์žฅ๋น„๋ฅผ ์ฐฉ์šฉํ•˜๊ณ  ์žˆ๋‹ค.', '๋‚˜์ด ๋“  ๋‚จ์ž๊ฐ€ ๋”ธ์ด ํ‡ด๊ทผํ•˜๊ธฐ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๋ฉด์„œ ์ฃผ์Šค๋ฅผ ๋งˆ์‹ ๋‹ค.', 'ํ•œ ์†Œ๋…„์ด ํ–„๋ฒ„๊ฑฐ๋ฅผ ๋’ค์ง‘๋Š”๋‹ค.', 'ํ•œ ๋…ธ์ธ์ด ์ž‘์€ ๊ฐ€๊ฒŒ์— ์•‰์•„ ์žˆ๋‹ค.', '๋ช‡๋ช‡ ์—ฌ์„ฑ๋“ค์€ ํœด๊ฐ€ ๋•Œ ํฌ์˜น์„ ํ•˜๊ณ  ์žˆ๋‹ค.', '์—ฌ์ž๋“ค์ด ์ž๊ณ  ์žˆ๋‹ค.', '์• ์ •์„ ๋ณด์ด๋Š” ์—ฌ์ž๋“ค์ด ์žˆ๋‹ค.', '์‚ฌ๋žŒ๋“ค์ด ์˜ค๋ฏˆ๋ ›์„ ๋จน๊ณ  ์žˆ๏ฟฝ๏ฟฝ.', '์‚ฌ๋žŒ๋“ค์ด ํ•™๊ต ์ฑ…์ƒ์— ์•‰์•„ ์žˆ๋‹ค.', '์†๋‹˜๋“ค์ด ์‹๋‹น์— ์žˆ๋‹ค.', '๋‚จ์ž๊ฐ€ ์ฃผ์Šค๋ฅผ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', '๋‘ ์—ฌ์ž๊ฐ€ ์‹๋‹น์—์„œ ์™€์ธ์„ ๋งˆ์‹œ๊ณ  ์žˆ๋‹ค.', '์‹๋‹น์—์„œ ํ•œ ๋‚จ์ž๊ฐ€ ์‹์‚ฌ๊ฐ€ ๋„์ฐฉํ•˜๊ธฐ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๊ณ  ์žˆ๋‹ค.', '๊ณต์›์˜ ๋ถ„์ˆ˜๋Œ€์—์„œ ๋ฌผ์„ ๋งˆ์‹œ๋Š” ๊ธˆ๋ฐœ์˜ ๋‚จ์ž.', '๊ฐˆ์ƒ‰ ์…”์ธ ๋ฅผ ์ž…์€ ๊ธˆ๋ฐœ ๋‚จ์ž๊ฐ€ ๊ณต์›์˜ ๋ฒค์น˜์—์„œ ์ฑ…์„ ์ฝ๊ณ  ์žˆ๋‹ค.', '๋ถ„์ˆ˜๋Œ€์—์„œ ๋ฌผ์„ ๋งˆ์‹œ๋Š” ๊ธˆ๋ฐœ ๋‚จ์ž.', '์นœ๊ตฌ๋“ค์€ ์ €๋… ์‹ํƒ์„ ๊ฐ€๋“ ์ฑ„์šฐ๊ณ  ์„œ๋กœ๋ฅผ ๋…ธ๋ ค๋ณธ๋‹ค.', '์ด ์‚ฌ์ง„์—๋Š” ๋‘ ์—ฌ์ž๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.', '์นœ๊ตฌ๋“ค์€ 20๋…„ ๋งŒ์— ์ฒ˜์Œ ๋งŒ๋‚ฌ๊ณ , ๋”ฐ๋ผ์žก๋Š” ๋ฐ ์ฆ๊ฑฐ์šด ์‹œ๊ฐ„์„ ๋ณด๋ƒˆ๋‹ค.', '๋‘ ์ž๋งค๋Š” ๋ถ๋น„๋Š” ์‹๋‹น์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ์„œ๋กœ๋ฅผ ๋ณด๊ณ  ํฌ์˜น์„ ๋‚˜๋ˆ„๋ฉฐ ๋‘˜ ๋‹ค ๊ฐœ ๊ฐ€๋ฐฉ์„ ์›€์ผœ์žก์•˜๋‹ค.', '๋‘ ๊ทธ๋ฃน์˜ ๋ผ์ด๋ฒŒ ๊ฐฑ๋‹จ์›๋“ค์ด ์„œ๋กœ๋ฅผ ๋ฐฐ์‹ ํ–ˆ๋‹ค.', '๋‘ ์—ฌ์ž๊ฐ€ ์„œ๋กœ ๊ปด์•ˆ๋Š”๋‹ค.', 'ํ•œ ํŒ€์ด ์Šน๋ฆฌ๋ฅผ ๊ฑฐ๋‘๊ธฐ ์œ„ํ•ด ๋“์ ์„ ์‹œ๋„ํ•˜๊ณ  ์žˆ๋‹ค.', 'ํ•œ ํŒ€์ด ์ฃผ์ž๋ฅผ ๋”ฐ๋Œ๋ฆฌ๋ ค ํ•˜๊ณ  ์žˆ๋‹ค.', 'ํ•œ ํŒ€์ด ํ† ์„ฑ์—์„œ ์•ผ๊ตฌ๋ฅผ ํ•˜๊ณ  ์žˆ๋‹ค.', 'ํ•œ ํ•™๊ต๊ฐ€ ๋†๊ตฌ ๊ฒฝ๊ธฐ๋ฅผ ์ฃผ์ตœํ•œ๋‹ค.', 'ํ•œ ๊ณ ๋“ฑํ•™๊ต๊ฐ€ ํ–‰์‚ฌ๋ฅผ ์ฃผ์ตœํ•˜๊ณ  ์žˆ๋‹ค.', 'ํ•œ ํ•™๊ต๊ฐ€ ํ–‰์‚ฌ๋ฅผ ์ฃผ์ตœํ•˜๊ณ  ์žˆ๋‹ค.', '์—ฌ์ž๋“ค์€ ์–ด๋–ค ์˜ท์„ ์ž…๋“  ์ƒ๊ด€ํ•˜์ง€ ์•Š๋Š”๋‹ค.', '์—ฌ์ž๋“ค์ด ์ „์ฐจ ์˜†์—์„œ ๊ธฐ๋‹ค๋ฆฌ๊ณ  ์žˆ๋‹ค.', '์—ฌ์„ฑ๋“ค์€ ์ข‹์€ ํŒจ์…˜ ๊ฐ๊ฐ์„ ๊ฐ–๋Š” ๊ฒƒ์„ ์ฆ๊ธด๋‹ค.', '์—ฌ๋ฆ„๋ฐฉํ•™ ๋•Œ ํ•ด๋ณ€์—์„œ ์—„๋งˆ ์•„๋น ์™€ ํ•จ๊ป˜ ์žˆ๋Š” ์•„์ด.'], 'label': [1, 2, 0, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1, 2, 0, 1, 2, 0, 0, 2, 1, 1, 2, 0, 2, 0, 1, 1, 2, 0, 1, 0, 2, 2, 1, 0, 2, 0, 1, 1]}\n"
622
+ ]
623
+ },
624
+ {
625
+ "data": {
626
+ "text/plain": [
627
+ "{'input_ids': [[0, 1041, 2069, 1763, 3611, 2052, 8514, 2336, 7046, 7587, 5603, 17290, 2062, 18, 2, 1891, 3611, 2052, 3855, 2069, 3627, 1041, 2069, 4484, 2067, 2089, 2088, 1513, 2062, 18, 2], [0, 1041, 2069, 1763, 3611, 2052, 8514, 2336, 7046, 7587, 5603, 17290, 2062, 18, 2, 1891, 3611, 2052, 5499, 27135, 3, 4867, 19521, 1513, 2062, 18, 2], [0, 1041, 2069, 1763, 3611, 2052, 8514, 2336, 7046, 7587, 5603, 17290, 2062, 18, 2, 3611, 2073, 8296, 27135, 1041, 2069, 11532, 1513, 2062, 18, 2], [0, 5677, 2170, 1474, 2088, 1284, 2069, 18882, 2259, 3651, 2031, 2, 636, 2031, 2073, 4267, 2098, 2069, 4530, 1474, 2088, 1513, 2062, 2], [0, 5677, 2170, 1474, 2088, 1284, 2069, 18882, 2259, 3651, 2031, 2, 3651, 7285, 1513, 2062, 2], [0, 5677, 2170, 1474, 2088, 1284, 2069, 18882, 2259, 3651, 2031, 2, 3651, 7285, 3977, 2069, 19556, 2088, 1513, 2062, 2], [0, 1891, 5950, 2052, 8013, 5035, 6853, 3832, 19090, 11507, 2170, 5603, 5667, 2088, 1513, 2062, 18, 2, 5950, 2073, 4543, 2138, 3653, 19090, 2138, 25695, 18, 2], [0, 1891, 5950, 2052, 8013, 5035, 6853, 3832, 19090, 11507, 2170, 5603, 5667, 2088, 1513, 2062, 18, 2, 636, 5950, 2073, 19090, 11507, 2138, 1761, 2259, 1087, 2015, 2138, 29121, 2062, 18, 2], [0, 1891, 5950, 2052, 8013, 5035, 6853, 3832, 19090, 11507, 2170, 5603, 5667, 2088, 1513, 2062, 18, 2, 5950, 2052, 4040, 5424, 2138, 7845, 19521, 1513, 2062, 18, 2], [0, 4358, 880, 3997, 2116, 18970, 2079, 1518, 2073, 6889, 2170, 11150, 12228, 2138, 882, 2088, 1379, 2227, 1513, 2088, 1124, 2073, 1245, 10727, 2138, 1511, 2073, 4070, 2031, 2073, 873, 27135, 5658, 2138, 1590, 2088, 1513, 2062, 18, 2, 4358, 880, 3997, 2116, 900, 2052, 8194, 31302, 2138, 5037, 31369, 12228, 2138, 23457, 18, 2], [0, 4358, 880, 3997, 2116, 18970, 2079, 1518, 2073, 6889, 2170, 11150, 12228, 2138, 882, 2088, 1379, 2227, 1513, 2088, 1124, 2073, 1245, 10727, 2138, 1511, 2073, 4070, 2031, 2073, 873, 27135, 5658, 2138, 1590, 2088, 1513, 2062, 18, 2, 1891, 5950, 2052, 14995, 2138, 6762, 2259, 2062, 18, 2], [0, 4358, 880, 3997, 2116, 18970, 2079, 1518, 2073, 6889, 2170, 11150, 12228, 2138, 882, 2088, 1379, 2227, 1513, 2088, 1124, 2073, 1245, 10727, 2138, 1511, 2073, 4070, 2031, 2073, 873, 27135, 5658, 2138, 1590, 2088, 1513, 2062, 18, 2, 1891, 4662, 2052, 1518, 2073, 6042, 2170, 1379, 2227, 1513, 2062, 18, 2], [0, 21459, 3883, 867, 2052, 4084, 15322, 2088, 1513, 2062, 18, 2, 7396, 3811, 2031, 2073, 6493, 904, 20957, 2069, 6159, 1513, 2062, 18, 2], [0, 21459, 3883, 867, 2052, 4084, 15322, 2088, 1513, 2062, 18, 2, 3883, 7285, 5883, 1513, 2062, 18, 2], [0, 21459, 3883, 867, 2052, 4084, 15322, 2088, 1513, 2062, 18, 2, 7689, 2069, 3783, 2259, 3883, 7285, 1513, 2062, 18, 2], [0, 5499, 2170, 1513, 2259, 7396, 3611, 2031, 16, 636, 2031, 1570, 1891, 1076, 2073, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 3611, 7285, 3, 1059, 2088, 1513, 2062, 18, 2], [0, 5499, 2170, 1513, 2259, 7396, 3611, 2031, 16, 636, 2031, 1570, 1891, 1076, 2073, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 3611, 7285, 3741, 7961, 2170, 1379, 2227, 1513, 2062, 18, 2], [0, 5499, 2170, 1513, 2259, 7396, 3611, 2031, 16, 636, 2031, 1570, 1891, 1076, 2073, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 6654, 7285, 5499, 2170, 1513, 2062, 18, 2], [0, 1891, 4358, 2778, 3997, 2116, 5499, 27135, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 3997, 2116, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2], [0, 1891, 4358, 2778, 3997, 2116, 5499, 27135, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 864, 3883, 2116, 5499, 27135, 6612, 2069, 5012, 2088, 1513, 2062, 18, 2], [0, 1891, 4358, 2778, 3997, 2116, 5499, 27135, 11150, 12228, 2138, 5012, 2088, 1513, 2062, 18, 2, 5499, 27135, 1891, 3997, 2116, 5067, 2116, 5082, 31302, 2138, 5037, 2088, 1513, 2062, 18, 2], [0, 21459, 2079, 3997, 16, 3673, 4437, 11790, 2104, 27135, 1299, 2069, 5012, 2259, 14008, 10727, 18, 2, 4599, 2079, 11790, 2104, 27135, 1093, 2069, 5012, 2259, 21459, 2079, 3997, 18, 2], [0, 21459, 2079, 3997, 16, 3673, 4437, 11790, 2104, 27135, 1299, 2069, 5012, 2259, 14008, 10727, 18, 2, 14008, 10727, 2138, 1511, 2073, 21459, 3997, 2116, 4599, 2079, 9262, 27135, 1644, 2069, 1508, 2088, 1513, 2062, 18, 2], [0, 21459, 2079, 3997, 16, 3673, 4437, 11790, 2104, 27135, 1299, 2069, 5012, 2259, 14008, 10727, 18, 2, 11790, 2104, 27135, 1093, 2069, 5012, 2259, 21459, 3997, 18, 2], [0, 9684, 5961, 2069, 1059, 2088, 17812, 2179, 2063, 2138, 1891, 864, 3883, 18, 2, 3949, 2031, 2073, 4750, 10160, 2069, 4983, 8140, 2088, 4084, 2138, 10631, 7471, 18, 2], [0, 9684, 5961, 2069, 1059, 2088, 17812, 2179, 2063, 2138, 1891, 864, 3883, 18, 2, 1504, 4035, 2170, 2259, 864, 3883, 2116, 1513, 2219, 3606, 18, 2], [0, 9684, 5961, 2069, 1059, 2088, 17812, 2179, 2063, 2138, 1891, 864, 3883, 18, 2, 3949, 2031, 2073, 3619, 2440, 1038, 2170, 3790, 5836, 2088, 16, 15856, 2259, 842, 7924, 3641, 2069, 5755, 2062, 18, 2], [0, 864, 3883, 2116, 4182, 11047, 6153, 2138, 882, 2088, 20957, 2069, 1902, 2062, 18, 2, 864, 9646, 2259, 26576, 2259, 5499, 2069, 19642, 4084, 2138, 4530, 20957, 2069, 4835, 2307, 867, 809, 558, 6840, 2069, 13955, 2741, 2886, 2062, 18, 2], [0, 864, 3883, 2116, 4182, 11047, 6153, 2138, 882, 2088, 20957, 2069, 1902, 2062, 18, 2, 864, 4063, 2079, 12574, 563, 2286, 2252, 7285, 4084, 2138, 10004, 2371, 2062, 18, 2], [0, 864, 3883, 2116, 4182, 11047, 6153, 2138, 882, 2088, 20957, 2069, 1902, 2062, 18, 2, 864, 3883, 2116, 4084, 15322, 2259, 2062, 18, 2], [0, 18023, 17665, 1823, 2052, 4082, 3682, 27135, 9763, 2200, 21371, 2259, 6788, 2138, 1523, 2279, 10554, 3605, 18, 2, 1891, 1823, 2052, 4644, 2138, 6387, 2015, 3627, 6300, 2069, 4703, 19521, 1513, 2062, 18, 2], [0, 18023, 17665, 1823, 2052, 4082, 3682, 27135, 9763, 2200, 21371, 2259, 6788, 2138, 1523, 2279, 10554, 3605, 18, 2, 1891, 1823, 2052, 6788, 2138, 18841, 2370, 6159, 1513, 2062, 18, 2], [0, 18023, 17665, 1823, 2052, 4082, 3682, 27135, 9763, 2200, 21371, 2259, 6788, 2138, 1523, 2279, 10554, 3605, 18, 2, 1891, 1823, 2052, 24820, 27135, 4878, 2138, 6159, 1513, 2062, 18, 2], [0, 1504, 3741, 2259, 3656, 3697, 7285, 7291, 27135, 3842, 5778, 4379, 18246, 2170, 3618, 3666, 3697, 2138, 3897, 2223, 2015, 3627, 4014, 2470, 3925, 2138, 1432, 2088, 1513, 2062, 18, 2, 1891, 3741, 2116, 7124, 3682, 2138, 6771, 4538, 18, 2], [0, 1504, 3741, 2259, 3656, 3697, 7285, 7291, 27135, 3842, 5778, 4379, 18246, 2170, 3618, 3666, 3697, 2138, 3897, 2223, 2015, 3627, 4014, 2470, 3925, 2138, 1432, 2088, 1513, 2062, 18, 2, 1891, 5868, 2116, 3925, 2138, 6771, 19521, 1513, 2062, 18, 2], [0, 1504, 3741, 2259, 3656, 3697, 7285, 7291, 27135, 3842, 5778, 4379, 18246, 2170, 3618, 3666, 3697, 2138, 3897, 2223, 2015, 3627, 4014, 2470, 3925, 2138, 1432, 2088, 1513, 2062, 18, 2, 1891, 3741, 2116, 3925, 2138, 6771, 19521, 1513, 2062, 18, 2], [0, 5399, 5179, 9019, 7285, 3763, 2079, 3611, 2031, 1438, 27135, 13533, 4044, 2112, 5037, 2088, 1513, 2062, 18, 2, 3883, 2031, 2073, 3711, 1451, 2069, 1511, 2778, 5468, 2205, 2118, 1380, 2259, 2062, 18, 2], [0, 5399, 5179, 9019, 7285, 3763, 2079, 3611, 2031, 1438, 27135, 13533, 4044, 2112, 5037, 2088, 1513, 2062, 18, 2, 3883, 7285, 13533, 1438, 27135, 5037, 2088, 1513, 2062, 18, 2], [0, 5399, 5179, 9019, 7285, 3763, 2079, 3611, 2031, 1438, 27135, 13533, 4044, 2112, 5037, 2088, 1513, 2062, 18, 2, 3811, 2031, 2073, 1560, 2073, 5179, 5700, 2069, 554, 2259, 575, 2069, 21589, 18, 2], [0, 9738, 27135, 5380, 2259, 3997, 16, 3883, 16, 3651, 18, 2, 4565, 2239, 2218, 904, 9738, 27135, 4122, 5091, 2522, 3655, 1513, 2259, 3651, 18, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}"
628
+ ]
629
+ },
630
+ "execution_count": 21,
631
+ "metadata": {},
632
+ "output_type": "execute_result"
633
+ }
634
+ ],
635
+ "source": [
636
+ "preprocess_function(datasets[\"train\"][:40])"
637
+ ]
638
+ },
639
+ {
640
+ "cell_type": "markdown",
641
+ "id": "7b8b48f7",
642
+ "metadata": {},
643
+ "source": [
644
+ "์•ž์„œ ์ •์˜ํ•œ process_function์€ ์œ„์™€ ๊ฐ™์ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์˜ˆ์ œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์„ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค."
645
+ ]
646
+ },
647
+ {
648
+ "cell_type": "code",
649
+ "execution_count": null,
650
+ "id": "07593d57",
651
+ "metadata": {},
652
+ "outputs": [],
653
+ "source": [
654
+ "encoded_datasets = datasets.map(preprocess_function, batched=True)"
655
+ ]
656
+ },
657
+ {
658
+ "cell_type": "markdown",
659
+ "id": "7bbf64ea",
660
+ "metadata": {},
661
+ "source": [
662
+ "์ด์ œ ์ •์˜๋œ ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•ด ๋ฐ์ดํ„ฐ์…‹์„ ๋ฏธ๋ฆฌ ํ† ํฐํ™”์‹œํ‚ค๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.\n",
663
+ "\n",
664
+ "datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ์–ป์–ด์ง„ DatasetDict ๊ฐ์ฒด๋Š” map() ํ•จ์ˆ˜๋ฅผ ์ง€์›ํ•˜๋ฏ€๋กœ, ์ •์˜๋œ ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋ฅผ ๋ฐ์ดํ„ฐ์…‹ ํ† ํฐํ™”๋ฅผ ์œ„ํ•œ ์ฝœ๋ฐฑ ํ•จ์ˆ˜๋กœ map() ํ•จ์ˆ˜ ์ธ์ž๋กœ ๋„˜๊ฒจ์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค."
665
+ ]
666
+ },
667
+ {
668
+ "cell_type": "code",
669
+ "execution_count": null,
670
+ "id": "840516c6",
671
+ "metadata": {},
672
+ "outputs": [],
673
+ "source": [
674
+ "num_labels = 3\n",
675
+ "model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)"
676
+ ]
677
+ },
678
+ {
679
+ "cell_type": "markdown",
680
+ "id": "80281b6f",
681
+ "metadata": {},
682
+ "source": [
683
+ "ํ•™์Šต์„ ์œ„ํ•œ ๋ชจ๋ธ์„ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค.\n",
684
+ "\n",
685
+ "์•ž์„œ ์‚ดํŽด๋ณธ ๋ฐ”์™€ ๊ฐ™์ด KorNLI์—๋Š” ์ด 3๊ฐœ์˜ ํด๋ž˜์Šค๊ฐ€ ์กด์žฌํ•˜๋ฏ€๋กœ, 3๊ฐœ์˜ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” SequenceClassification ๊ตฌ์กฐ๋กœ ๋ชจ๋ธ์„ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค.\n",
686
+ "\n",
687
+ "๋ชจ๋ธ์„ ๋กœ๋“œํ•  ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ๊ณ  ๋ฌธ๊ตฌ๋Š” ๋‘ ๊ฐ€์ง€ ์˜๋ฏธ๋ฅผ ์ง€๋‹™๋‹ˆ๋‹ค.\n",
688
+ "1. Masked Language Modeling ์„ ์œ„ํ•ด ์กด์žฌํ–ˆ๋˜ lm_head๊ฐ€ ํ˜„์žฌ๋Š” ์‚ฌ์šฉ๋˜์ง€ ์•Š๊ณ  ์žˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.\n",
689
+ "2. ๋ฌธ์žฅ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ classifier ๋ ˆ์ด์–ด๋ฅผ ๋ฐฑ๋ณธ ๋ชจ๋ธ ๋’ค์— ์ด์–ด ๋ถ™์˜€์œผ๋‚˜ ์•„์ง ํ›ˆ๋ จ์ด ๋˜์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ, ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค."
690
+ ]
691
+ },
692
+ {
693
+ "cell_type": "code",
694
+ "execution_count": null,
695
+ "id": "ea50be8c",
696
+ "metadata": {},
697
+ "outputs": [],
698
+ "source": [
699
+ "def compute_metrics(eval_pred):\n",
700
+ " predictions, labels = eval_pred\n",
701
+ " predictions = np.argmax(predictions, axis=1)\n",
702
+ " return metric.compute(predictions=predictions, references=labels)"
703
+ ]
704
+ },
705
+ {
706
+ "cell_type": "markdown",
707
+ "id": "d62aef1e",
708
+ "metadata": {},
709
+ "source": [
710
+ "๋งˆ์ง€๋ง‰์œผ๋กœ ์•ž์„œ ์ •์˜ํ•œ ๋ฉ”ํŠธ๋ฆญ์„ ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐ๊ณผ์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.\n",
711
+ "\n",
712
+ "์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜ค๋Š” eval_pred๋Š” EvalPrediction ๊ฐ์ฒด์ด๋ฉฐ, ๋ชจ๋ธ์˜ ํด๋ž˜์Šค ๋ณ„ ์˜ˆ์ธก ๊ฐ’๊ณผ ์ •๋‹ต ๊ฐ’์„ ์ง€๋‹™๋‹ˆ๋‹ค.\n",
713
+ "\n",
714
+ "ํด๋ž˜์Šค ๋ณ„ ์˜ˆ์ธก ์ค‘ ๊ฐ€์žฅ ๋†’์€ ๋ผ๋ฒจ์„ argmax()๋ฅผ ํ†ตํ•ด ๋ฝ‘์•„๋‚ธ ํ›„, ์ •๋‹ต ๋ผ๋ฒจ๊ณผ ๋น„๊ต๋ฅผ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค."
715
+ ]
716
+ },
717
+ {
718
+ "cell_type": "code",
719
+ "execution_count": null,
720
+ "id": "09688900",
721
+ "metadata": {},
722
+ "outputs": [],
723
+ "source": [
724
+ "metric_name = \"accuracy\"\n",
725
+ "\n",
726
+ "args = TrainingArguments(\n",
727
+ " \"test-nli\",\n",
728
+ " evaluation_strategy=\"epoch\",\n",
729
+ " learning_rate=2e-5,\n",
730
+ " per_device_train_batch_size=batch_size,\n",
731
+ " per_device_eval_batch_size=batch_size,\n",
732
+ " num_train_epochs=5,\n",
733
+ " weight_decay=0.01,\n",
734
+ " load_best_model_at_end=True,\n",
735
+ " metric_for_best_model=metric_name,\n",
736
+ ")"
737
+ ]
738
+ },
739
+ {
740
+ "cell_type": "markdown",
741
+ "id": "9a768407",
742
+ "metadata": {},
743
+ "source": [
744
+ "์ด์ œ ์•ž์„œ ์ •์˜ํ•œ ์ •๋ณด๋“ค์„ ๋ฐ”ํƒ•์œผ๋กœ transformers์—์„œ ์ œ๊ณตํ•˜๋Š” Trainer ๊ฐ์ฒด๋ฅผ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•œ ์ธ์ž ๊ด€๋ฆฌ ํด๋ž˜์Šค๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.\n",
745
+ "\n",
746
+ "metric_name์€ ์•ž์„œ ์–ป์–ด์ง„ ๋ฉ”ํŠธ๋ฆญ ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ–ˆ์„ ๋•Œ, ์•„๋ž˜์™€ ๊ฐ™์ด dict ํ˜•์‹์œผ๋กœ ๊ฒฐ๊ณผ ๊ฐ’์ด ๋ฐ˜ํ™˜๋˜๋Š”๋ฐ ์—ฌ๊ธฐ์„œ ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•  key ๋ฅผ ์ •์˜ํ•ด์ค€๋‹ค๊ณ  ์ƒ๊ฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.\n"
747
+ ]
748
+ },
749
+ {
750
+ "cell_type": "code",
751
+ "execution_count": null,
752
+ "id": "080c165e",
753
+ "metadata": {},
754
+ "outputs": [],
755
+ "source": [
756
+ "trainer = Trainer(\n",
757
+ " model,\n",
758
+ " args,\n",
759
+ " train_dataset=encoded_datasets[\"train\"],\n",
760
+ " eval_dataset=encoded_datasets[\"validation\"],\n",
761
+ " tokenizer=tokenizer,\n",
762
+ " compute_metrics=compute_metrics,\n",
763
+ ")"
764
+ ]
765
+ },
766
+ {
767
+ "cell_type": "markdown",
768
+ "id": "03a9041d",
769
+ "metadata": {},
770
+ "source": [
771
+ "์ด์ œ ๋กœ๋“œํ•œ ๋ชจ๋ธ, ์ธ์ž ๊ด€๋ฆฌ ํด๋ž˜์Šค, ๋ฐ์ดํ„ฐ์…‹ ๋“ฑ์„ ์ด์šฉํ•˜์—ฌ Trainer๋ฅผ ์ดˆ๊ธฐํ™” ํ•ด์ค๋‹ˆ๋‹ค."
772
+ ]
773
+ },
774
+ {
775
+ "cell_type": "code",
776
+ "execution_count": null,
777
+ "id": "5cdf8b98",
778
+ "metadata": {},
779
+ "outputs": [],
780
+ "source": [
781
+ "trainer.train()"
782
+ ]
783
+ },
784
+ {
785
+ "cell_type": "markdown",
786
+ "id": "515e7b86",
787
+ "metadata": {},
788
+ "source": [
789
+ "์•ž์„œ ์ดˆ๊ธฐํ™”ํ•œ trainer ๊ฐ์ฒด๋ฅผ ํ›ˆ๋ จ์‹œํ‚ต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์„œ ๊ทธ๋Ÿฐ์ง€ ์ œ ํ™˜๊ฒฝ์—์„œ๋Š” ์•ฝ 3์‹œ๊ฐ„ 30๋ถ„ ์ •๋„ ๊ฑธ๋ ธ๋„ค์š”."
790
+ ]
791
+ },
792
+ {
793
+ "cell_type": "code",
794
+ "execution_count": null,
795
+ "id": "c8a0e594",
796
+ "metadata": {},
797
+ "outputs": [],
798
+ "source": [
799
+ "trainer.evaluate()"
800
+ ]
801
+ },
802
+ {
803
+ "cell_type": "markdown",
804
+ "id": "0eba9443",
805
+ "metadata": {},
806
+ "source": [
807
+ "ํ•™์Šตํ•œ ๋ชจ๋ธ์„ evalutate()๋ฅผ ํ†ตํ•ด ํ‰๊ฐ€๋ฅผ ํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ํ™•์ธํ•ด ๋ด…๋‹ˆ๋‹ค."
808
+ ]
809
+ },
810
+ {
811
+ "cell_type": "code",
812
+ "execution_count": null,
813
+ "id": "0f57d941",
814
+ "metadata": {},
815
+ "outputs": [],
816
+ "source": [
817
+ "trainer.save_model(\"./\")"
818
+ ]
819
+ },
820
+ {
821
+ "cell_type": "markdown",
822
+ "id": "69a124cf",
823
+ "metadata": {},
824
+ "source": [
825
+ "ํ•™์Šตํ•œ ๋ชจ๋ธ์„ ๋‚˜์ค‘์— ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ €์žฅ ๊ฒฝ๋กœ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค. ์ €๋Š” ํ˜„์žฌ ํด๋”์— ์ €์žฅํ•˜์˜€์Šต๋‹ˆ๋‹ค.\n",
826
+ "\n",
827
+ "## ๋งˆ์น˜๋ฉฐ\n",
828
+ "\n",
829
+ "์ œ๊ฐ€ ์ธ๊ณต์ง€๋Šฅ์„ ์‹œ์ž‘ํ•œ์ง€ ์–ผ๋งˆ ๋˜์ง€ ์•Š์•„ ๊ฐœ๋…์ด ๋œ ์žกํ˜€์žˆ๊ณ , ์ •๋ง ์–ด๋ ค์›Œ ํ–ˆ์—ˆ๋Š”๋ฐ Huffon๋‹˜์˜ [github](https://github.com/Huffon)์— ์˜ฌ๋ผ๊ฐ€์žˆ๋Š” ๋…ธํŠธ๋ถ์„ ๋ณด๋ฉฐ ๋งŽ์ด ๋ฐฐ์šด ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ธํŠธ๋ถ์€ Huffon๋‹˜์˜ ์˜ˆ์ œ ์ฝ”๋“œ์™€ ์„ค๋ช…์„ ๋”ฐ๋ผ์นœ ๊ฒƒ์ด ๋Œ€๋ถ€๋ถ„์ด์ง€๋งŒ ๋ˆˆ์œผ๋กœ ๋ณด๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ ํ•ด๋ณด๋Š” ๊ฒƒ์˜ ์ฐจ์ด๊ฐ€ ์—„์ฒญ ํฌ๋‹ค๋Š” ๊ฒƒ์„ ๋Š๊ผˆ์Šต๋‹ˆ๋‹ค."
830
+ ]
831
+ }
832
+ ],
833
+ "metadata": {
834
+ "kernelspec": {
835
+ "display_name": "Python 3",
836
+ "language": "python",
837
+ "name": "python3"
838
+ },
839
+ "language_info": {
840
+ "codemirror_mode": {
841
+ "name": "ipython",
842
+ "version": 3
843
+ },
844
+ "file_extension": ".py",
845
+ "mimetype": "text/x-python",
846
+ "name": "python",
847
+ "nbconvert_exporter": "python",
848
+ "pygments_lexer": "ipython3",
849
+ "version": "3.7.9"
850
+ }
851
+ },
852
+ "nbformat": 4,
853
+ "nbformat_minor": 5
854
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:714a296e4cf0c0f584471b4291f7b8a2af77675c9a4f98feba067c674209ed78
3
+ size 442575305
special_tokens_map.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "bos_token": "[CLS]", "eos_token": "[SEP]", "model_max_length": 512, "special_tokens_map_file": "/workspace/.cache/huggingface/transformers/9d0c87e44b00acfbfbae931b2e4068eb6311a0c3e71e23e5400bdf57cab4bfbf.70c17d6e4d492c8f24f5bb97ab56c7f272e947112c6faf9dd846da42ba13eb23", "name_or_path": "klue/roberta-base", "tokenizer_class": "BertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66f689c35735c17dcdcddb56ac4d319208e4c9f41b16b09e71fe2d3fdec197dc
3
+ size 2607
vocab.txt ADDED
The diff for this file is too large to render. See raw diff