{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "68c06a52-e27c-4da6-8a02-cd010270bedf",
   "metadata": {},
   "source": [
    "# 3 datasets库基本使用"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dc4c70f-694c-4785-81d8-26ebab2b7210",
   "metadata": {},
   "source": [
    "## 基本使用\n",
    "上一节中,已经介绍了使用datasets读取本地文件的方法,这一节继续介绍datasets一些常用的方法\n",
    "\n",
    "首先是数据分割,因为我们从数据源获得DNA序列等数据,都是一个文本文件,但训练的时候,一般都需要分成训练集和测试集等\n",
    "\n",
    "一个简单的例子如下所示:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6e9f346f-31f6-40cc-86e5-723c65033883",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 1025615\n",
       "    })\n",
       "    test: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 53980\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#读取dna数据\n",
    "from datasets import load_dataset\n",
    "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')['train'].rain_test_split(test_size=0.05) #默认已经shuffle\n",
    "dna_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "75900650-74da-4ca9-a285-b2832a5a1485",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'text': 'ATGTGTGCAATGGGTTATCTTTATGTAATAACAGTCATATCACGGGTGTTCCTCAGAAGTAGTGAACTGGCTAGCATTTTTAGACACTATGTGATCTCTCATATGACTACACTCAATTTAAAATAAAATGAAATGTGTTGTGTGTGTCTAAAATCTATAAAGGGAAAAGTATCTTAAGTATTTTTTAGATGTTAAAGTAGATGTGTATCCTAAAATATGCATTGTTCACAGATGTTAAAATTACAACTACAATCTGTGAAACACAGATCTTAGGACAGCAATGTTTCACAAGAAAAAAAATGATGCAGCCTTCTTTAGTATTTATAGTCATTTGAACAATTATGGCAACCATAAGTTCATATATAACATCCCCATTTGGTGAAACTAGTTGGGAAAGATTAGAAGGTATGACCTTGTTGGAGGAACTATACCATTGGGGTGGCTTTGAGACTTCAGAAGTTTCAAGGCCCATTTAGTGCTTTCTACCTTATGAAGCTGTGAGTTCTCCTTGCTAGCTACATAACTTGGAAAGCAGGCCCTGCACTTCACCCAAGGAGCACATTAGAGCTGGCCCTTTTGGAAGGCAATTGCGTAAGCCACACCAGGGCACCAGAGATCTGGCACTGCCATGCTCCTGCTTGCAAGTAGTGGTGTGGGTGTTGGGTGATGCCCTCCAGTCCCACCTTTTGCCACCTGTAGTAGTCAGGGGAGTTGGCCTAAGGGCATGAGAGCCTAAGACTTCACCCTAATCCCTCACCAACTGTAGCATGTGGAAGAGCAGGCTCTGTACCTTCCCTGGGCAACACATTGGAGCTGGCCCCTCACAGGCTGCAGGACTTGGGAGAGTGAGTGCTGCACCTTGACTGTGAAGGTGGTTTTGGAGGTGTGGGTGTGAGACCATGAGACCAAGAGAGGAATGGAATATTACTCACTTATTAAAAACAATGACTTCATGAAATTTGCAGGCAAATGGATGGAACTTGAAAATATCCTGAGTGAG'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_dataset[\"test\"][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdcc5404-6590-47a4-be2c-2c1d35d3bae4",
   "metadata": {},
   "source": [
    "可以看到,数据集已经分割成了train和test两个数据集,而在分割的时候,已经进行的随机处理\n",
    "\n",
    "当然,如果数据集过大,我们只需要其中一部分,这个也是一个常见的需求,一般可以使用 Dataset.select() 函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "049ad194-cb60-4b0f-8554-1915bfc7a9cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 50000\n",
       "    })\n",
       "    valid: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 500\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datasets import load_dataset, DatasetDict\n",
    "\n",
    "dna_dataset_sample =  DatasetDict(\n",
    "    {\n",
    "        \"train\": dna_dataset[\"train\"].shuffle().select(range(50000)), \n",
    "        \"valid\": dna_dataset[\"test\"].shuffle().select(range(500)),\n",
    "        \"evla\": dna_dataset[\"test\"].shuffle().select(range(500))\n",
    "\n",
    "    }\n",
    ")\n",
    "dna_dataset_sample"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50cceda3-36ca-4fa6-bfb5-1dbeb155fe4f",
   "metadata": {},
   "source": [
    "可以看到,我们使用DatasetDict来直接构造datasets,先使用shuffle()来随机,然后使用select来选择前n个数据\n",
    "\n",
    "select的参数为indices (list 或 range): 索引列表或范围对象,指明要选择哪些样本,如dataset.select([0, 2, 4])就是选择1,3,5条记录"
   ]
  },
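  {
   "cell_type": "markdown",
   "id": "5b3e9c21-7a4d-4f1e-9c2b-8d6f0a1e2b3c",
   "metadata": {},
   "source": [
    "A minimal sketch of select() with an explicit index list, assuming the dna_dataset from the cells above is still in scope (illustrative, not executed):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e8d7c6b-5a4f-4e3d-8c2b-1a0f9e8d7c6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pick the 1st, 3rd, and 5th records (0-based indices 0, 2, 4)\n",
    "subset = dna_dataset[\"train\"].select([0, 2, 4])\n",
    "print(len(subset))             # 3\n",
    "print(subset[0][\"text\"][:50])  # first 50 characters of the first selected sequence"
   ]
  },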
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "17a1fa7c-ff4b-419f-8a82-e58cc5777cd4",
   "metadata": {},
   "source": [
    "## 读取线上库\n",
    "\n",
    "当然,数据也可以直接从huggingface的线上仓库读取,这时候需要注意科学上网问题。\n",
    "\n",
    "具体使用函数也是load_dataset\n",
    "\n",
    "<img src='img/datasets_dnagpt.png' width='800px' />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6ae24950-2c74-457b-b1f2-d2e4397e1fa1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"\\nimport os\\n\\n# 设置环境变量, autodl专区 其他idc\\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\\n\\n# 打印环境变量以确认设置成功\\nprint(os.environ.get('HF_ENDPOINT'))\\n\""
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import subprocess\n",
    "import os\n",
    "# 设置环境变量, autodl一般区域\n",
    "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
    "output = result.stdout\n",
    "for line in output.splitlines():\n",
    "    if '=' in line:\n",
    "        var, value = line.split('=', 1)\n",
    "        os.environ[var] = value\n",
    "\n",
    "#或者\n",
    "\"\"\"\n",
    "import os\n",
    "\n",
    "# 设置环境变量, autodl专区 其他idc\n",
    "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
    "\n",
    "# 打印环境变量以确认设置成功\n",
    "print(os.environ.get('HF_ENDPOINT'))\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "30ff9798-d06d-4992-81fc-03102f03599b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['sequence', 'label'],\n",
       "        num_rows: 59196\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "dna_data = load_dataset(\"dnagpt/dna_core_promoter\")\n",
    "dna_data"
   ]
  },
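  {
   "cell_type": "markdown",
   "id": "4d3c2b1a-0f9e-4d8c-9b7a-6f5e4d3c2b1a",
   "metadata": {},
   "source": [
    "As a quick check, we can inspect a single record of the downloaded dataset, the same way as with the local file earlier (illustrative, not executed; per the output above each record has a sequence and a label):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a7b6c5d-4e3f-4a2b-9c1d-0e9f8a7b6c5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Each record carries a 'sequence' string and a 'label'\n",
    "dna_data[\"train\"][0]"
   ]
  },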
  {
   "cell_type": "markdown",
   "id": "30c4b754-af11-4ac1-9742-45427059617e",
   "metadata": {},
   "source": [
    "当然,如果你想分享你的数据集到huggingface上面,也是一行函数即可:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9847be9-e085-41e3-ad29-a450cc017d64",
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_data.push_to_hub(\"org_name/your_dataset_name\", token=\"hf_yourtoken\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}