{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting up the virtual environment\n",
    "Python version: 3.9.6  \n",
    "Environment name: `fine_tuning_with_clm`  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Hugging Face Transformers from source\n",
    "# !pip install -r requirements.txt\n",
    "# !pip install numpy --pre torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117\n",
    "!git clone https://github.com/huggingface/transformers -b v4.28.1\n",
    "# !pip install -e transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -r ./transformers/examples/pytorch/language-modeling/requirements.txt"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Edit `./transformers/examples/pytorch/language-modeling/run_clm.py` as follows:\n",
    "\n",
    "```python\n",
    "# Add the T5Tokenizer import\n",
    "from transformers import T5Tokenizer\n",
    "\n",
    "# Replace AutoTokenizer with T5Tokenizer in both tokenizer-loading branches\n",
    "# Before:\n",
    "#   tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)\n",
    "# After:\n",
    "tokenizer = T5Tokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)\n",
    "\n",
    "# Before:\n",
    "#   tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)\n",
    "# After:\n",
    "tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)\n",
    "```"
   ]
  },
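  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running fine-tuning (Jupyter notebook version)\n",
    "\n",
    "The command below can be run directly in this Jupyter notebook.  \n",
    "However, the output is not updated in real time, so there is no way to tell how far training has progressed.  \n",
    "For that reason, run it from the Command Prompt instead, as described in the next section.  "
   ]
  },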
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import os\n",
    "\n",
    "# os.environ['CUDA_VISIBLE_DEVICES'] = '0'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ran for about 280 minutes without finishing\n",
    "\n",
    "# !python ./transformers/examples/pytorch/language-modeling/run_clm.py \\\n",
    "#     --model_name_or_path=rinna/japanese-gpt-1b \\\n",
    "#     --train_file=./train_data/databricks-dolly-15k-ja.json \\\n",
    "#     --output_dir=output \\\n",
    "#     --do_train \\\n",
    "#     --bf16=True \\\n",
    "#     --tf32=True \\\n",
    "#     --optim=adafactor \\\n",
    "#     --num_train_epochs=18 \\\n",
    "#     --save_steps=384 \\\n",
    "#     --logging_steps=38 \\\n",
    "#     --learning_rate=1e-07 \\\n",
    "#     --lr_scheduler_type=constant \\\n",
    "#     --gradient_checkpointing \\\n",
    "#     --per_device_train_batch_size=1 \\\n",
    "#     --save_safetensors=True \\\n",
    "#     --logging_dir=logs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running fine-tuning (Command Prompt version)\n",
    "\n",
    "Prerequisite: the `fine_tuning_with_clm` environment must be activated first.  \n",
    "To activate the virtual environment from the Command Prompt, run `Scripts\\activate` from the environment's directory.  \n",
    "\n",
    "Note: on an RTX 4070, training took about ten and a half hours.  \n",
    "\n",
    "```cmd\n",
    "REM Run the following in the Command Prompt\n",
    "\n",
    "REM Restrict training to the RTX 4070 only\n",
    "set CUDA_VISIBLE_DEVICES=0\n",
    "\n",
    "python ./transformers/examples/pytorch/language-modeling/run_clm.py ^\n",
    "    --model_name_or_path=rinna/japanese-gpt-1b ^\n",
    "    --train_file=./train_data/databricks-dolly-15k-ja.txt ^\n",
    "    --output_dir=output ^\n",
    "    --do_train ^\n",
    "    --bf16=True ^\n",
    "    --tf32=True ^\n",
    "    --optim=adafactor ^\n",
    "    --num_train_epochs=18 ^\n",
    "    --save_steps=384 ^\n",
    "    --logging_steps=38 ^\n",
    "    --learning_rate=1e-07 ^\n",
    "    --lr_scheduler_type=constant ^\n",
    "    --gradient_checkpointing ^\n",
    "    --per_device_train_batch_size=2 ^\n",
    "    --save_safetensors=True ^\n",
    "    --logging_dir=logs\n",
    "```\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference sites\n",
    "\n",
    "|Site|What it was used for|Notes|\n",
    "|--|--|--|\n",
    "|[inu-ai/dolly-japanese-gpt-1b](https://huggingface.co/inu-ai/dolly-japanese-gpt-1b)|Training hyperparameters||\n",
    "|[Datasets:kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja)|Training data||\n",
    "|[Auto-generating Splatoon weapon descriptions (GPT)](https://zenn.dev/thr3a/articles/eed434cb20339a)|Fine-tuning environment setup and how to run it||\n",
    "|[Introduction to Huggingface Transformers (28): Fine-tuning rinna's Japanese GPT-2 model](https://note.com/npaka/n/n8a435f0c8f69)|Fine-tuning environment setup and how to run it||\n",
    "|[Fine-tuning GPT-2 to conditionally generate news article titles](https://qiita.com/m__k/items/36875fedf8ad1842b729)|(For reference) Fine-tuning environment setup and how to run it||\n",
    "|[Google Colab Pro is now available from Japan](https://webbigdata.jp/post-9927/)|(For reference)||"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fine_tuning_with_clm",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}