thefrigidliquidation committed on
Commit
9ab5bc1
•
1 Parent(s): d2b7ec2

Add colab notebook

Files changed (1)
  1. Bookworm_MTL.ipynb +265 -0
Bookworm_MTL.ipynb ADDED
@@ -0,0 +1,265 @@
+ {
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Bookworm MTL.ipynb",
+ "provenance": [],
+ "collapsed_sections": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "accelerator": "GPU",
+ "gpuClass": "standard"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Ascendance of a Bookworm MTL\n",
+ "\n",
+ "This notebook uses a custom machine translation model to translate the Ascendance of a Bookworm WN into English.\n",
+ "\n",
+ "This model is in BETA. Pronouns are not fixed yet, new characters' names may be wrong, and sentence splitting isn't implemented yet, so the model tends to produce a single long sentence. These issues will be fixed in the future.\n",
+ "\n",
+ "If you encounter any poorly translated sentences and want to help improve the model, see the note at the bottom of the page.\n",
+ "\n",
+ "To run this notebook, make sure you are using a GPU runtime and then go to\n",
+ "Runtime > Run all. Once that is done, you can change the text in the translation cell and run it multiple times by clicking the run button to the left of the cell. "
+ ],
+ "metadata": {
+ "id": "nkp0dv1zg93C"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to set up the environment\n",
+ "\n",
+ "!pip install transformers\n",
+ "!pip install accelerate\n",
+ "!pip install unidecode\n",
+ "!pip install spacy\n",
+ "!python -m spacy download ja_core_news_lg"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "nM7cmpX4hl0q"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to import python packages\n",
+ "\n",
+ "from functools import partial\n",
+ "import torch\n",
+ "from torch.cuda.amp import autocast\n",
+ "from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM, NllbTokenizerFast\n",
+ "import spacy\n",
+ "from tqdm.notebook import tqdm\n",
+ "import re\n",
+ "import unidecode\n",
+ "import unicodedata"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "mSnruJt8r3qP"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to set the output language\n",
+ "#@markdown This model is multi-lingual! Here you can set the output language.\n",
+ "#@markdown It is best with English, but it can translate into other\n",
+ "#@markdown languages too. A couple are listed here, but you can enter a different\n",
+ "#@markdown one if you want. See pages 13-16 in [this pdf](https://arxiv.org/pdf/2207.04672.pdf)\n",
+ "#@markdown for a full list of supported languages.\n",
+ "\n",
+ "target_language = 'eng_Latn' #@param [\"eng_Latn\", \"spa_Latn\", \"fra_Latn\", \"deu_Latn\"] {allow-input: true}"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "6w_HfApfhn9j"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to initialize the model\n",
+ "\n",
+ "DEVICE = 'cuda:0'\n",
+ "model_checkpoint = \"thefrigidliquidation/nllb-200-distilled-1.3B-bookworm\"\n",
+ "\n",
+ "config = AutoConfig.from_pretrained(model_checkpoint)\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, src_lang=\"jpn_Jpan\", tgt_lang=target_language)\n",
+ "model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16).to(DEVICE)\n",
+ "\n",
+ "nlp_ja = spacy.load('ja_core_news_lg')"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "cGnkjUgej6Uv"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to set up the code to do the translating\n",
+ "\n",
+ "DOTS_REGEX = re.compile(r\"^(?P<dots>[.…]+)。?$\")\n",
+ "\n",
+ "\n",
+ "def char_filter(string):\n",
+ "    latin = re.compile('[a-zA-Z]+')\n",
+ "    for char in unicodedata.normalize('NFC', string):\n",
+ "        decoded = unidecode.unidecode(char)\n",
+ "        if latin.match(decoded):\n",
+ "            yield char\n",
+ "        else:\n",
+ "            yield decoded\n",
+ "\n",
+ "\n",
+ "def clean_string(string):\n",
+ "    s = \"\".join(char_filter(string))\n",
+ "    s = \"\\n\".join((x.rstrip() for x in s.splitlines()))\n",
+ "    return s\n",
+ "\n",
+ "\n",
+ "def split_lglines_sentences(nlp, text, split_on_len=200):\n",
+ "    lines = text.splitlines()\n",
+ "    for line in lines:\n",
+ "        if len(line) < split_on_len:\n",
+ "            yield line.strip()\n",
+ "            continue\n",
+ "        doc = nlp(line)\n",
+ "        assert doc.has_annotation(\"SENT_START\")\n",
+ "        spacy_sents = [str(x).strip() for x in doc.sents]\n",
+ "        if len(spacy_sents) == 1:\n",
+ "            yield spacy_sents[0]\n",
+ "            continue\n",
+ "        # spaCy's Japanese sentence splitting can over-split; merge short fragments back together\n",
+ "        sents = []\n",
+ "        for sent in spacy_sents:\n",
+ "            if (len(sent) < 4) and (len(sents) > 0) and (len(sents[-1]) == 0 or sents[-1][-1] != '.'):\n",
+ "                sents[-1] += sent\n",
+ "            else:\n",
+ "                sents.append(sent)\n",
+ "        yield from (x for x in sents if not DOTS_REGEX.match(x))\n",
+ "\n",
+ "\n",
+ "def translate_m2m(translator, tokenizer: NllbTokenizerFast, device, pars, verbose: bool = False):\n",
+ "    en_pars = []\n",
+ "    pars_it = tqdm(pars, leave=False, smoothing=0.0) if verbose else pars\n",
+ "    for line in pars_it:\n",
+ "        if line.strip() == \"\":\n",
+ "            en_pars.append(\"\")\n",
+ "            continue\n",
+ "        inputs = tokenizer(f\"{line}\", return_tensors=\"pt\")\n",
+ "        inputs = {k: v.to(device) for (k, v) in inputs.items()}\n",
+ "        generated_tokens = translator.generate(\n",
+ "            **inputs,\n",
+ "            forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],\n",
+ "            max_new_tokens=512,\n",
+ "            no_repeat_ngram_size=4,\n",
+ "        ).cpu()\n",
+ "        with tokenizer.as_target_tokenizer():\n",
+ "            outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)\n",
+ "        en_pars.extend(outputs)\n",
+ "    return en_pars\n",
+ "\n",
+ "\n",
+ "translate = partial(translate_m2m, model, tokenizer, DEVICE)\n",
+ "\n",
+ "\n",
+ "def translate_long_text(text: str):\n",
+ "    lines = split_lglines_sentences(nlp_ja, text, split_on_len=150)\n",
+ "    with torch.no_grad():\n",
+ "        with autocast(dtype=torch.float16):\n",
+ "            en_lines = translate([clean_string(x).strip() for x in lines], verbose=True)\n",
+ "    for en_line in en_lines:\n",
+ "        print(en_line)"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "zPFc9VP0k4_y"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Run this to translate the text\n",
+ "\n",
+ "#@markdown Enter the Japanese text into the box on the left between the three quotation marks (\"\"\").\n",
+ "#@markdown Make sure there is no text on the lines containing the three quotes.\n",
+ "#@markdown See the example text for an idea of the formatting required.\n",
+ "\n",
+ "text = \"\"\"\n",
+ "本須もとす麗乃うらのは本が好きだ。\n",
+ "\n",
+ "心理学、宗教、歴史、地理、教育学、民俗学、数学、物理、地学、化学、生物学、芸術、体育、言語、物語……人類の知識がぎっちり詰め込まれた本を心の底から愛している。\n",
+ "\n",
+ "様々な知識が一冊にまとめられている本を読むと、とても得をした気分になれるし、自分がこの目で見たことがない世界を、本屋や図書館に並ぶ写真集を通して見るのも、世界が広がっていくようで陶酔できる。\n",
+ "\n",
+ "外国の古い物語だって、違う時代の、違う国の風習が垣間見えて趣深いし、あらゆる分野において歴史があり、それを紐解いていけば、時間を忘れるなんていつものことである。\n",
+ "\n",
+ "麗乃は、図書館の古い本が集められている書庫の、古い本独特の少々黴かび臭い匂いや埃っぽい匂いが好きで、図書館に行くとわざわざ書庫に入り込む。そこでゆっくりと古い匂いのする空気を吸い込み、年を経た本を見回せば、麗乃はそれだけで嬉しくなって、興奮してしまう。\n",
+ "\"\"\"[1:-1]\n",
+ "\n",
+ "translate_long_text(text)"
+ ],
+ "metadata": {
+ "id": "Rwv_rO9plAsj"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#@title Submit corrected sentences to improve the model!\n",
+ "#@markdown If you encounter poorly translated sentences with the wrong name or term, please correct them!\n",
+ "#@markdown You can use other translation sites (like [DeepL](https://www.deepl.com/translator))\n",
+ "#@markdown to make sure the Japanese and English sentences match.\n",
+ "\n",
+ "#@markdown Then run this cell and message [u/thefrigidliquidation](https://www.reddit.com/user/thefrigidliquidation/)\n",
+ "#@markdown on reddit with this cell's output.\n",
+ "\n",
+ "import base64\n",
+ "import json\n",
+ "\n",
+ "\n",
+ "\n",
+ "ja_sent = 'The Japanese sentence.' #@param {type:\"string\"}\n",
+ "en_sent = 'The corrected English sentence.' #@param {type:\"string\"}\n",
+ "\n",
+ "df = {'translation': {'en': en_sent, 'ja': ja_sent}}\n",
+ "df_json = json.dumps(df)\n",
+ "\n",
+ "print(base64.b64encode(df_json.encode('ascii')).decode('ascii'))\n"
+ ],
+ "metadata": {
+ "cellView": "form",
+ "id": "0yx9hnj6yBKA"
+ },
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+ }
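The correction payload produced by the notebook's last cell is plain base64-encoded JSON; `json.dumps` escapes non-ASCII characters as `\uXXXX`, which is why encoding the result as ASCII is safe. A minimal sketch of the round trip, using an illustrative sentence pair rather than a real submission:

```python
import base64
import json

# Encode a corrected sentence pair the same way the notebook's last cell does.
# json.dumps uses ensure_ascii=True by default, so the Japanese text is
# escaped as \uXXXX and the string can be encoded as ASCII.
pair = {"translation": {"en": "Urano Motosu loves books.", "ja": "麗乃は本が好きだ。"}}
encoded = base64.b64encode(json.dumps(pair).encode("ascii")).decode("ascii")

# Decoding on the receiving side is the exact reverse.
decoded = json.loads(base64.b64decode(encoded).decode("ascii"))
assert decoded == pair
```

Base64 keeps the payload a single copy-paste-safe ASCII token, so the Japanese/English pair survives being pasted into a reddit message unmangled.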