See examples in [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "l929HimrxS0a" }, "source": [ "## Retrieve your data & make a parallel corpus\n", "\n", "If you want to use the JW300 data referenced on the Masakhane website or in our GitHub repo, you can use `opus-tools` to convert the data into a convenient format. `opus_read` from that package reads the native aligned XML files and converts them to TMX format. The tool can also fetch the relevant files from OPUS on the fly and filter the data as necessary. [Read the documentation](https://pypi.org/project/opustools-pkg/) for more details.\n", "\n", "Once you have your corpus files in TMX format (an XML structure that holds the sentences in your source language and your target language in a single file), we recommend reading them into a pandas DataFrame. Thankfully, Jade wrote a silly `tmx2dataframe` package which converts your TMX file to a pandas DataFrame. 
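
To make the TMX step more concrete, here is a minimal sketch that reads sentence pairs from a TMX file using only the Python standard library. It is not the `tmx2dataframe` API: the helper name `read_tmx_pairs` is hypothetical, and the sketch assumes a plain TMX layout (`tu` units containing `tuv` variants, each with one `seg`) without element namespaces.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def read_tmx_pairs(tmx_path, source_lang, target_lang):
    '''Return (source, target) sentence pairs from a TMX file.

    Hypothetical helper for illustration; not part of tmx2dataframe.
    '''
    pairs = []
    root = ET.parse(tmx_path).getroot()
    # Each <tu> (translation unit) holds one <tuv> variant per language.
    for tu in root.iter('tu'):
        segments = {}
        for tuv in tu.iter('tuv'):
            # Assumption: the language is given by xml:lang, or by a plain
            # lang attribute in some files.
            lang = tuv.get(XML_LANG) or tuv.get('lang')
            seg = tuv.find('seg')
            if lang is not None and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = ''.join(seg.itertext()).strip()
        # Keep only units that contain both languages.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs
```

The returned list can be passed to `pd.DataFrame(pairs, columns=['source_sentence', 'target_sentence'])` to get the column layout used later in this notebook.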
" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "oGRmDELn7Az0", "outputId": "56d0cd12-61f6-4a4b-d2b8-74abfff2f815", "colab": { "base_uri": "https://localhost:8080/", "height": 127 } }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "text": [ "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n", "\n", "Enter your authorization code:\n", "··········\n", "Mounted at /content/drive\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "Cn3tgQLzUxwn", "colab": {} }, "source": [ "# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:\n", "# These will also become the suffixes of all vocab and corpus files used throughout\n", "import os\n", "source_language = \"yo\"\n", "target_language = \"en\"\n", "lc = False # If True, lowercase the data.\n", "seed = 42 # Random seed for shuffling.\n", "tag = \"baseline\" # Give a unique name to your folder - this is to ensure you don't overwrite any models you've already submitted\n", "\n", "os.environ[\"src\"] = source_language # Sets them in bash as well, since we often use bash scripts\n", "os.environ[\"tgt\"] = target_language\n", "os.environ[\"tag\"] = tag\n", "\n", "# This will save it to a folder in our Google Drive instead! 
\n", "!mkdir -p \"/content/drive/My Drive/masakhane/$src-$tgt-$tag\"\n", "g_drive_path = \"/content/drive/My Drive/masakhane/%s-%s-%s\" % (source_language, target_language, tag)\n", "os.environ[\"gdrive_path\"] = g_drive_path\n", "models_path = '%s/models/%s%s_transformer'% (g_drive_path, source_language, target_language)\n", "# model temporary directory for training\n", "model_temp_dir = \"/content/drive/My Drive/masakhane/model-temp\"\n", "# model permanent storage on the drive\n", "!mkdir -p \"$gdrive_path/models/${src}${tgt}_transformer/\"" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "kBSgJHEw7Nvx", "outputId": "6e14d6cf-9290-4fc3-cb32-3deb7224255a", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "!echo $gdrive_path" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/content/drive/My Drive/masakhane/yo-en-baseline\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "gA75Fs9ys8Y9", "outputId": "83fb02a5-bd6b-4b83-b155-078ae07ded95", "colab": { "base_uri": "https://localhost:8080/", "height": 102 } }, "source": [ "#TODO: Skip for retrain\n", "# Install opus-tools\n", "! 
pip install opustools-pkg " ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Collecting opustools-pkg\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/6c/9f/e829a0cceccc603450cd18e1ff80807b6237a88d9a8df2c0bb320796e900/opustools_pkg-0.0.52-py3-none-any.whl (80kB)\n", "\u001b[K |████████████████████████████████| 81kB 9.5MB/s \n", "\u001b[?25hInstalling collected packages: opustools-pkg\n", "Successfully installed opustools-pkg-0.0.52\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "xq-tDZVks7ZD", "outputId": "12ccab59-4d5d-4e28-8163-4a39fe17f233", "colab": { "base_uri": "https://localhost:8080/", "height": 221 } }, "source": [ "#TODO: Skip for retrain\n", "# Downloading our corpus\n", "! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q\n", "\n", "# extract the corpus file\n", "! gunzip JW300_latest_xml_$src-$tgt.xml.gz" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "\n", "Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-yo.xml.gz not found. The following files are available for downloading:\n", "\n", " 4 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-yo.xml.gz\n", " 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip\n", " 58 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/yo.zip\n", "\n", " 325 MB Total size\n", "./JW300_latest_xml_en-yo.xml.gz ... 100% of 4 MB\n", "./JW300_latest_xml_en.zip ... 100% of 263 MB\n", "./JW300_latest_xml_yo.zip ... 100% of 58 MB\n", "gzip: JW300_latest_xml_yo-en.xml.gz: No such file or directory\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "doraUei7W31h", "colab_type": "code", "colab": {} }, "source": [ "# extract the corpus file\n", "! 
gunzip JW300_latest_xml_$tgt-$src.xml.gz" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "n48GDRnP8y2G", "colab_type": "code", "outputId": "4f30b67f-6d04-4dd9-fa2a-859eb5a076b2", "colab": { "base_uri": "https://localhost:8080/", "height": 578 } }, "source": [ "#TODO: Skip for retrain\n", "# Download the global test set.\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en\n", " \n", "# And the specific test set for this language pair.\n", "os.environ[\"trg\"] = target_language \n", "os.environ[\"src\"] = source_language \n", "\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$src.en \n", "! mv test.en-$src.en test.en\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$src.$src \n", "! mv test.en-$src.$src test.$src" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "--2020-04-07 20:43:18-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)...,,, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)||:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 277791 (271K) [text/plain]\n", "Saving to: ‘test.en-any.en’\n", "\n", "\rtest.en-any.en 0%[ ] 0 --.-KB/s \rtest.en-any.en 100%[===================>] 271.28K --.-KB/s in 0.008s \n", "\n", "2020-04-07 20:43:19 (32.5 MB/s) - ‘test.en-any.en’ saved [277791/277791]\n", "\n", "--2020-04-07 20:43:22-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-yo.en\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)...,,, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)||:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 201994 (197K) [text/plain]\n", "Saving to: ‘test.en-yo.en’\n", "\n", "test.en-yo.en 100%[===================>] 197.26K --.-KB/s in 0.005s \n", "\n", "2020-04-07 20:43:22 (36.5 MB/s) - ‘test.en-yo.en’ saved [201994/201994]\n", "\n", "--2020-04-07 20:43:27-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-yo.yo\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)...,,, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)||:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 280073 (274K) [text/plain]\n", "Saving to: ‘test.en-yo.yo’\n", "\n", "test.en-yo.yo 100%[===================>] 273.51K --.-KB/s in 0.007s \n", "\n", "2020-04-07 20:43:29 (36.3 MB/s) - ‘test.en-yo.yo’ saved [280073/280073]\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "NqDG-CI28y2L", "colab_type": "code", "outputId": "6a52eb39-aff5-41b8-d427-aa128cf73e76", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "#TODO: Skip for retrain\n", "# Read the test data to filter from train and dev splits.\n", "# Store the English portion in a set for quick filtering checks.\n", "en_test_sents = set()\n", "filter_test_sents = \"test.en-any.en\"\n", "j = 0\n", "with open(filter_test_sents) as f:\n", "    for line in f:\n", "        en_test_sents.add(line.strip())\n", "        j += 1\n", "print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Loaded 3571 global test sentences to filter from the training/dev data.\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3CNdwLBCfSIl", "outputId": "c7e16fbb-ba42-41b2-b73b-49df100de222", "colab": { "base_uri": "https://localhost:8080/", "height": 159 } }, "source": [ "#TODO: Skip for retrain\n", "import pandas as pd\n", "\n", "# TMX file to 
dataframe\n", "source_file = 'jw300.' + source_language\n", "target_file = 'jw300.' + target_language\n", "\n", "source = []\n", "target = []\n", "skip_lines = set() # Collect the line numbers skipped in the source portion so the same lines are skipped in the target portion.\n", "with open(source_file) as f:\n", "    for i, line in enumerate(f):\n", "        # Skip sentences that are contained in the test set.\n", "        if line.strip() not in en_test_sents:\n", "            source.append(line.strip())\n", "        else:\n", "            skip_lines.add(i)\n", "with open(target_file) as f:\n", "    for j, line in enumerate(f):\n", "        # Only add to corpus if corresponding source was not skipped.\n", "        if j not in skip_lines:\n", "            target.append(line.strip())\n", "\n", "print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))\n", "\n", "df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])\n", "# If you get `TypeError: data argument can't be an iterator`, wrap the zip in a list by running the line below instead.\n", "#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])\n", "df.head(3)" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Loaded data and skipped 1025/474986 lines since contained in test set.\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>source_sentence</th>\n", "      <th>target_sentence</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>Lílo Àkàbà — Ǹjẹ́ O Máa Ń Ṣe Àyẹ̀wò Wọ̀nyí Tó...</td>\n", "      <td>Using Ladders — Do You Make These Safety Checks ?</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>Látọwọ́ akọ̀ròyìn Jí !</td>\n", "      <td>By Awake !</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>ní Ireland</td>\n", "      <td>correspondent in Ireland</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " source_sentence target_sentence\n", "0 Lílo Àkàbà — Ǹjẹ́ O Máa Ń Ṣe Àyẹ̀wò Wọ̀nyí Tó... Using Ladders — Do You Make These Safety Checks ?\n", "1 Látọwọ́ akọ̀ròyìn Jí ! By Awake !\n", "2 ní Ireland correspondent in Ireland" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YkuK3B4p2AkN" }, "source": [ "## Pre-processing and export\n", "\n", "It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, public corpora such as this one contain a number of both, and they need to be cleaned out.\n", "\n", "In addition, we will split our data into dev/test/train sets and export them to the filesystem." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "M_2ouEOH1_1q", "outputId": "06f01841-5cdf-4103-bc12-fb7bce7bc4d9", "colab": { "base_uri": "https://localhost:8080/", "height": 187 } }, "source": [ "#TODO: Skip for retrain\n", "# drop duplicate translations\n", "df_pp = df.drop_duplicates()\n", "\n", "# drop conflicting translations\n", "# (this is optional and something that you might want to comment out\n", "# depending on the size of your corpus)\n", "df_pp.drop_duplicates(subset='source_sentence', inplace=True)\n", "df_pp.drop_duplicates(subset='target_sentence', inplace=True)\n", "\n", "# Shuffle the data to remove bias in dev set selection.\n", "df_pp = df_pp.sample(frac=1, random_state=seed).reset_index(drop=True)" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " \n", "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: \n", "A value is 
trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " import sys\n" ], "name": "stderr" } ] }, { "cell_type": "code", "metadata": { "id": "Z_1BwAApEtMk", "colab_type": "code", "outputId": "bf6ad039-dd32-4f7b-9441-cf00e769325c", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 } }, "source": [ "#TODO: Skip for retrain\n", "# Install fuzzywuzzy to remove \"almost duplicate\" sentences in the\n", "# test and training sets.\n", "! pip install fuzzywuzzy\n", "! pip install python-Levenshtein\n", "import time\n", "from fuzzywuzzy import process\n", "import numpy as np\n", "\n", "# reset the index of the training set after previous filtering\n", "df_pp.reset_index(drop=False, inplace=True)\n", "\n", "# Remove samples from the training data set if they \"almost overlap\" with the\n", "# samples in the test set.\n", "\n", "# Filtering function. Adjust pad to narrow down the candidate matches to\n", "# within a certain length of characters of the given sample.\n", "def fuzzfilter(sample, candidates, pad):\n", "    candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad]\n", "    if len(candidates) > 0:\n", "        return process.extractOne(sample, candidates)[1]\n", "    else:\n", "        return np.nan\n", "\n", "# NOTE - This might run slowly depending on the size of your training set. We are\n", "# printing some information to help you track how long it will take. 
\n", "scores = []\n", "start_time = time.time()\n", "for idx, row in df_pp.iterrows():\n", "    scores.append(fuzzfilter(row['source_sentence'], list(en_test_sents), 5))\n", "    if idx % 1000 == 0:\n", "        hours, rem = divmod(time.time() - start_time, 3600)\n", "        minutes, seconds = divmod(rem, 60)\n", "        print(\"{:0>2}:{:0>2}:{:05.2f}\".format(int(hours),int(minutes),seconds), \"%0.2f percent complete\" % (100.0*float(idx)/float(len(df_pp))))\n", "\n", "# Filter out \"almost overlapping samples\"\n", "df_pp['scores'] = scores\n", "df_pp = df_pp[df_pp['scores'] < 95]" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Collecting fuzzywuzzy\n", " Downloading https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl\n", "Installing collected packages: fuzzywuzzy\n", "Successfully installed fuzzywuzzy-0.18.0\n", "Collecting python-Levenshtein\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/42/a9/d1785c85ebf9b7dfacd08938dd028209c34a0ea3b1bcdb895208bd40a67d/python-Levenshtein-0.12.0.tar.gz (48kB)\n", "\u001b[K |████████████████████████████████| 51kB 8.7MB/s \n", "\u001b[?25hRequirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from python-Levenshtein) (46.1.3)\n", "Building wheels for collected packages: python-Levenshtein\n", " Building wheel for python-Levenshtein (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", " Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.0-cp36-cp36m-linux_x86_64.whl size=144794 sha256=0233a2657a58078318fa67995ba97920d36a754bc5c5ae6bd1044938e414d250\n", " Stored in directory: /root/.cache/pip/wheels/de/c2/93/660fd5f7559049268ad2dc6d81c4e39e9e36518766eaf7e342\n", "Successfully built python-Levenshtein\n", "Installing collected packages: python-Levenshtein\n", "Successfully installed python-Levenshtein-0.12.0\n", "00:00:00.16 0.00 percent complete\n", "00:00:23.21 0.24 percent complete\n", "00:00:46.27 0.48 percent complete\n", "00:01:09.18 0.71 percent complete\n", "00:01:31.34 0.95 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '↓']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:01:54.26 1.19 percent complete\n", "00:02:17.83 1.43 percent complete\n", "00:02:40.41 1.66 percent complete\n", "00:03:03.84 1.90 percent complete\n", "00:03:26.18 2.14 percent complete\n", "00:03:49.14 2.38 percent complete\n", "00:04:11.64 2.61 percent complete\n", "00:04:34.39 2.85 percent complete\n", "00:04:58.78 3.09 percent complete\n", "00:05:20.46 3.33 percent complete\n", "00:05:43.36 3.56 percent complete\n", "00:06:06.61 3.80 percent complete\n", "00:06:28.24 4.04 percent complete\n", "00:06:50.05 4.28 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '↑ ↑']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:07:14.21 4.51 percent complete\n", "00:07:36.24 4.75 percent complete\n", "00:07:58.82 4.99 percent complete\n", "00:08:20.53 5.23 percent complete\n", "00:08:42.12 5.46 percent complete\n", "00:09:05.54 5.70 percent complete\n", "00:09:27.72 5.94 percent complete\n", "00:09:49.86 6.18 percent complete\n", "00:10:14.16 6.41 percent complete\n", "00:10:37.63 6.65 percent complete\n", "00:10:58.94 6.89 percent complete\n", "00:11:21.80 7.13 percent complete\n", "00:11:45.94 7.36 percent complete\n", "00:12:11.02 7.60 percent complete\n", "00:12:34.00 7.84 percent complete\n", "00:12:55.73 8.08 percent complete\n", "00:13:18.24 8.31 percent complete\n", "00:13:41.73 8.55 percent complete\n", "00:14:04.32 8.79 percent complete\n", "00:14:28.38 9.03 percent complete\n", "00:14:50.08 9.27 percent complete\n", "00:15:12.34 9.50 percent complete\n", "00:15:36.21 9.74 percent complete\n", "00:15:59.59 9.98 percent complete\n", "00:16:21.11 10.22 percent complete\n", "00:16:44.50 10.45 percent complete\n", "00:17:07.51 10.69 percent complete\n", "00:17:31.10 10.93 percent complete\n", "00:17:53.86 11.17 percent complete\n", "00:18:16.59 11.40 percent complete\n", "00:18:38.77 11.64 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '”']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:19:00.18 11.88 percent complete\n", "00:19:23.67 12.12 percent complete\n", "00:19:46.78 12.35 percent complete\n", "00:20:09.21 12.59 percent complete\n", "00:20:31.55 12.83 percent complete\n", "00:20:54.99 13.07 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '․ ․']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:21:16.35 13.30 percent complete\n", "00:21:40.35 13.54 percent complete\n", "00:22:02.52 13.78 percent complete\n", "00:22:26.25 14.02 percent complete\n", "00:22:48.93 14.25 percent complete\n", "00:23:10.91 14.49 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:23:33.74 14.73 percent complete\n", "00:23:56.98 14.97 percent complete\n", "00:24:18.77 15.20 percent complete\n", "00:24:42.38 15.44 percent complete\n", "00:25:06.24 15.68 percent complete\n", "00:25:29.38 15.92 percent complete\n", "00:25:51.90 16.15 percent complete\n", "00:26:14.36 16.39 percent complete\n", "00:26:38.03 16.63 percent complete\n", "00:27:00.03 16.87 percent complete\n", "00:27:22.03 17.10 percent complete\n", "00:27:46.56 17.34 percent complete\n", "00:28:08.08 17.58 percent complete\n", "00:28:31.27 17.82 percent complete\n", "00:28:53.52 18.05 percent complete\n", "00:29:15.49 18.29 percent complete\n", "00:29:36.54 18.53 percent complete\n", "00:29:57.27 18.77 percent complete\n", "00:30:19.39 19.01 percent complete\n", "00:30:41.44 19.24 percent complete\n", "00:31:02.77 19.48 percent complete\n", "00:31:23.43 19.72 percent complete\n", "00:31:45.61 19.96 percent complete\n", "00:32:06.92 20.19 percent complete\n", "00:32:28.40 20.43 percent complete\n", "00:32:50.66 20.67 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '. 
.']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:33:12.62 20.91 percent complete\n", "00:33:35.20 21.14 percent complete\n", "00:33:57.20 21.38 percent complete\n", "00:34:18.99 21.62 percent complete\n", "00:34:41.33 21.86 percent complete\n", "00:35:02.17 22.09 percent complete\n", "00:35:23.71 22.33 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '□ ․ ․ ․ ․ ․']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:35:45.05 22.57 percent complete\n", "00:36:06.62 22.81 percent complete\n", "00:36:28.70 23.04 percent complete\n", "00:36:49.11 23.28 percent complete\n", "00:37:11.26 23.52 percent complete\n", "00:37:32.83 23.76 percent complete\n", "00:37:54.10 23.99 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '⇩ ⇩']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:38:15.52 24.23 percent complete\n", "00:38:37.19 24.47 percent complete\n", "00:38:58.69 24.71 percent complete\n", "00:39:21.27 24.94 percent complete\n", "00:39:41.91 25.18 percent complete\n", "00:40:04.18 25.42 percent complete\n", "00:40:25.68 25.66 percent complete\n", "00:40:46.67 25.89 percent complete\n", "00:41:07.00 26.13 percent complete\n", "00:41:28.87 26.37 percent complete\n", "00:41:49.80 26.61 percent complete\n", "00:42:11.61 26.84 percent complete\n", "00:42:32.93 27.08 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '▾ ▾ ▾ ▾']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:42:54.64 27.32 percent complete\n", "00:43:15.73 27.56 percent complete\n", "00:43:36.50 27.80 percent complete\n", "00:43:57.98 28.03 percent complete\n", "00:44:20.30 28.27 percent complete\n", "00:44:41.95 28.51 percent complete\n", "00:45:02.21 28.75 percent complete\n", "00:45:23.14 28.98 percent complete\n", "00:45:44.47 29.22 percent complete\n", "00:46:05.40 29.46 percent complete\n", "00:46:26.27 29.70 percent complete\n", "00:46:47.76 29.93 percent complete\n", "00:47:09.16 30.17 percent complete\n", "00:47:30.80 30.41 percent complete\n", "00:47:51.82 30.65 percent complete\n", "00:48:12.29 30.88 percent complete\n", "00:48:33.14 31.12 percent complete\n", "00:48:54.88 31.36 percent complete\n", "00:49:15.91 31.60 percent complete\n", "00:49:38.08 31.83 percent complete\n", "00:49:58.85 32.07 percent complete\n", "00:50:18.80 32.31 percent complete\n", "00:50:40.71 32.55 percent complete\n", "00:51:02.48 32.78 percent complete\n", "00:51:24.31 33.02 percent complete\n", "00:51:44.98 33.26 percent complete\n", "00:52:05.36 33.50 percent complete\n", "00:52:26.49 33.73 percent complete\n", "00:52:47.55 33.97 percent complete\n", "00:53:07.69 34.21 percent complete\n", "00:53:28.41 34.45 percent complete\n", "00:53:50.19 34.68 percent complete\n", "00:54:10.86 34.92 percent complete\n", "00:54:32.19 35.16 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '” *']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "00:54:53.47 35.40 percent complete\n", "00:55:15.29 35.63 percent complete\n", "00:55:35.80 35.87 percent complete\n", "00:55:56.41 36.11 percent complete\n", "00:56:16.88 36.35 percent complete\n", "00:56:38.90 36.59 percent complete\n", "00:57:00.41 36.82 percent complete\n", "00:57:22.65 37.06 percent complete\n", "00:57:42.91 37.30 percent complete\n", "00:58:04.02 37.54 percent complete\n", "00:58:25.08 37.77 percent complete\n", "00:58:46.45 38.01 percent complete\n", "00:59:07.79 38.25 percent complete\n", "00:59:29.68 38.49 percent complete\n", "00:59:51.17 38.72 percent complete\n", "01:00:13.82 38.96 percent complete\n", "01:00:34.39 39.20 percent complete\n", "01:00:54.94 39.44 percent complete\n", "01:01:16.20 39.67 percent complete\n", "01:01:37.07 39.91 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '*']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "01:01:57.74 40.15 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '→ ․ ․ ․ ․ ․']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "01:02:19.83 40.39 percent complete\n", "01:02:41.15 40.62 percent complete\n", "01:03:01.95 40.86 percent complete\n", "01:03:22.18 41.10 percent complete\n", "01:03:43.12 41.34 percent complete\n", "01:04:04.92 41.57 percent complete\n", "01:04:25.17 41.81 percent complete\n", "01:04:45.75 42.05 percent complete\n", "01:05:08.14 42.29 percent complete\n", "01:05:29.03 42.52 percent complete\n", "01:05:50.39 42.76 percent complete\n", "01:06:11.45 43.00 percent complete\n", "01:06:31.60 43.24 percent complete\n", "01:06:52.64 43.47 percent complete\n", "01:07:13.99 43.71 percent complete\n", "01:07:34.49 43.95 percent complete\n", "01:07:54.70 44.19 percent complete\n", "01:08:16.02 44.42 percent complete\n", "01:08:36.91 44.66 percent complete\n", "01:08:57.70 44.90 percent complete\n", "01:09:18.69 45.14 percent complete\n", "01:09:40.28 45.37 percent complete\n", "01:10:00.90 45.61 percent complete\n", "01:10:21.96 45.85 percent complete\n", "01:10:43.41 46.09 percent complete\n", "01:11:05.41 46.33 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '↓ ↓ ↓ ↓']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "01:11:26.97 46.56 percent complete\n", "01:11:48.78 46.80 percent complete\n", "01:12:10.30 47.04 percent complete\n", "01:12:31.49 47.28 percent complete\n", "01:12:52.69 47.51 percent complete\n", "01:13:13.70 47.75 percent complete\n", "01:13:35.44 47.99 percent complete\n", "01:13:57.51 48.23 percent complete\n", "01:14:18.13 48.46 percent complete\n", "01:14:39.48 48.70 percent complete\n", "01:14:59.74 48.94 percent complete\n", "01:15:20.85 49.18 percent complete\n", "01:15:41.20 49.41 percent complete\n", "01:16:02.37 49.65 percent complete\n", "01:16:22.81 49.89 percent complete\n", "01:16:44.21 50.13 percent complete\n", "01:17:05.71 50.36 percent complete\n", "01:17:25.84 50.60 percent complete\n", "01:17:46.60 50.84 percent complete\n", "01:18:08.16 51.08 percent complete\n", "01:18:28.46 51.31 percent complete\n", "01:18:49.29 51.55 percent complete\n", "01:19:09.98 51.79 percent complete\n", "01:19:32.26 52.03 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '↑ ↑ ↑']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "01:19:53.50 52.26 percent complete\n", "01:20:13.37 52.50 percent complete\n", "01:20:34.80 52.74 percent complete\n", "01:20:56.58 52.98 percent complete\n", "01:21:18.52 53.21 percent complete\n", "01:21:40.49 53.45 percent complete\n", "01:22:01.68 53.69 percent complete\n", "01:22:23.90 53.93 percent complete\n", "01:22:44.35 54.16 percent complete\n", "01:23:04.80 54.40 percent complete\n", "01:23:24.84 54.64 percent complete\n", "01:23:45.69 54.88 percent complete\n", "01:24:07.79 55.12 percent complete\n", "01:24:29.13 55.35 percent complete\n", "01:24:50.70 55.59 percent complete\n", "01:25:11.89 55.83 percent complete\n", "01:25:32.91 56.07 percent complete\n", "01:25:53.99 56.30 percent complete\n", "01:26:14.87 56.54 percent complete\n", "01:26:35.53 56.78 percent complete\n", "01:26:57.28 57.02 percent complete\n", "01:27:18.33 57.25 percent complete\n", "01:27:39.39 57.49 percent complete\n", "01:28:02.07 57.73 percent complete\n", "01:28:23.27 57.97 percent complete\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. 
[Query: '․ ․ ․']\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "02:26:23.14 97.16 percent complete\n", "02:26:44.48 97.40 percent complete\n", "02:27:05.88 97.64 percent complete\n", "02:27:26.90 97.88 percent complete\n", "02:27:47.24 98.11 percent complete\n", "02:28:07.91 98.35 percent complete\n", "02:28:28.42 98.59 percent complete\n", "02:28:50.20 98.83 percent complete\n", "02:29:11.01 99.06 percent complete\n", "02:29:31.39 99.30 percent complete\n", "02:29:51.79 99.54 percent complete\n", "02:30:12.57 99.78 percent complete\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "hxxBOCA-xXhy", "outputId": "84f5c76d-fa6c-445a-f120-cf782de13db4", "colab": { "base_uri": "https://localhost:8080/", "height": 799 } }, "source": [ "# TODO: Skip for retrain\n", "# This section splits the parallel corpus into train and dev sets, then saves them as separate files.\n", "# We hold out the last 1000 sentence pairs as the dev set and use the given test set.\n", "import csv\n", "\n", "# Do the split between dev/train and create parallel corpora\n", "num_dev_patterns = 1000\n", "\n", "# Optional: lowercase the corpora. This makes it easier to generalize, but the model loses proper casing.\n", "if lc: # Julia: making lowercasing optional\n", " df_pp[\"source_sentence\"] = df_pp[\"source_sentence\"].str.lower()\n", " df_pp[\"target_sentence\"] = df_pp[\"target_sentence\"].str.lower()\n", "\n", "# Julia: test sets are already generated\n", "dev = df_pp.tail(num_dev_patterns) # Herman: Error in original\n", "stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)\n", "\n", "# Write one sentence per line; UTF-8 is set explicitly so diacritics are preserved on any platform.\n", "with open(\"train.\"+source_language, \"w\", encoding=\"utf-8\") as src_file, open(\"train.\"+target_language, \"w\", encoding=\"utf-8\") as trg_file:\n", " for index, row in stripped.iterrows():\n", " src_file.write(row[\"source_sentence\"]+\"\\n\")\n", " trg_file.write(row[\"target_sentence\"]+\"\\n\")\n", " \n", "with open(\"dev.\"+source_language, \"w\", encoding=\"utf-8\") as src_file, open(\"dev.\"+target_language, \"w\", encoding=\"utf-8\") as 
trg_file:\n", " for index, row in dev.iterrows():\n", " src_file.write(row[\"source_sentence\"]+\"\\n\")\n", " trg_file.write(row[\"target_sentence\"]+\"\\n\")\n", "\n", "#stripped[[\"source_sentence\"]].to_csv(\"train.\"+source_language, header=False, index=False) # Herman: Added `header=False` everywhere\n", "#stripped[[\"target_sentence\"]].to_csv(\"train.\"+target_language, header=False, index=False) # Julia: Problematic handling of quotation marks.\n", "\n", "#dev[[\"source_sentence\"]].to_csv(\"dev.\"+source_language, header=False, index=False)\n", "#dev[[\"target_sentence\"]].to_csv(\"dev.\"+target_language, header=False, index=False)\n", "\n", "# Double-check the format below. There should be no extra quotation marks or weird characters.\n", "! head train.*\n", "! head dev.*" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "==> train.en <==\n", "It identifies us as Jesus ’ followers and as imitators of Jehovah , the Source of love .\n", "To live honestly is to lead a better life .\n", "Exports : Oil , cocoa , coffee , cotton , wood , aluminum\n", "Your reminders are what I am fond of . ”\n", "After a meal , the pancreas responds to increases in the glucose content of the blood , releasing the proper amount of insulin\n", "Jehovah invites people of all nations to draw close to him in prayer .\n", "4 : 18 - 22 .\n", "But when the Israelites were delivered from Egyptian bondage , the prophet Moses had Joseph’s bones taken along for burial in the Promised Land .\n", "The sea was getting rougher , and the fight to stay afloat made me very tired .\n", "Joyous and Thankful Despite Loss ( N .\n", "\n", "==> train.yo <==\n", "Òun ló ń jẹ́ káwọn èèyàn dá wa mọ̀ pé ọmọlẹ́yìn Jésù ni wá àti pé à ń fara wé Jèhófà , Ọlọ́run ìfẹ́ .\n", "Sísọ bá a ṣe jẹ́ gan - an máa ń jẹ́ kí ìgbésí ayé èèyàn dára .\n", "Ohun Àmúṣọrọ̀ : Epo rọ̀bì , kòkó , kọfí , òwú , igi àti tángaran\n", "Àwọn ìránnilétí rẹ ni mo ní ìfẹ́ni fún . 
”\n", "Ẹni Tí Ara Rẹ̀ Le Ẹni Tó Ní Àrùn Ẹni Tó Ní Àrùn Àtọ̀gbẹ Oríṣi Kìíní Àtọ̀gbẹ Oríṣi Kejì\n", "Jèhófà fẹ́ kí gbogbo èèyàn orílẹ̀ - èdè sún mọ́ òun nípasẹ̀ àdúrà .\n", "4 : 18 - 22 .\n", "Àmọ́ nígbà táwọn ọmọ Ísírẹ́lì gba òmìnira kúrò lọ́wọ́ àwọn ará Íjíbítì , wòlíì Mósè ní kí wọ́n kó àwọn egungun Jósẹ́fù dání kí wọ́n lè sin ín sí Ilẹ̀ Ìlérí .\n", "Ìrugùdù òkun náà ń le sí i , àárẹ̀ sì ti mú mi gan - an bí mo ṣe ń sa gbogbo ipá mi kí n má bàa rì lọ sísàlẹ̀ .\n", "Òṣùṣù Ọwọ̀ Ni Wá ( M .\n", "==> dev.en <==\n", "On his first missionary tour , he started from Antioch , where Jesus ’ followers were first called Christians .\n", "I am alive today because of applying Bible principles\n", "That was the first resurrection of Bible record .\n", "Like the guide in our illustration , Jehovah kindly extends his helping hand and his friendship to those who seek to walk with him .\n", "18 Should the Name Jehovah Appear in the New Testament ?\n", "Many of our modern - day fellow believers have similarly demonstrated trust in Jehovah and have taken appropriate action .\n", "Continue in the Spiritual Paradise\n", "What can we learn about clothing from God’s Law to the Israelites ?\n", "Additionally , there have been extensive changes in the grammar and syntax of the language .\n", "Rather , we need to cultivate strong trust in Jehovah while taking whatever appropriate action we can .\n", "\n", "==> dev.yo <==\n", "Ìlú Áńtíókù , tá a ti kọ́kọ́ pe àwọn ọmọ ẹ̀yìn Jésù ní Kristẹni , ni Pọ́ọ̀lù ti bẹ̀rẹ̀ ìrìn àjò míṣọ́nnárì rẹ̀ àkọ́kọ́ .\n", "Torí pé mò ń tẹ̀ lé àwọn ìlànà Bíbélì ló jẹ́ kí n wà láàyè títí dòní\n", "Àjíǹde àkọ́kọ́ tí Bíbélì mẹ́nu kàn nìyẹn .\n", "Bíi ti afinimọ̀nà tá a mẹ́nu kàn nínú àpèjúwe wa yẹn , Jèhófà sọ pé òun máa ran àwọn tó bá fẹ́ bá òun rìn lọ́wọ́ , ó sì ní kí wọ́n wá bá òun dọ́rẹ̀ẹ́ .\n", "18 Ṣé Ó Yẹ Kí Orúkọ Náà Jèhófà Wà Nínú Májẹ̀mú Tuntun ?\n", "Ọ̀pọ̀ àwọn ará wa tó jẹ́ olóòótọ́ lóde òní ló ti fi hàn pé àwọn gbẹ́kẹ̀ lé Jèhófà , tí wọ́n sì tún gbé 
ìgbésẹ̀ tó yẹ .\n", "Má Ṣe Kúrò Nínú Párádísè Tẹ̀mí\n", "Kí la rí kọ́ nínú Òfin tí Ọlọ́run fún àwọn ọmọ Ísírẹ́lì nípa ìmúra ?\n", "Yàtọ̀ síyẹn , àtúnṣe kékeré kọ́ ni wọ́n ti ṣe sí gírámà àti ọ̀nà tí wọ́n ń gbà ṣètò ọ̀rọ̀ nínú àwọn èdè yìí .\n", "Kàkà bẹ́ẹ̀ , a gbọ́dọ̀ gbẹ́kẹ̀ lé Jèhófà , ká sì gbé àwọn ìgbésẹ̀ tó yẹ .\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "epeCydmCyS8X" }, "source": [ "\n", "\n", "---\n", "\n", "\n", "## Installation of JoeyNMT\n", "\n", "JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. This is Joey-NMT.\n", "2020-04-12 09:36:40,192 dev bleu: 30.48 [Beam search decoding with beam size = 5 and alpha = 1.0]\n", "2020-04-12 09:37:12,208 test bleu: 39.44 [Beam search decoding with beam size = 5 and alpha = 1.0]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "KaXDFfm-zgjK", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": 0, "outputs": [] } ] }