{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "s_qNSzzyaCbD" }, "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2023-08-11T11:07:36.887650Z", "iopub.status.busy": "2023-08-11T11:07:36.886995Z", "iopub.status.idle": "2023-08-11T11:07:36.891012Z", "shell.execute_reply": "2023-08-11T11:07:36.890336Z" }, "id": "jmjh290raIky" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "AOpGoE2T-YXS" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "ES8iTKcdPCLt" }, "source": [ "# Subword tokenizers\n", "\n", "This tutorial demonstrates how to generate a subword vocabulary from a dataset and use it to build a `text.BertTokenizer`.\n", "\n", "The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.\n", "\n", "Objective: At the end of this tutorial you'll have built a complete end-to-end wordpiece tokenizer and detokenizer from scratch, and saved it as a `saved_model` that you can load and use in this [translation tutorial](https://tensorflow.org/text/tutorials/transformer)." ] }, { "cell_type": "markdown", "metadata": { "id": "BHfrtG1YPJdR" }, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": { "id": "iIMuBnQO6ZoV" }, "source": [ "The `tensorflow_text` package includes TensorFlow implementations of many common tokenizers. This includes three subword-style tokenizers:\n", "\n", "* `text.BertTokenizer` - The `BertTokenizer` class is a higher-level interface. It includes BERT's token splitting algorithm and a `WordpieceTokenizer`. It takes **sentences** as input and returns **token-IDs**.\n", "* `text.WordpieceTokenizer` - The `WordpieceTokenizer` class is a lower-level interface. It only implements the [WordPiece algorithm](#applying_wordpiece). You must standardize and split the text into words before calling it. It takes **words** as input and returns **token-IDs**.\n", "* `text.SentencepieceTokenizer` - The `SentencepieceTokenizer` requires a more complex setup. Its initializer requires a pre-trained sentencepiece model. See the [google/sentencepiece repository](https://github.com/google/sentencepiece#train-sentencepiece-model) for instructions on how to build one of these models. 
It can accept **sentences** as input when tokenizing.\n", "\n", "This tutorial builds a Wordpiece vocabulary in a top down manner, starting from existing words. This process doesn't work for Japanese, Chinese, or Korean since these languages don't have clear multi-character units. To tokenize these languages consider using `text.SentencepieceTokenizer`, `text.UnicodeCharTokenizer` or [this approach](https://tfhub.dev/google/zh_segmentation/1). " ] }, { "cell_type": "markdown", "metadata": { "id": "swymtxpl7W7w" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:07:36.894748Z", "iopub.status.busy": "2023-08-11T11:07:36.894313Z", "iopub.status.idle": "2023-08-11T11:08:05.647249Z", "shell.execute_reply": "2023-08-11T11:08:05.646422Z" }, "id": "rJTYbk1E9QOk" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n", "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "!pip install -q -U \"tensorflow-text==2.11.*\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:05.651342Z", "iopub.status.busy": "2023-08-11T11:08:05.651073Z", "iopub.status.idle": "2023-08-11T11:08:08.314047Z", "shell.execute_reply": "2023-08-11T11:08:08.313134Z" }, "id": "XFG0NDRu5mYQ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow 2.11.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "!pip install -q tensorflow_datasets" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:08.317980Z", "iopub.status.busy": "2023-08-11T11:08:08.317735Z", "iopub.status.idle": "2023-08-11T11:08:11.244419Z", "shell.execute_reply": "2023-08-11T11:08:11.243662Z" }, "id": "JjJJyJTZYebt" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-08-11 11:08:10.432347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2023-08-11 11:08:10.432451: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", "2023-08-11 11:08:10.432460: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] } ], "source": [ "import collections\n", "import os\n", "import pathlib\n", "import re\n", "import string\n", "import sys\n", "import tempfile\n", "import time\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "import tensorflow_datasets as tfds\n", "import tensorflow_text as text\n", "import tensorflow as tf" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:11.248812Z", "iopub.status.busy": "2023-08-11T11:08:11.248023Z", "iopub.status.idle": "2023-08-11T11:08:11.251713Z", "shell.execute_reply": "2023-08-11T11:08:11.251084Z" }, "id": "QZi9RstHxO_Z" }, "outputs": [], "source": [ "tf.get_logger().setLevel('ERROR')\n", "pwd = pathlib.Path.cwd()" ] }, { "cell_type": "markdown", "metadata": { "id": "wzJbGA5N5mXr" }, "source": [ "## Download the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "kC9TeTd47j8p" }, "source": [ "Fetch the Portuguese/English translation dataset from [tfds](https://tensorflow.org/datasets):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:11.255399Z", "iopub.status.busy": "2023-08-11T11:08:11.254796Z", "iopub.status.idle": "2023-08-11T11:08:16.876922Z", "shell.execute_reply": "2023-08-11T11:08:16.876263Z" }, "id": "qDaAOTKHNy8e" }, "outputs": [], "source": [ "examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,\n", " as_supervised=True)\n", "train_examples, val_examples = examples['train'], examples['validation'] " ] }, { "cell_type": "markdown", "metadata": { "id": "5GHc3O2W8Hgg" }, "source": [ "This dataset produces Portuguese/English sentence pairs:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:16.880845Z", "iopub.status.busy": 
"2023-08-11T11:08:16.880611Z", "iopub.status.idle": "2023-08-11T11:08:17.469768Z", "shell.execute_reply": "2023-08-11T11:08:17.469027Z" }, "id": "-_ezZT8w8GqD" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Portuguese: e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .\n", "English: and when you improve searchability , you actually take away the one advantage of print , which is serendipity .\n" ] } ], "source": [ "for pt, en in train_examples.take(1):\n", " print(\"Portuguese: \", pt.numpy().decode('utf-8'))\n", " print(\"English: \", en.numpy().decode('utf-8'))" ] }, { "cell_type": "markdown", "metadata": { "id": "nNGwm45vKttj" }, "source": [ "Note a few things about the example sentences above:\n", "* They're lower case.\n", "* There are spaces around the punctuation.\n", "* It's not clear if or what unicode normalization is being used." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:17.473707Z", "iopub.status.busy": "2023-08-11T11:08:17.473026Z", "iopub.status.idle": "2023-08-11T11:08:17.505169Z", "shell.execute_reply": "2023-08-11T11:08:17.504577Z" }, "id": "Pm5Eah5F6B1I" }, "outputs": [], "source": [ "train_en = train_examples.map(lambda pt, en: en)\n", "train_pt = train_examples.map(lambda pt, en: pt)" ] }, { "cell_type": "markdown", "metadata": { "id": "VCD57yALsF0D" }, "source": [ "## Generate the vocabulary\n", "\n", "This section generates a wordpiece vocabulary from a dataset. If you already have a vocabulary file and just want to see how to build a `text.BertTokenizer` or `text.WordpieceTokenizer` tokenizer with it then you can skip ahead to the [Build the tokenizer](#build_the_tokenizer) section." ] }, { "cell_type": "markdown", "metadata": { "id": "v4CX7_KlO8lX" }, "source": [ "Note: The vocabulary generation code used in this tutorial is optimized for **simplicity**. 
If you need a more scalable solution, consider using the Apache Beam implementation available in [tools/wordpiece_vocab/generate_vocab.py](https://github.com/tensorflow/text/blob/master/tensorflow_text/tools/wordpiece_vocab/generate_vocab.py)." ] }, { "cell_type": "markdown", "metadata": { "id": "R74W3QabgWmX" }, "source": [ "The vocabulary generation code is included in the `tensorflow_text` pip package. It is not imported by default, so you need to import it manually:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:17.508754Z", "iopub.status.busy": "2023-08-11T11:08:17.508517Z", "iopub.status.idle": "2023-08-11T11:08:17.513174Z", "shell.execute_reply": "2023-08-11T11:08:17.512549Z" }, "id": "iqX1fYdpnLS2" }, "outputs": [], "source": [ "from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab" ] }, { "cell_type": "markdown", "metadata": { "id": "HaWSnj8xFgI7" }, "source": [ "The `bert_vocab.bert_vocab_from_dataset` function generates the vocabulary.\n", "\n", "There are many arguments you can set to adjust its behavior. For this tutorial, you'll mostly use the defaults. If you want to learn more about the options, first read about [the algorithm](#algorithm), and then have a look at [the code](https://github.com/tensorflow/text/blob/master/tensorflow_text/tools/wordpiece_vocab/bert_vocab_from_dataset.py).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "6gTty2Wh-dHm" }, "source": [ "Generating the vocabulary takes about 2 minutes." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:17.516747Z", "iopub.status.busy": "2023-08-11T11:08:17.516292Z", "iopub.status.idle": "2023-08-11T11:08:17.519968Z", "shell.execute_reply": "2023-08-11T11:08:17.519362Z" }, "id": "FwFzYjBy-h8W" }, "outputs": [], "source": [ "bert_tokenizer_params=dict(lower_case=True)\n", "reserved_tokens=[\"[PAD]\", \"[UNK]\", \"[START]\", \"[END]\"]\n", "\n", "bert_vocab_args = dict(\n", " # The target vocabulary size\n", " vocab_size = 8000,\n", " # Reserved tokens that must be included in the vocabulary\n", " reserved_tokens=reserved_tokens,\n", " # Arguments for `text.BertTokenizer`\n", " bert_tokenizer_params=bert_tokenizer_params,\n", " # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`\n", " learn_params={},\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:08:17.523306Z", "iopub.status.busy": "2023-08-11T11:08:17.522789Z", "iopub.status.idle": "2023-08-11T11:09:38.100721Z", "shell.execute_reply": "2023-08-11T11:09:38.099954Z" }, "id": "PMN6Lli_3sJW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 24s, sys: 2.83 s, total: 1min 27s\n", "Wall time: 1min 20s\n" ] } ], "source": [ "%%time\n", "pt_vocab = bert_vocab.bert_vocab_from_dataset(\n", " train_pt.batch(1000).prefetch(2),\n", " **bert_vocab_args\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "3Cl4d2O34gkH" }, "source": [ "Here are some slices of the resulting vocabulary." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:09:38.104184Z", "iopub.status.busy": "2023-08-11T11:09:38.103928Z", "iopub.status.idle": "2023-08-11T11:09:38.108100Z", "shell.execute_reply": "2023-08-11T11:09:38.107466Z" }, "id": "mfaPmX54FvhW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', \"'\"]\n", "['no', 'por', 'mais', 'na', 'eu', 'esta', 'muito', 'isso', 'isto', 'sao']\n", "['90', 'desse', 'efeito', 'malaria', 'normalmente', 'palestra', 'recentemente', '##nca', 'bons', 'chave']\n", "['##–', '##—', '##‘', '##’', '##“', '##”', '##⁄', '##€', '##♪', '##♫']\n" ] } ], "source": [ "print(pt_vocab[:10])\n", "print(pt_vocab[100:110])\n", "print(pt_vocab[1000:1010])\n", "print(pt_vocab[-10:])" ] }, { "cell_type": "markdown", "metadata": { "id": "owkP3wbYVQv0" }, "source": [ "Write a vocabulary file:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:09:38.111553Z", "iopub.status.busy": "2023-08-11T11:09:38.111046Z", "iopub.status.idle": "2023-08-11T11:09:38.114779Z", "shell.execute_reply": "2023-08-11T11:09:38.114205Z" }, "id": "VY6v1ThkKDyZ" }, "outputs": [], "source": [ "def write_vocab_file(filepath, vocab):\n", "  # The vocabulary contains non-ASCII tokens, so write it as UTF-8.\n", "  with open(filepath, 'w', encoding='utf-8') as f:\n", "    for token in vocab:\n", "      print(token, file=f)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:09:38.117753Z", "iopub.status.busy": "2023-08-11T11:09:38.117326Z", "iopub.status.idle": "2023-08-11T11:09:38.124338Z", "shell.execute_reply": "2023-08-11T11:09:38.123750Z" }, "id": "X_TR5U1xWvAV" }, "outputs": [], "source": [ "write_vocab_file('pt_vocab.txt', pt_vocab)" ] }, { "cell_type": "markdown", "metadata": { "id": "0ag3qcx54nii" }, "source": [ "Use `bert_vocab.bert_vocab_from_dataset` again to generate a vocabulary from the English data:" ] }, { 
"cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:09:38.127459Z", "iopub.status.busy": "2023-08-11T11:09:38.126876Z", "iopub.status.idle": "2023-08-11T11:10:33.360225Z", "shell.execute_reply": "2023-08-11T11:10:33.359413Z" }, "id": "R3cMumvHWWtl" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 59.5 s, sys: 2.2 s, total: 1min 1s\n", "Wall time: 55.2 s\n" ] } ], "source": [ "%%time\n", "en_vocab = bert_vocab.bert_vocab_from_dataset(\n", " train_en.batch(1000).prefetch(2),\n", " **bert_vocab_args\n", ")\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.363881Z", "iopub.status.busy": "2023-08-11T11:10:33.363324Z", "iopub.status.idle": "2023-08-11T11:10:33.367548Z", "shell.execute_reply": "2023-08-11T11:10:33.366890Z" }, "id": "NxOpzMd8ol5B" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', \"'\"]\n", "['as', 'all', 'at', 'one', 'people', 're', 'like', 'if', 'our', 'from']\n", "['choose', 'consider', 'extraordinary', 'focus', 'generation', 'killed', 'patterns', 'putting', 'scientific', 'wait']\n", "['##_', '##`', '##ย', '##ร', '##อ', '##–', '##—', '##’', '##♪', '##♫']\n" ] } ], "source": [ "print(en_vocab[:10])\n", "print(en_vocab[100:110])\n", "print(en_vocab[1000:1010])\n", "print(en_vocab[-10:])" ] }, { "cell_type": "markdown", "metadata": { "id": "ck3LG_f34wCs" }, "source": [ "Here are the two vocabulary files:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.370842Z", "iopub.status.busy": "2023-08-11T11:10:33.370307Z", "iopub.status.idle": "2023-08-11T11:10:33.376675Z", "shell.execute_reply": "2023-08-11T11:10:33.376105Z" }, "id": "xfc2jxPznM6H" }, "outputs": [], "source": [ "write_vocab_file('en_vocab.txt', en_vocab)" ] }, { 
"cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.379857Z", "iopub.status.busy": "2023-08-11T11:10:33.379374Z", "iopub.status.idle": "2023-08-11T11:10:33.575341Z", "shell.execute_reply": "2023-08-11T11:10:33.574189Z" }, "id": "djehfEL6Zn-I" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "en_vocab.txt pt_vocab.txt\r\n" ] } ], "source": [ "!ls *.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "Vb5ddYLTBJhk" }, "source": [ "## Build the tokenizer\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "_qgp5gvR-2tQ" }, "source": [ "The `text.BertTokenizer` can be initialized by passing the vocabulary file's path as the first argument (see the section on [tf.lookup](#tf.lookup) for other options): " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.579533Z", "iopub.status.busy": "2023-08-11T11:10:33.579240Z", "iopub.status.idle": "2023-08-11T11:10:33.593614Z", "shell.execute_reply": "2023-08-11T11:10:33.592971Z" }, "id": "gdMpt9ZEjVGu" }, "outputs": [], "source": [ "pt_tokenizer = text.BertTokenizer('pt_vocab.txt', **bert_tokenizer_params)\n", "en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)" ] }, { "cell_type": "markdown", "metadata": { "id": "BhPZafCUds86" }, "source": [ "Now you can use it to encode some text. 
Take a batch of 3 examples from the English data:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.597123Z", "iopub.status.busy": "2023-08-11T11:10:33.596621Z", "iopub.status.idle": "2023-08-11T11:10:33.945079Z", "shell.execute_reply": "2023-08-11T11:10:33.944380Z" }, "id": "NKF0QJjtUm9T" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'\n", "b'but what if it were active ?'\n", "b\"but they did n't test for curiosity .\"\n" ] } ], "source": [ "for pt_examples, en_examples in train_examples.batch(3).take(1):\n", "  for ex in en_examples:\n", "    print(ex.numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "k9OEIBWopMxW" }, "source": [ "Run the batch through the `BertTokenizer.tokenize` method. Initially, this returns a `tf.RaggedTensor` with axes `(batch, word, word-piece)`:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:33.948777Z", "iopub.status.busy": "2023-08-11T11:10:33.948275Z", "iopub.status.idle": "2023-08-11T11:10:34.005473Z", "shell.execute_reply": "2023-08-11T11:10:34.004898Z" }, "id": "AeTM81lAc8q1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15]\n", "[87, 90, 107, 76, 129, 1852, 30]\n", "[87, 83, 149, 50, 9, 56, 664, 85, 2512, 15]\n" ] } ], "source": [ "# Tokenize the examples -> (batch, word, word-piece)\n", "token_batch = en_tokenizer.tokenize(en_examples)\n", "# Merge the word and word-piece axes -> (batch, tokens)\n", "token_batch = token_batch.merge_dims(-2,-1)\n", "\n", "for ex in token_batch.to_list():\n", "  print(ex)" ] }, { "cell_type": "markdown", "metadata": { "id": "UbdIaW6kX8hu" }, "source": [ "If you replace the 
token IDs with their text representations (using `tf.gather`) you can see that in the first example the words `\"searchability\"` and `\"serendipity\"` have been decomposed into `\"search ##ability\"` and `\"s ##ere ##nd ##ip ##ity\"`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.008883Z", "iopub.status.busy": "2023-08-11T11:10:34.008433Z", "iopub.status.idle": "2023-08-11T11:10:34.060789Z", "shell.execute_reply": "2023-08-11T11:10:34.060215Z" }, "id": "FA6nKYx5U3Nj" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lookup each token id in the vocabulary.\n", "txt_tokens = tf.gather(en_vocab, token_batch)\n", "# Join with spaces.\n", "tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1)" ] }, { "cell_type": "markdown", "metadata": { "id": "wY2XrhyRem2O" }, "source": [ "To re-assemble words from the extracted tokens, use the `BertTokenizer.detokenize` method:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.064135Z", "iopub.status.busy": "2023-08-11T11:10:34.063886Z", "iopub.status.idle": "2023-08-11T11:10:34.119505Z", "shell.execute_reply": "2023-08-11T11:10:34.118918Z" }, "id": "toBXQSrgemRw" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = en_tokenizer.detokenize(token_batch)\n", "tf.strings.reduce_join(words, separator=' ', axis=-1)" ] }, { "cell_type": "markdown", "metadata": { "id": "WIZWWy_iueQY" }, "source": [ "> Note: `BertTokenizer.tokenize`/`BertTokenizer.detokenize` does not round\n", "trip losslessly. The result of `detokenize` will not, in general, have the\n", "same content or offsets as the input to `tokenize`. 
This is because the\n", "\"basic tokenization\" step, which splits the strings into words before\n", "applying the `WordpieceTokenizer`, includes irreversible\n", "steps like lower-casing and splitting on punctuation. `WordpieceTokenizer`,\n", "on the other hand, **is** reversible." ] }, { "cell_type": "markdown", "metadata": { "id": "_bN30iCexTPY" }, "source": [ "## Customization and export\n", "\n", "This tutorial builds the text tokenizer and detokenizer used by the [Transformer](https://tensorflow.org/text/tutorials/transformer) tutorial. This section adds methods and processing steps to simplify that tutorial, and exports the tokenizers using `tf.saved_model` so they can be imported by the other tutorials." ] }, { "cell_type": "markdown", "metadata": { "id": "5wpc7oFkwgni" }, "source": [ "### Custom tokenization" ] }, { "cell_type": "markdown", "metadata": { "id": "NaUR9hHj0PUy" }, "source": [ "The downstream tutorials expect the tokenized text to include `[START]` and `[END]` tokens.\n", "\n", "The `reserved_tokens` occupy the first slots of the vocabulary, so `[START]` and `[END]` have the same indices in both languages:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.123281Z", "iopub.status.busy": "2023-08-11T11:10:34.122781Z", "iopub.status.idle": "2023-08-11T11:10:34.129983Z", "shell.execute_reply": "2023-08-11T11:10:34.129450Z" }, "id": "gyyoa5De0WQu" }, "outputs": [], "source": [ "START = tf.argmax(tf.constant(reserved_tokens) == \"[START]\")\n", "END = tf.argmax(tf.constant(reserved_tokens) == \"[END]\")\n", "\n", "def add_start_end(ragged):\n", "  count = ragged.bounding_shape()[0]\n", "  starts = tf.fill([count,1], START)\n", "  ends = tf.fill([count,1], END)\n", "  return tf.concat([starts, ragged, ends], axis=1)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.133340Z", 
"iopub.status.busy": "2023-08-11T11:10:34.132753Z", "iopub.status.idle": "2023-08-11T11:10:34.188202Z", "shell.execute_reply": "2023-08-11T11:10:34.187615Z" }, "id": "MrZjQIwZ6NHu" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = en_tokenizer.detokenize(add_start_end(token_batch))\n", "tf.strings.reduce_join(words, separator=' ', axis=-1)" ] }, { "cell_type": "markdown", "metadata": { "id": "WMmHS5VT_suH" }, "source": [ "### Custom detokenization\n", "\n", "Before exporting the tokenizers, there are a couple of things you can clean up for the downstream tutorials:\n", "\n", "1. They want to generate clean text output, so drop reserved tokens like `[START]`, `[END]` and `[PAD]`.\n", "2. They're interested in complete strings, so apply a string join along the `words` axis of the result. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.191505Z", "iopub.status.busy": "2023-08-11T11:10:34.191007Z", "iopub.status.idle": "2023-08-11T11:10:34.195298Z", "shell.execute_reply": "2023-08-11T11:10:34.194731Z" }, "id": "x9vXUQPX1ZFA" }, "outputs": [], "source": [ "def cleanup_text(reserved_tokens, token_txt):\n", "  # Drop the reserved tokens, except for \"[UNK]\".\n", "  bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != \"[UNK]\"]\n", "  bad_token_re = \"|\".join(bad_tokens)\n", "\n", "  bad_cells = tf.strings.regex_full_match(token_txt, bad_token_re)\n", "  result = tf.ragged.boolean_mask(token_txt, ~bad_cells)\n", "\n", "  # Join them into strings.\n", "  result = tf.strings.reduce_join(result, separator=' ', axis=-1)\n", "\n", "  return result" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.198519Z", "iopub.status.busy": "2023-08-11T11:10:34.198142Z", "iopub.status.idle": "2023-08-11T11:10:34.202222Z", 
"shell.execute_reply": "2023-08-11T11:10:34.201689Z" }, "id": "NMSpZUV7sQYw" }, "outputs": [ { "data": { "text/plain": [ "array([b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .',\n", " b'but what if it were active ?',\n", " b\"but they did n't test for curiosity .\"], dtype=object)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "en_examples.numpy()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.205335Z", "iopub.status.busy": "2023-08-11T11:10:34.204817Z", "iopub.status.idle": "2023-08-11T11:10:34.243506Z", "shell.execute_reply": "2023-08-11T11:10:34.242904Z" }, "id": "yB3MJhNvkuBb" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_batch = en_tokenizer.tokenize(en_examples).merge_dims(-2,-1)\n", "words = en_tokenizer.detokenize(token_batch)\n", "words" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.246730Z", "iopub.status.busy": "2023-08-11T11:10:34.246329Z", "iopub.status.idle": "2023-08-11T11:10:34.272062Z", "shell.execute_reply": "2023-08-11T11:10:34.271514Z" }, "id": "ED5rMeZE6HT3" }, "outputs": [ { "data": { "text/plain": [ "array([b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .',\n", " b'but what if it were active ?',\n", " b\"but they did n ' t test for curiosity .\"], dtype=object)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleanup_text(reserved_tokens, words).numpy()" ] }, { "cell_type": "markdown", "metadata": { "id": "HEfEdRi11Re4" }, "source": [ "### Export" ] }, { "cell_type": "markdown", "metadata": { "id": "uFuo1KZjpEPR" }, "source": [ "The following code block builds a 
`CustomTokenizer` class to contain the `text.BertTokenizer` instances, the custom logic, and the `@tf.function` wrappers required for export. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.275552Z", "iopub.status.busy": "2023-08-11T11:10:34.274994Z", "iopub.status.idle": "2023-08-11T11:10:34.284520Z", "shell.execute_reply": "2023-08-11T11:10:34.283895Z" }, "id": "f1q1hCpH72Vj" }, "outputs": [], "source": [ "class CustomTokenizer(tf.Module):\n", "  def __init__(self, reserved_tokens, vocab_path):\n", "    self.tokenizer = text.BertTokenizer(vocab_path, lower_case=True)\n", "    self._reserved_tokens = reserved_tokens\n", "    self._vocab_path = tf.saved_model.Asset(vocab_path)\n", "\n", "    vocab = pathlib.Path(vocab_path).read_text().splitlines()\n", "    self.vocab = tf.Variable(vocab)\n", "\n", "    ## Create the signatures for export:\n", "\n", "    # Include a tokenize signature for a batch of strings.\n", "    self.tokenize.get_concrete_function(\n", "        tf.TensorSpec(shape=[None], dtype=tf.string))\n", "\n", "    # Include `detokenize` and `lookup` signatures for:\n", "    #   * `Tensors` with shapes [tokens] and [batch, tokens]\n", "    #   * `RaggedTensors` with shape [batch, tokens]\n", "    self.detokenize.get_concrete_function(\n", "        tf.TensorSpec(shape=[None, None], dtype=tf.int64))\n", "    self.detokenize.get_concrete_function(\n", "        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))\n", "\n", "    self.lookup.get_concrete_function(\n", "        tf.TensorSpec(shape=[None, None], dtype=tf.int64))\n", "    self.lookup.get_concrete_function(\n", "        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))\n", "\n", "    # These `get_*` methods take no arguments\n", "    self.get_vocab_size.get_concrete_function()\n", "    self.get_vocab_path.get_concrete_function()\n", "    self.get_reserved_tokens.get_concrete_function()\n", "\n", "  @tf.function\n", "  def tokenize(self, strings):\n", "    enc = self.tokenizer.tokenize(strings)\n", "    # Merge the `word` and `word-piece` axes.\n", "    enc = enc.merge_dims(-2,-1)\n", "    enc = add_start_end(enc)\n", "    return enc\n", "\n", "  @tf.function\n", "  def detokenize(self, tokenized):\n", "    words = self.tokenizer.detokenize(tokenized)\n", "    return cleanup_text(self._reserved_tokens, words)\n", "\n", "  @tf.function\n", "  def lookup(self, token_ids):\n", "    return tf.gather(self.vocab, token_ids)\n", "\n", "  @tf.function\n", "  def get_vocab_size(self):\n", "    return tf.shape(self.vocab)[0]\n", "\n", "  @tf.function\n", "  def get_vocab_path(self):\n", "    return self._vocab_path\n", "\n", "  @tf.function\n", "  def get_reserved_tokens(self):\n", "    return tf.constant(self._reserved_tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "RHzEnTQM6nBD" }, "source": [ "Build a `CustomTokenizer` for each language:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:34.287910Z", "iopub.status.busy": "2023-08-11T11:10:34.287468Z", "iopub.status.idle": "2023-08-11T11:10:36.617150Z", "shell.execute_reply": "2023-08-11T11:10:36.616443Z" }, "id": "cU8yFBCSruz4" }, "outputs": [], "source": [ "tokenizers = tf.Module()\n", "tokenizers.pt = CustomTokenizer(reserved_tokens, 'pt_vocab.txt')\n", "tokenizers.en = CustomTokenizer(reserved_tokens, 'en_vocab.txt')" ] }, { "cell_type": "markdown", "metadata": { "id": "ZYfrmDhy6syT" }, "source": [ "Export the tokenizers as a `saved_model`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:36.621401Z", "iopub.status.busy": "2023-08-11T11:10:36.620912Z", "iopub.status.idle": "2023-08-11T11:10:38.869823Z", "shell.execute_reply": "2023-08-11T11:10:38.869113Z" }, "id": "aieDGooa9ms7" }, "outputs": [], "source": [ "model_name = 'ted_hrlr_translate_pt_en_converter'\n", "tf.saved_model.save(tokenizers, model_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "XoCMz2Fm61v6" }, "source": [ "Reload the 
`saved_model` and test the methods:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:38.874346Z", "iopub.status.busy": "2023-08-11T11:10:38.873757Z", "iopub.status.idle": "2023-08-11T11:10:39.621687Z", "shell.execute_reply": "2023-08-11T11:10:39.621100Z" }, "id": "9SB_BHwqsHkb" }, "outputs": [ { "data": { "text/plain": [ "7010" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reloaded_tokenizers = tf.saved_model.load(model_name)\n", "reloaded_tokenizers.en.get_vocab_size().numpy()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:39.625499Z", "iopub.status.busy": "2023-08-11T11:10:39.624852Z", "iopub.status.idle": "2023-08-11T11:10:39.923428Z", "shell.execute_reply": "2023-08-11T11:10:39.922661Z" }, "id": "W_Ze3WL3816x" }, "outputs": [ { "data": { "text/plain": [ "array([[ 2, 4006, 2358, 687, 1192, 2365, 4, 3]])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = reloaded_tokenizers.en.tokenize(['Hello TensorFlow!'])\n", "tokens.numpy()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:39.927142Z", "iopub.status.busy": "2023-08-11T11:10:39.926503Z", "iopub.status.idle": "2023-08-11T11:10:39.955316Z", "shell.execute_reply": "2023-08-11T11:10:39.954642Z" }, "id": "v9o93bzcuhyC" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_tokens = reloaded_tokenizers.en.lookup(tokens)\n", "text_tokens" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:39.958507Z", "iopub.status.busy": "2023-08-11T11:10:39.958038Z", "iopub.status.idle": "2023-08-11T11:10:40.092706Z", "shell.execute_reply": 
"2023-08-11T11:10:40.091993Z" }, "id": "Y0205N_8dDT5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello tensorflow !\n" ] } ], "source": [ "round_trip = reloaded_tokenizers.en.detokenize(tokens)\n", "\n", "print(round_trip.numpy()[0].decode('utf-8'))" ] }, { "cell_type": "markdown", "metadata": { "id": "pSKFDQoBjnNp" }, "source": [ "Archive it for the [translation tutorials](https://tensorflow.org/text/tutorials/transformer):" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:40.096200Z", "iopub.status.busy": "2023-08-11T11:10:40.095722Z", "iopub.status.idle": "2023-08-11T11:10:40.324144Z", "shell.execute_reply": "2023-08-11T11:10:40.323179Z" }, "id": "eY0SoE3Yj2it" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " adding: ted_hrlr_translate_pt_en_converter/ (stored 0%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/variables/ (stored 0%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/variables/variables.data-00000-of-00001 (deflated 51%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/variables/variables.index (deflated 33%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/assets/ (stored 0%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/assets/en_vocab.txt (deflated 54%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/assets/pt_vocab.txt (deflated 57%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/saved_model.pb (deflated 91%)\r\n", " adding: ted_hrlr_translate_pt_en_converter/fingerprint.pb (stored 0%)\r\n" ] } ], "source": [ "!zip -r {model_name}.zip {model_name}" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:40.328461Z", "iopub.status.busy": "2023-08-11T11:10:40.327855Z", "iopub.status.idle": "2023-08-11T11:10:40.516972Z", "shell.execute_reply": "2023-08-11T11:10:40.516101Z" }, "id": "0Synq0RekAXe" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "168K\tted_hrlr_translate_pt_en_converter.zip\r\n" ] } ], "source": [ "!du -h *.zip" ] }, { "cell_type": "markdown", "metadata": { "id": "AtmGkGBuGHa2" }, "source": [ "\n", "\n", "## Optional: The algorithm\n", "\n", "\n", "It's worth noting here that there are two versions of the WordPiece algorithm: bottom-up and top-down. In both cases the goal is the same: \"Given a training corpus and a number of desired\n", "tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the\n", "number of wordpieces when segmented according to the chosen wordpiece model.\"\n", "\n", "The original [bottom-up WordPiece algorithm](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) is based on [byte-pair encoding](https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10). Like BPE, it starts with the alphabet and iteratively combines common bigrams to form word-pieces and words.\n", "\n", "TensorFlow Text's vocabulary generator follows the top-down implementation from [BERT](https://arxiv.org/pdf/1810.04805.pdf). It starts with words and breaks them down into smaller components until they hit the frequency threshold or can't be broken down further. The next section describes this in detail. For Japanese, Chinese and Korean this top-down approach doesn't work since there are no explicit word units to start with. For those you need a [different approach](https://tfhub.dev/google/zh_segmentation/1).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FLA2QhffYEo0" }, "source": [ "### Choosing the vocabulary\n", "\n", "The top-down WordPiece generation algorithm takes in a set of (word, count) pairs and a threshold `T`, and returns a vocabulary `V`.\n", "\n", "The algorithm is iterative. It is run for `k` iterations, where typically `k = 4`, but only the first two are really important. 
The third and fourth (and beyond) are identical to the second. Note that each step of the binary search over the threshold `T` runs the algorithm from scratch for `k` iterations.\n", "\n", "The iterations are described below:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZqfY0p3PYIKr" }, "source": [ "#### First iteration\n", "\n", "1. Iterate over every word and count pair in the input, denoted as `(w, c)`.\n", "2. For each word `w`, generate every substring, denoted as `s`. E.g., for the\n", " word `human`, we generate `{h, hu, hum, huma,\n", " human, ##u, ##um, ##uma, ##uman, ##m, ##ma, ##man, ##a, ##an, ##n}`.\n", "3. Maintain a substring-to-count hash map, and increment the count of each `s`\n", " by `c`. E.g., if we have `(human, 113)` and `(humas, 3)` in our input, the\n", " count of `s = huma` will be `113+3=116`.\n", "4. Once we've collected the counts of every substring, iterate over the `(s,\n", " c)` pairs *starting with the longest `s` first*.\n", "5. Keep any `s` that has a `c > T`. E.g., if `T = 100` and we have `(pers,\n", " 231); (dogs, 259); (##rint, 76)`, then we would keep `pers` and `dogs`.\n", "6. When an `s` is kept, subtract off its count from all of its prefixes. This\n", " is the reason for sorting all of the `s` by length in step 4. This is a\n", " critical part of the algorithm, because otherwise words would be double\n", " counted. For example, let's say that we've kept `human` and we get to\n", " `(huma, 116)`. We know that `113` of those `116` came from `human`, and `3`\n", " came from `humas`. However, now that `human` is in our vocabulary, we know\n", " we will never segment `human` into `huma ##n`. So once `human` has been\n", " kept, then `huma` only has an *effective* count of `3`.\n", "\n", "This algorithm will generate a set of word pieces `s` (many of which will be\n", "whole words `w`), which we *could* use as our WordPiece vocabulary.\n", "\n", "However, there is a problem: This algorithm will severely overgenerate word\n", "pieces. 
The reason is that we only subtract off counts of prefix tokens.\n", "Therefore, if we keep the word `human`, we will subtract off the count for `h,\n", "hu, hum, huma`, but not for `##u, ##um, ##uma, ##uman` and so on. So we might\n", "generate both `human` and `##uman` as word pieces, even though `##uman` will\n", "never be applied.\n", "\n", "So why not subtract off the counts for every *substring*, not just every\n", "*prefix*? Because then we could end up subtracting off the counts multiple\n", "times. Let's say that we're processing `s` of length 5 and we keep both\n", "`(##denia, 129)` and `(##eniab, 137)`, where `65` of those counts came from the\n", "word `undeniable`. If we subtract off from *every* substring, we would subtract\n", "`65` from the substring `##enia` twice, even though we should only subtract\n", "once. However, if we only subtract off from prefixes, it will correctly only be\n", "subtracted once." ] }, { "cell_type": "markdown", "metadata": { "id": "NNCtKR8xT9wX" }, "source": [ "#### Second (and third ...) iteration\n", "\n", "To solve the overgeneration issue mentioned above, we perform multiple\n", "iterations of the algorithm.\n", "\n", "Subsequent iterations are identical to the first, with one important\n", "distinction: In step 2, instead of considering *every* substring, we apply the\n", "WordPiece tokenization algorithm using the vocabulary from the previous\n", "iteration, and only consider substrings which *start* on a split point.\n", "\n", "For example, let's say that we're performing step 2 of the algorithm and\n", "encounter the word `undeniable`. In the first iteration, we would consider every\n", "substring, e.g., `{u, un, und, ..., undeniable, ##n, ##nd, ..., ##ndeniable,\n", "...}`.\n", "\n", "Now, for the second iteration, we will only consider a subset of these. 
Let's\n", "say that after the first iteration, the relevant word pieces are:\n", "\n", "`un, ##deni, ##able, ##ndeni, ##iable`\n", "\n", "The WordPiece algorithm will segment this into `un ##deni ##able` (see the\n", "section [Applying WordPiece](#applying-wordpiece) for more information). In this\n", "case, we will only consider substrings that *start* at a segmentation point. We\n", "will still consider every possible *end* position. So during the second\n", "iteration, the set of `s` for `undeniable` is:\n", "\n", "`{u, un, und, unde, unden, undeni, undenia, undeniab, undeniabl,\n", "undeniable, ##d, ##de, ##den, ##deni, ##denia, ##deniab, ##deniabl,\n", "##deniable, ##a, ##ab, ##abl, ##able}`\n", "\n", "The algorithm is otherwise identical. In this example, in the first iteration,\n", "the algorithm produces the spurious tokens `##ndeni` and `##iable`. Now, these\n", "tokens are never considered, so they will not be generated by the second\n", "iteration. We perform several iterations just to make sure the results converge\n", "(although there is no literal convergence guarantee).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "AdUkqe84YQA5" }, "source": [ "### Applying WordPiece\n", "\n", "\n", "\n", "Once a WordPiece vocabulary has been generated, we need to be able to apply it\n", "to new data. The algorithm is a simple greedy longest-match-first application.\n", "\n", "For example, consider segmenting the word `undeniable`.\n", "\n", "We first look up `undeniable` in our WordPiece dictionary, and if it's present,\n", "we're done. If not, we decrement the end point by one character, and repeat,\n", "e.g., `undeniabl`.\n", "\n", "Eventually, we will either find a subtoken in our vocabulary, or get down to a\n", "single-character subtoken. (In general, we assume that every character is in our\n", "vocabulary, although this might not be the case for rare Unicode characters. 
If\n", "we encounter a rare Unicode character that's not in the vocabulary we simply map\n", "the entire word to `<unk>`).\n", "\n", "In this case, we find `un` in our vocabulary. So that's our first word piece.\n", "Then we jump to the end of `un` and repeat the processing, e.g., try to find\n", "`##deniable`, then `##deniabl`, etc. This is repeated until we've segmented the\n", "entire word." ] }, { "cell_type": "markdown", "metadata": { "id": "rjRQKQzpYMl2" }, "source": [ "### Intuition\n", "\n", "Intuitively, WordPiece tokenization is trying to satisfy two different\n", "objectives:\n", "\n", "1. Tokenize the data into the *fewest* pieces possible. It is\n", " important to keep in mind that the WordPiece algorithm does not \"want\" to\n", " split words. Otherwise, it would just split every word into its characters,\n", " e.g., `human -> {h, ##u, ##m, ##a, ##n}`. This is one critical thing that\n", " makes WordPiece different from morphological splitters, which will split\n", " linguistic morphemes even for common words (e.g., `unwanted -> {un, want,\n", " ed}`).\n", "\n", "2. When a word does have to be split into pieces, split it into pieces that\n", " have maximal counts in the training data. For example, the reason why the\n", " word `undeniable` would be split into `{un, ##deni, ##able}` rather than\n", " alternatives like `{unde, ##niab, ##le}` is that the counts for `un` and\n", " `##able` in particular will be very high, since these are common prefixes\n", " and suffixes. Even though the count for `##le` must be higher than that for `##able`,\n", " the low counts of `unde` and `##niab` will make this a less \"desirable\"\n", " tokenization to the algorithm." 
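, "\n", "\n",
"The greedy longest-match-first lookup described under *Applying WordPiece* can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `tensorflow_text` implementation: the function name `wordpiece_segment`, the toy vocabulary, and the `[UNK]` fallback token are all hypothetical choices made for this example.\n",
"\n",
"```python\n",
"def wordpiece_segment(word, vocab, unk_token='[UNK]'):\n",
"  # Greedy longest-match-first segmentation of a single word.\n",
"  # `unk_token` is an assumed name for the unknown-word token.\n",
"  pieces = []\n",
"  start = 0\n",
"  while start < len(word):\n",
"    end = len(word)\n",
"    match = None\n",
"    # Shrink the end point one character at a time until a piece matches.\n",
"    while end > start:\n",
"      candidate = word[start:end]\n",
"      if start > 0:\n",
"        candidate = '##' + candidate  # Non-initial pieces carry the ## prefix.\n",
"      if candidate in vocab:\n",
"        match = candidate\n",
"        break\n",
"      end -= 1\n",
"    if match is None:\n",
"      # Not even a single character matched: map the whole word to unknown.\n",
"      return [unk_token]\n",
"    pieces.append(match)\n",
"    start = end  # Jump to the end of the matched piece and repeat.\n",
"  return pieces\n",
"\n",
"toy_vocab = {'un', '##deni', '##able', '##ndeni', '##iable'}\n",
"print(wordpiece_segment('undeniable', toy_vocab))\n",
"```\n",
"\n",
"With this toy vocabulary the word is segmented as `un ##deni ##able`. The pieces `##ndeni` and `##iable` are never selected, because matching always restarts at the end of the previous piece."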
] }, { "cell_type": "markdown", "metadata": { "id": "KQZ38Uus-Xv1" }, "source": [ "## Optional: tf.lookup\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "NreDSRmJNG_h" }, "source": [ "If you need access to, or more control over, the vocabulary, note that you can build the lookup table yourself and pass it to `BertTokenizer`.\n", "\n", "When you pass the vocabulary as a file path string, `BertTokenizer` does the following:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:40.521579Z", "iopub.status.busy": "2023-08-11T11:10:40.521287Z", "iopub.status.idle": "2023-08-11T11:10:40.528923Z", "shell.execute_reply": "2023-08-11T11:10:40.528340Z" }, "id": "thAF1DzQOQXl" }, "outputs": [], "source": [ "pt_lookup = tf.lookup.StaticVocabularyTable(\n", "    num_oov_buckets=1,\n", "    initializer=tf.lookup.TextFileInitializer(\n", "        filename='pt_vocab.txt',\n", "        key_dtype=tf.string,\n", "        key_index=tf.lookup.TextFileIndex.WHOLE_LINE,\n", "        value_dtype=tf.int64,\n", "        value_index=tf.lookup.TextFileIndex.LINE_NUMBER))\n", "pt_tokenizer = text.BertTokenizer(pt_lookup)" ] }, { "cell_type": "markdown", "metadata": { "id": "ERY4FYN7O66R" }, "source": [ "Now you have direct access to the lookup table used in the tokenizer." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:40.532362Z", "iopub.status.busy": "2023-08-11T11:10:40.531811Z", "iopub.status.idle": "2023-08-11T11:10:40.539287Z", "shell.execute_reply": "2023-08-11T11:10:40.538647Z" }, "id": "337_DcAMOs6N" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pt_lookup.lookup(tf.constant(['é', 'um', 'uma', 'para', 'não']))" ] }, { "cell_type": "markdown", "metadata": { "id": "BdZ82x5mPDE9" }, "source": [ "You don't need to use a vocabulary file; `tf.lookup` has other initializer options. 
If you have the vocabulary in memory you can use `lookup.KeyValueTensorInitializer`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "execution": { "iopub.execute_input": "2023-08-11T11:10:40.542840Z", "iopub.status.busy": "2023-08-11T11:10:40.542258Z", "iopub.status.idle": "2023-08-11T11:10:40.555927Z", "shell.execute_reply": "2023-08-11T11:10:40.555329Z" }, "id": "mzkrmO9H-b9i" }, "outputs": [], "source": [ "pt_lookup = tf.lookup.StaticVocabularyTable(\n", " num_oov_buckets=1,\n", " initializer=tf.lookup.KeyValueTensorInitializer(\n", " keys=pt_vocab,\n", " values=tf.range(len(pt_vocab), dtype=tf.int64))) \n", "pt_tokenizer = text.BertTokenizer(pt_lookup)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "subwords_tokenizer.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" } }, "nbformat": 4, "nbformat_minor": 0 }