{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "albert_glue_fine_tuning_tutorial",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "TPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y8SJfpgTccDB",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "<a href=\"https://colab.research.google.com/github/google-research/albert/blob/master/albert_glue_fine_tuning_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wHQH4OCHZ9bq",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "# @title Copyright 2020 The ALBERT Authors. All Rights Reserved.\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License.\n",
        "# =============================================================================="
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rkTLZ3I4_7c_",
        "colab_type": "text"
      },
      "source": [
        "# ALBERT End to End (Fine-tuning + Predicting) with Cloud TPU"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1wtjs1QDb3DX",
        "colab_type": "text"
      },
      "source": [
        "## Overview\n",
        "\n",
        "ALBERT is \"A Lite\" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation.\n",
        "\n",
        "For a technical description of the algorithm, see our paper:\n",
        "\n",
        "https://arxiv.org/abs/1909.11942\n",
        "\n",
        "Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut\n",
        "\n",
        "This Colab demonstates using a free Colab Cloud TPU to fine-tune GLUE tasks built on top of pretrained ALBERT models and \n",
        "run predictions on tuned model. The colab demonsrates loading pretrained ALBERT models from both [TF Hub](https://www.tensorflow.org/hub) and checkpoints.\n",
        "\n",
        "**Note:**  You will need a GCP (Google Compute Engine) account and a GCS (Google Cloud \n",
        "Storage) bucket for this Colab to run.\n",
        "\n",
        "Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) for how to create GCP account and GCS bucket. You have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.\n",
        "\n",
        "This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook, select **File > View on GitHub**."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ld-JXlueIuPH",
        "colab_type": "text"
      },
      "source": [
        "## Instructions"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POkof5uHaQ_c",
        "colab_type": "text"
      },
      "source": [
        "<h3><a href=\"https://cloud.google.com/tpu/\"><img valign=\"middle\" src=\"https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png\" width=\"50\"></a>  &nbsp;&nbsp;Train on TPU</h3>\n",
        "\n",
        "   1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the \"Parameters\" section below.\n",
        " \n",
        "   1. On the main menu, click Runtime and select **Change runtime type**. Set \"TPU\" as the hardware accelerator.\n",
        "   1. Click Runtime again and select **Runtime > Run All** (Watch out: the \"Colab-only auth for this notebook and the TPU\" cell requires user input). You can also run the cells manually with Shift-ENTER."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UdMmwCJFaT8F",
        "colab_type": "text"
      },
      "source": [
        "### Set up your TPU environment\n",
        "\n",
        "In this section, you perform the following tasks:\n",
        "\n",
        "*   Set up a Colab TPU running environment\n",
        "*   Verify that you are connected to a TPU device\n",
        "*   Upload your credentials to TPU to access your GCS bucket."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "191zq3ZErihP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# TODO(lanzhzh): Add support for 2.x.\n",
        "%tensorflow_version 1.x\n",
        "import os\n",
        "import pprint\n",
        "import json\n",
        "import tensorflow as tf\n",
        "\n",
        "assert \"COLAB_TPU_ADDR\" in os.environ, \"ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!\"\n",
        "TPU_ADDRESS = \"grpc://\" + os.environ[\"COLAB_TPU_ADDR\"] \n",
        "TPU_TOPOLOGY = \"2x2\"\n",
        "print(\"TPU address is\", TPU_ADDRESS)\n",
        "\n",
        "from google.colab import auth\n",
        "auth.authenticate_user()\n",
        "with tf.Session(TPU_ADDRESS) as session:\n",
        "  print('TPU devices:')\n",
        "  pprint.pprint(session.list_devices())\n",
        "\n",
        "  # Upload credentials to TPU.\n",
        "  with open('/content/adc.json', 'r') as f:\n",
        "    auth_info = json.load(f)\n",
        "  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)\n",
        "    # Now credentials are set for all future sessions on this TPU."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HUBP35oCDmbF",
        "colab_type": "text"
      },
      "source": [
        "### Prepare and import ALBERT modules\n",
        "​\n",
        "With your environment configured, you can now prepare and import the ALBERT modules. The following step clones the source code from GitHub."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7wzwke0sxS6W",
        "colab_type": "code",
        "colab": {},
        "cellView": "code"
      },
      "source": [
        "#TODO(lanzhzh): Add pip support\n",
        "import sys\n",
        "\n",
        "!test -d albert || git clone https://github.com/google-research/albert albert\n",
        "if not 'albert' in sys.path:\n",
        "  sys.path += ['albert']\n",
        "  \n",
        "!pip install sentencepiece\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RRu1aKO1D7-Z",
        "colab_type": "text"
      },
      "source": [
        "### Prepare for training\n",
        "\n",
        "This next section of code performs the following tasks:\n",
        "\n",
        "*  Specify GS bucket, create output directory for model checkpoints and eval results.\n",
        "*  Specify task and download training data.\n",
        "*  Specify ALBERT pretrained model\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tYkaAlJNfhul",
        "colab_type": "code",
        "colab": {},
        "cellView": "form"
      },
      "source": [
        "# Please find the full list of tasks and their fintuning hyperparameters\n",
        "# here https://github.com/google-research/albert/blob/master/run_glue.sh\n",
        "\n",
        "BUCKET = \"albert_tutorial_glue\" #@param { type: \"string\" }\n",
        "TASK = 'MRPC' #@param {type:\"string\"}\n",
        "# Available pretrained model checkpoints:\n",
        "#   base, large, xlarge, xxlarge\n",
        "ALBERT_MODEL = 'base' #@param {type:\"string\"}\n",
        "\n",
        "TASK_DATA_DIR = 'glue_data'\n",
        "\n",
        "BASE_DIR = \"gs://\" + BUCKET\n",
        "if not BASE_DIR or BASE_DIR == \"gs://\":\n",
        "  raise ValueError(\"You must enter a BUCKET.\")\n",
        "DATA_DIR = os.path.join(BASE_DIR, \"data\")\n",
        "MODELS_DIR = os.path.join(BASE_DIR, \"models\")\n",
        "OUTPUT_DIR = 'gs://{}/albert-tfhub/models/{}'.format(BUCKET, TASK)\n",
        "tf.gfile.MakeDirs(OUTPUT_DIR)\n",
        "print('***** Model output directory: {} *****'.format(OUTPUT_DIR))\n",
        "\n",
        "# Download glue data.\n",
        "! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo\n",
        "!python download_glue_repo/download_glue_data.py --data_dir=$TASK_DATA_DIR --tasks=$TASK\n",
        "print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))\n",
        "\n",
        "ALBERT_MODEL_HUB = 'https://tfhub.dev/google/albert_' + ALBERT_MODEL + '/3'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Hcpfl4N2EdOk",
        "colab_type": "text"
      },
      "source": [
        "Now let's run the fine-tuning scripts. If you use the default MRPC task, this should be finished in around 10 mintues and you will get an accuracy of around 86.5."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "o8qXPxv8-kBO",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR\n",
        "!python -m albert.run_classifier \\\n",
        "  --data_dir=\"glue_data/\" \\\n",
        "  --output_dir=$OUTPUT_DIR \\\n",
        "  --albert_hub_module_handle=$ALBERT_MODEL_HUB \\\n",
        "  --spm_model_file=\"from_tf_hub\" \\\n",
        "  --do_train=True \\\n",
        "  --do_eval=True \\\n",
        "  --do_predict=False \\\n",
        "  --max_seq_length=512 \\\n",
        "  --optimizer=adamw \\\n",
        "  --task_name=$TASK \\\n",
        "  --warmup_step=200 \\\n",
        "  --learning_rate=2e-5 \\\n",
        "  --train_step=800 \\\n",
        "  --save_checkpoints_steps=100 \\\n",
        "  --train_batch_size=32 \\\n",
        "  --tpu_name=$TPU_ADDRESS \\\n",
        "  --use_tpu=True"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}