{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OiBSu3YkEcoX"
      },
      "source": [
        "Copyright 2024 DeepMind Technologies Limited.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
        "\n",
        "http://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Y5OeTiryEcoX"
      },
      "source": [
        "# Fine-tuning the 2B Griffin model with Flax\n",
        "\n",
        "In this tutorial you will learn how to fine-tune the 2B Griffin model for a simple translation task."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5m81VQOqEcoX"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Cloning into 'recurrentgemma'...\n",
            "remote: Enumerating objects: 52, done.\u001b[K\n",
            "remote: Counting objects: 100% (49/49), done.\u001b[K\n",
            "remote: Compressing objects: 100% (47/47), done.\u001b[K\n",
            "remote: Total 52 (delta 16), reused 5 (delta 2), pack-reused 3\u001b[K\n",
            "Receiving objects: 100% (52/52), 74.57 KiB | 1.01 MiB/s, done.\n",
            "Resolving deltas: 100% (16/16), done.\n"
          ]
        }
      ],
      "source": [
        "!git clone https://github.com/google-deepmind/recurrentgemma.git"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "cellView": "form",
        "id": "XpSw-_4EEcoY"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[33mDEPRECATION: git+https://github.com/google-deepmind/recurrentgemma.git#egg=recurrentgemma[jax] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617\u001b[0m\u001b[33m\n",
            "\u001b[0mCollecting recurrentgemma[jax]\n",
            "  Cloning https://github.com/google-deepmind/recurrentgemma.git to /private/var/folders/jx/gld2clwj7sd_q8hd2m6hztcr0000gn/T/pip-install-2c9hrit5/recurrentgemma_54f0084d6e164dc38004db09c24dfacb\n",
            "  Running command git clone --filter=blob:none --quiet https://github.com/google-deepmind/recurrentgemma.git /private/var/folders/jx/gld2clwj7sd_q8hd2m6hztcr0000gn/T/pip-install-2c9hrit5/recurrentgemma_54f0084d6e164dc38004db09c24dfacb\n",
            "  Resolved https://github.com/google-deepmind/recurrentgemma.git to commit 0f5ca57442f17c7309c70b0228fd8e5505cbdaa1\n",
            "  Installing build dependencies ... \u001b[?25ldone\n",
            "\u001b[?25h  Getting requirements to build wheel ... \u001b[?25ldone\n",
            "\u001b[?25h  Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n",
            "\u001b[?25hRequirement already satisfied: numpy<2.0,>=1.21 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from recurrentgemma[jax]) (1.24.4)\n",
            "Requirement already satisfied: einops<0.8.0,>=0.7.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from recurrentgemma[jax]) (0.7.0)\n",
            "Collecting jaxtyping<0.3.0,>=0.2.28\n",
            "  Downloading jaxtyping-0.2.28-py3-none-any.whl (40 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.7/40.7 kB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting absl-py<1.5.0,>=1.4.0\n",
            "  Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m126.5/126.5 kB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting sentencepiece<0.3.0,>=0.2.0\n",
            "  Downloading sentencepiece-0.2.0-cp310-cp310-macosx_11_0_arm64.whl (1.2 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m27.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting orbax-checkpoint==0.5.7\n",
            "  Downloading orbax_checkpoint-0.5.7-py3-none-any.whl (159 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m159.2/159.2 kB\u001b[0m \u001b[31m15.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting jax<0.5.0,>=0.4.23\n",
            "  Downloading jax-0.4.26-py3-none-any.whl (1.9 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.9/1.9 MB\u001b[0m \u001b[31m31.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
            "\u001b[?25hCollecting flax<0.9.0,>=0.8.2\n",
            "  Downloading flax-0.8.2-py3-none-any.whl (686 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m686.8/686.8 kB\u001b[0m \u001b[31m43.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting etils[epath,epy]\n",
            "  Downloading etils-1.7.0-py3-none-any.whl (152 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m152.4/152.4 kB\u001b[0m \u001b[31m18.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: typing_extensions in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from orbax-checkpoint==0.5.7->recurrentgemma[jax]) (4.9.0)\n",
            "Requirement already satisfied: pyyaml in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from orbax-checkpoint==0.5.7->recurrentgemma[jax]) (6.0.1)\n",
            "Collecting tensorstore>=0.1.51\n",
            "  Downloading tensorstore-0.1.56-cp310-cp310-macosx_11_0_arm64.whl (13.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.0/13.0 MB\u001b[0m \u001b[31m14.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting msgpack\n",
            "  Downloading msgpack-1.0.8-cp310-cp310-macosx_11_0_arm64.whl (84 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.9/84.9 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting jaxlib\n",
            "  Downloading jaxlib-0.4.26-cp310-cp310-macosx_11_0_arm64.whl (66.7 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66.7/66.7 MB\u001b[0m \u001b[31m32.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: nest_asyncio in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from orbax-checkpoint==0.5.7->recurrentgemma[jax]) (1.6.0)\n",
            "Requirement already satisfied: protobuf in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from orbax-checkpoint==0.5.7->recurrentgemma[jax]) (4.25.2)\n",
            "Collecting optax\n",
            "  Downloading optax-0.2.2-py3-none-any.whl (223 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m223.7/223.7 kB\u001b[0m \u001b[31m29.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: rich>=11.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from flax<0.9.0,>=0.8.2->recurrentgemma[jax]) (13.7.1)\n",
            "Requirement already satisfied: scipy>=1.9 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from jax<0.5.0,>=0.4.23->recurrentgemma[jax]) (1.12.0)\n",
            "Collecting ml-dtypes>=0.2.0\n",
            "  Downloading ml_dtypes-0.4.0-cp310-cp310-macosx_10_9_universal2.whl (390 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m390.9/390.9 kB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting opt-einsum\n",
            "  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m65.5/65.5 kB\u001b[0m \u001b[31m9.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting typeguard==2.13.3\n",
            "  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)\n",
            "Requirement already satisfied: markdown-it-py>=2.2.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from rich>=11.1->flax<0.9.0,>=0.8.2->recurrentgemma[jax]) (3.0.0)\n",
            "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from rich>=11.1->flax<0.9.0,>=0.8.2->recurrentgemma[jax]) (2.17.2)\n",
            "Collecting zipp\n",
            "  Downloading zipp-3.18.1-py3-none-any.whl (8.2 kB)\n",
            "Requirement already satisfied: fsspec in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from etils[epath,epy]->orbax-checkpoint==0.5.7->recurrentgemma[jax]) (2023.10.0)\n",
            "Requirement already satisfied: importlib_resources in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from etils[epath,epy]->orbax-checkpoint==0.5.7->recurrentgemma[jax]) (6.1.2)\n",
            "Collecting chex>=0.1.86\n",
            "  Downloading chex-0.1.86-py3-none-any.whl (98 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.2/98.2 kB\u001b[0m \u001b[31m15.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: toolz>=0.9.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from chex>=0.1.86->optax->flax<0.9.0,>=0.8.2->recurrentgemma[jax]) (0.12.1)\n",
            "Requirement already satisfied: mdurl~=0.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax<0.9.0,>=0.8.2->recurrentgemma[jax]) (0.1.2)\n",
            "Building wheels for collected packages: recurrentgemma\n",
            "  Building wheel for recurrentgemma (pyproject.toml) ... \u001b[?25ldone\n",
            "\u001b[?25h  Created wheel for recurrentgemma: filename=recurrentgemma-0.1.0-py3-none-any.whl size=73483 sha256=fb0155d9d3fe031716dcb26e7c11b10a02f545879b13d6f5286eb200ec90cd86\n",
            "  Stored in directory: /private/var/folders/jx/gld2clwj7sd_q8hd2m6hztcr0000gn/T/pip-ephem-wheel-cache-62nk7qne/wheels/31/37/18/c57f1df6091b661385ab728b959bdfbf2078d9fc7c856899e4\n",
            "Successfully built recurrentgemma\n",
            "Installing collected packages: sentencepiece, zipp, typeguard, opt-einsum, msgpack, ml-dtypes, etils, absl-py, tensorstore, jaxtyping, jaxlib, jax, recurrentgemma, chex, orbax-checkpoint, optax, flax\n",
            "  Attempting uninstall: sentencepiece\n",
            "    Found existing installation: sentencepiece 0.1.99\n",
            "    Uninstalling sentencepiece-0.1.99:\n",
            "      Successfully uninstalled sentencepiece-0.1.99\n",
            "  Attempting uninstall: absl-py\n",
            "    Found existing installation: absl-py 2.1.0\n",
            "    Uninstalling absl-py-2.1.0:\n",
            "      Successfully uninstalled absl-py-2.1.0\n",
            "Successfully installed absl-py-1.4.0 chex-0.1.86 etils-1.7.0 flax-0.8.2 jax-0.4.26 jaxlib-0.4.26 jaxtyping-0.2.28 ml-dtypes-0.4.0 msgpack-1.0.8 opt-einsum-3.3.0 optax-0.2.2 orbax-checkpoint-0.5.7 recurrentgemma-0.1.0 sentencepiece-0.2.0 tensorstore-0.1.56 typeguard-2.13.3 zipp-3.18.1\n",
            "\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
            "\u001b[31mERROR: Could not find a version that satisfies the requirement tensorflow-cpu (from versions: none)\u001b[0m\u001b[31m\n",
            "\u001b[0m\u001b[31mERROR: No matching distribution found for tensorflow-cpu\u001b[0m\u001b[31m\n",
            "\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
            "\u001b[31mERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.\u001b[0m\u001b[31m\n",
            "\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
            "Requirement already satisfied: datasets in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (2.16.1)\n",
            "Requirement already satisfied: pyarrow-hotfix in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (0.6)\n",
            "Requirement already satisfied: xxhash in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (3.4.1)\n",
            "Requirement already satisfied: requests>=2.19.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (2.31.0)\n",
            "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (2023.10.0)\n",
            "Requirement already satisfied: numpy>=1.17 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (1.24.4)\n",
            "Requirement already satisfied: pandas in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (2.2.0)\n",
            "Requirement already satisfied: multiprocess in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (0.70.15)\n",
            "Requirement already satisfied: packaging in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (23.2)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (6.0.1)\n",
            "Requirement already satisfied: huggingface-hub>=0.19.4 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (0.20.3)\n",
            "Requirement already satisfied: filelock in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (3.13.1)\n",
            "Requirement already satisfied: tqdm>=4.62.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (4.66.1)\n",
            "Requirement already satisfied: pyarrow>=8.0.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (15.0.0)\n",
            "Requirement already satisfied: dill<0.3.8,>=0.3.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (0.3.7)\n",
            "Requirement already satisfied: aiohttp in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from datasets) (3.9.1)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (23.2.0)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.1)\n",
            "Requirement already satisfied: yarl<2.0,>=1.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.4)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.4)\n",
            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n",
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from huggingface-hub>=0.19.4->datasets) (4.9.0)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2023.11.17)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (3.6)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (3.3.2)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2.2.0)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)\n",
            "Requirement already satisfied: tzdata>=2022.7 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from pandas->datasets) (2023.4)\n",
            "Requirement already satisfied: pytz>=2020.1 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from pandas->datasets) (2023.4)\n",
            "Requirement already satisfied: six>=1.5 in /Users/tybalex/.pyenv/versions/3.10.12/envs/new3102/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
            "\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "# @title Installation\n",
        "! pip install 'git+https://github.com/google-deepmind/recurrentgemma.git#egg=recurrentgemma[jax]'\n",
        "! pip install tensorflow-cpu  # Might require a session restart\n",
        "! pip install --user kaggle\n",
        "! pip install datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "yWaP_LPoEcoY"
      },
      "outputs": [
        {
          "ename": "ModuleNotFoundError",
          "evalue": "No module named 'tensorflow'",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
            "Cell \u001b[0;32mIn[10], line 20\u001b[0m\n\u001b[1;32m     17\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mrecurrentgemma\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m jax \u001b[38;5;28;01mas\u001b[39;00m recurrentgemma\n\u001b[1;32m     19\u001b[0m \u001b[38;5;66;03m# We will use tensorflow to handle the dataset\u001b[39;00m\n\u001b[0;32m---> 20\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mtensorflow\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mtf\u001b[39;00m\n\u001b[1;32m     21\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mtensorflow_datasets\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mtfds\u001b[39;00m\n",
            "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'tensorflow'"
          ]
        }
      ],
      "source": [
        "# @title Python imports\n",
        "import pathlib\n",
        "from typing import Any, Mapping, Iterator\n",
        "import enum\n",
        "import functools\n",
        "\n",
        "# We import JAX and some related packages.\n",
        "import chex\n",
        "import jax\n",
        "import jax.numpy as jnp\n",
        "import optax\n",
        "\n",
        "\n",
        "\n",
        "# Finally, we import Recurrentgemma.\n",
        "import sentencepiece as spm\n",
        "from recurrentgemma import jax as recurrentgemma\n",
        "\n",
        "# We will use tensorflow to handle the dataset\n",
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iLafhtv3Rg5F"
      },
      "source": [
        "### Downloading the checkpoint\n",
        "\n",
        "To use Griffin's checkpoints, you'll need a Kaggle account and API key. Here's how to get them:\n",
        "\n",
        "1. Visit https://www.kaggle.com/ and create an account.\n",
        "2. Go to your account settings, then the 'API' section.\n",
        "3. Click 'Create new token' to download your key.\n",
        "\n",
        "You will also need to accept the Terms and Conditions of the RecurrentGemma models at https://www.kaggle.com/models/google/recurrentgemma/ before you can download the model weights and the tokenizer.\n",
        "\n",
        "Then run the cell below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jCZSmEVDVv6O"
      },
      "source": [
        "If everything went well, you should see:\n",
        "```\n",
        "Kaggle credentials set.\n",
        "Kaggle credentials successfully validated.\n",
        "```\n",
        "\n",
        "Now select and download the checkpoint you want to try. The 2b model can fit in memory for fine-tuning."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DVgmx04E2ztl"
      },
      "source": [
        "Before continuing, visit the Kaggle page and agree to its terms."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "RoUb7Shg-bex"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "fatal: destination path 'recurrentg-2b-it' already exists and is not an empty directory.\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/Users/tybalex/.pyenv/versions/3.10.12/lib/python3.10/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
            "  pid, fd = os.forkpty()\n"
          ]
        }
      ],
      "source": [
        "!git clone https://huggingface.co/yingbei/recurrentg-2b-it\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "1TOdNwcNBhno"
      },
      "outputs": [],
      "source": [
        "VARIANT = '2b-it' # @param ['2b', '2b-it'] {type:\"string\"}\n",
        "weights_dir = pathlib.Path(\"./recurrentg-2b-it\")\n",
        "ckpt_path = weights_dir / VARIANT\n",
        "vocab_path = weights_dir / 'tokenizer.model'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ejQhgtjbEcoY"
      },
      "source": [
        "## Step 1: prepare the dataset\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XeynYJXCEymJ"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "code_sharegpt = load_dataset(\"sanjay920/code74k-sharegpt\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yDhp3v7DFSUd"
      },
      "outputs": [],
      "source": [
        "code_sharegpt[\"train\"][0][\"conversations\"]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jOMGn19rG5JE"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "chat_prefix = \"<start_of_turn>\"\n",
        "chat_suffix = \"<end_of_turn>\"\n",
        "user_role = \"user\\n\"\n",
        "preprocessed_code_sharegpt_data = []\n",
        "for example in code_sharegpt[\"train\"]:\n",
        "  # Each record stores its conversation as a JSON-encoded list of turns.\n",
        "  c = json.loads(example[\"conversations\"])\n",
        "  # Only single-turn conversations are expected: one human prompt, one gpt reply.\n",
        "  assert len(c) == 2\n",
        "  assert c[0][\"from\"] == \"human\"\n",
        "  assert c[-1][\"from\"] == \"gpt\"\n",
        "  # Avoid shadowing the built-in `input`.\n",
        "  prompt = chat_prefix + user_role + c[0][\"value\"] + chat_suffix\n",
        "  response = c[1][\"value\"]\n",
        "  preprocessed_code_sharegpt_data.append({\"input\": prompt, \"output\": response})\n",
        "\n",
        "print(json.dumps(preprocessed_code_sharegpt_data[0], indent=4))\n",
        "print(len(preprocessed_code_sharegpt_data))\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oZSVAbmWVD1q"
      },
      "outputs": [],
      "source": [
        "\n",
        "def load_custom_data(data):\n",
        "    # convert list of dicts to tfds dataset format\n",
        "    def preprocess(item):\n",
        "        # Convert your item here, e.g., tokenize text\n",
        "        return {\n",
        "            'src': item['input'],  # Assume these are already preprocessed\n",
        "            'dst': item['output'],\n",
        "        }\n",
        "\n",
        "    # Create a Dataset from the list of dictionaries\n",
        "    ds = tf.data.Dataset.from_generator(lambda: (preprocess(item) for item in data),\n",
        "                                        output_types={'src': tf.string, 'dst': tf.string})\n",
        "\n",
        "    # Further dataset operations (batching, padding, etc.) go here\n",
        "    # For example, to batch:\n",
        "    # ds = ds.batch(2)\n",
        "\n",
        "    return ds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NYC42hJgEcoY"
      },
      "source": [
        "### Tokenizer\n",
        "\n",
        "Let's start by loading the model's vocabulary into a tokenizer built with the [SentencePiece](https://github.com/google/sentencepiece) library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "TpyG5YW1EcoY"
      },
      "outputs": [],
      "source": [
        "vocab = spm.SentencePieceProcessor()\n",
        "vocab.Load(str(vocab_path))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ab2MSf-qEcoY"
      },
      "source": [
        "Let's customize `SentencePieceProcessor` for our English-to-French translation task. Since we're fine-tuning the English-only Griffin 2B model, we need a few adjustments:\n",
        "\n",
        "- **Input prefix**: Adding a common prefix to each input signals the translation task. For example, we could use a prompt like `Translate this into French: [INPUT_SENTENCE]`.\n",
        "\n",
        "- **Translation start suffix**: Adding a suffix at the end of each prompt tells the model exactly when to begin the translation. A newline should do the job.\n",
        "\n",
        "- **LM Tokens**: Griffin models expect a *beginning of sequence* token at the beginning of each sequence. Similarly, we need to add an *end of sequence* token at the end of each training example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "L9cjK0uxEcoY"
      },
      "outputs": [],
      "source": [
        "class GriffinTokenizer:\n",
        "  \"\"\"Custom wrapper around a SentencePieceProcessor for TensorFlow.\"\"\"\n",
        "\n",
        "  def __init__(self, spm_processor: spm.SentencePieceProcessor):\n",
        "    self._spm_processor = spm_processor\n",
        "\n",
        "  @property\n",
        "  def pad_id(self) -> int:\n",
        "    \"\"\"Fast access to the pad id.\"\"\"\n",
        "    return self._spm_processor.pad_id()\n",
        "\n",
        "  def tokenize(\n",
        "      self,\n",
        "      example: str | bytes,\n",
        "      prefix: str = '',\n",
        "      suffix: str = '',\n",
        "      add_eos: bool = True,\n",
        "  ) -> jax.Array:\n",
        "    \"\"\"\n",
        "    Tokenization function.\n",
        "\n",
        "    Args:\n",
        "      example: input string to tokenize.\n",
        "      prefix:  prefix to add to the input string.\n",
        "      suffix:  suffix to add to the input string.\n",
        "      add_eos: if True, add an end of sentence token at the end of the output\n",
        "               sequence.\n",
        "    Returns:\n",
        "      Tokens corresponding to the input string.\n",
        "    \"\"\"\n",
        "    # Every sequence starts with the beginning-of-sequence token.\n",
        "    int_list = [self._spm_processor.bos_id()]\n",
        "    int_list.extend(self._spm_processor.EncodeAsIds(prefix + example + suffix))\n",
        "    if add_eos:\n",
        "      int_list.append(self._spm_processor.eos_id())\n",
        "\n",
        "    return jnp.array(int_list, dtype=jnp.int32)\n",
        "\n",
        "  def tokenize_tf_op(\n",
        "      self,\n",
        "      str_tensor: tf.Tensor,\n",
        "      prefix: str = '',\n",
        "      suffix: str = '',\n",
        "      add_eos: bool = True,\n",
        "  ) -> tf.Tensor:\n",
        "    \"\"\"TensorFlow operator for the tokenize function.\"\"\"\n",
        "    encoded = tf.numpy_function(\n",
        "        self.tokenize,\n",
        "        [str_tensor, prefix, suffix, add_eos],\n",
        "        tf.int32)\n",
        "    # The token sequence is 1-D with a length that varies per example.\n",
        "    encoded.set_shape([None])\n",
        "    return encoded\n",
        "\n",
        "  def to_string(self, tokens: jax.Array) -> str:\n",
        "    \"\"\"Convert an array of tokens to a string.\"\"\"\n",
        "    # Decoding (ids -> text) uses DecodeIds; SentencePiece has no EncodeIds method.\n",
        "    return self._spm_processor.DecodeIds(tokens.tolist())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6xuCVkurEcoY"
      },
      "source": [
        "Now let's try our custom tokenizer on the MTNT dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "xEA-97ioEcoY"
      },
      "outputs": [],
      "source": [
        "def tokenize_source(tokenizer, example: tf.Tensor):\n",
        "  return tokenizer.tokenize_tf_op(\n",
        "      example,\n",
        "      prefix='',\n",
        "      suffix='\\n<start_of_turn>model\\n',\n",
        "      add_eos=False\n",
        "  )\n",
        "def tokenize_destination(tokenizer, example: tf.Tensor):\n",
        "  return tokenizer.tokenize_tf_op(example, add_eos=True)\n",
        "\n",
        "tokenizer = GriffinTokenizer(vocab)\n",
        "# ds = tfds.load(\"mtnt/en-fr\",split=\"train\")\n",
        "\n",
        "# ds = ds.take(2)\n",
        "# for d in ds:\n",
        "#   print(d)\n",
        "\n",
        "ds = load_custom_data(preprocessed_code_sharegpt_data[:2])\n",
        "print(ds)\n",
        "ds = ds.map(lambda x: {\n",
        "    'input': tokenize_source(tokenizer, x['src']),\n",
        "    'output': tokenize_destination(tokenizer, x['dst'])\n",
        "  })\n",
        "ds = ds.as_numpy_iterator()\n",
        "for idx, example in enumerate(ds):\n",
        "  print(f'Example {idx}:')\n",
        "  for key, val in example.items():\n",
        "    print(f'{key}: {val}')\n",
        "  print()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r-x0aTugEcoY"
      },
      "source": [
        "### Data loader\n",
        "\n",
        "We can now wrap everything a build our data loader."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "XwFFs2mDEcoY"
      },
      "outputs": [],
      "source": [
        "@chex.dataclass(frozen=True)\n",
        "class TrainingInput:\n",
        "  # Input tokens given to the model\n",
        "  input_tokens: jax.Array\n",
        "\n",
        "  # A mask that determines which tokens contribute to the target loss\n",
        "  # calculation.\n",
        "  target_mask: jax.Array\n",
        "\n",
        "class DatasetSplit(enum.Enum):\n",
        "  TRAIN = 'train'\n",
        "  VALIDATION = 'valid'\n",
        "\n",
        "\n",
        "class MyDatasetBuilder:\n",
        "  \"\"\"Data loader for the MTNT dataset.\"\"\"\n",
        "\n",
        "  N_ITEMS = {DatasetSplit.TRAIN: 2000, DatasetSplit.VALIDATION: 100}\n",
        "\n",
        "  BUFFER_SIZE_SHUFFLE = 1000\n",
        "  TRANSLATION_PREFIX = ''\n",
        "  TRANSLATION_SUFFIX = '\\n<start_of_turn>model\\n'\n",
        "\n",
        "  def __init__(self,\n",
        "               tokenizer : GriffinTokenizer,\n",
        "               max_seq_len: int):\n",
        "    \"\"\"Constructor.\n",
        "\n",
        "    Args:\n",
        "      tokenizer: Gemma tokenizer to use.\n",
        "      max_seq_len: size of each sequence in a given batch.\n",
        "    \"\"\"\n",
        "    self._tokenizer = tokenizer\n",
        "    self._base_data = {\n",
        "        DatasetSplit.TRAIN: load_custom_data(preprocessed_code_sharegpt_data[:2000]),\n",
        "        DatasetSplit.VALIDATION: load_custom_data(preprocessed_code_sharegpt_data[-100:]),\n",
        "    }\n",
        "    self._max_seq_len = max_seq_len\n",
        "\n",
        "  def _tokenize_source(self, example: tf.Tensor):\n",
        "    \"\"\"Tokenization function for the source.\"\"\"\n",
        "    return self._tokenizer.tokenize_tf_op(\n",
        "        example, prefix=self.TRANSLATION_PREFIX, suffix=self.TRANSLATION_SUFFIX,\n",
        "        add_eos=False\n",
        "    )\n",
        "\n",
        "  def _tokenize_destination(self, example: tf.Tensor):\n",
        "    \"\"\"Tokenization function for the French translation.\"\"\"\n",
        "    return self._tokenizer.tokenize_tf_op(example, add_eos=True)\n",
        "\n",
        "  def _pad_up_to_max_len(self,\n",
        "                         input_tensor: tf.Tensor,\n",
        "                         pad_value: int | bool,\n",
        "                         ) -> tf.Tensor:\n",
        "    \"\"\"Pad the given tensor up to sequence length of a batch.\"\"\"\n",
        "    seq_len = tf.shape(input_tensor)[0]\n",
        "    to_pad = tf.maximum(self._max_seq_len - seq_len, 0)\n",
        "    return tf.pad(\n",
        "        input_tensor, [[0, to_pad]], mode='CONSTANT', constant_values=pad_value,\n",
        "    )\n",
        "\n",
        "  def _to_training_input(\n",
        "      self,\n",
        "      src_tokens: jax.Array,\n",
        "      dst_tokens: jax.Array,\n",
        "  ) -> TrainingInput:\n",
        "    \"\"\"Build a training input from a tuple of source and destination tokens.\"\"\"\n",
        "\n",
        "    # The input sequence fed to the model is simply the concatenation of the\n",
        "    # source and the destination.\n",
        "    tokens = tf.concat([src_tokens, dst_tokens], axis=0)\n",
        "\n",
        "    # We want to prevent the model from updating based on the source (input)\n",
        "    # tokens. To achieve this, we add a target mask to each input.\n",
        "    q_mask = tf.zeros_like(src_tokens, dtype=tf.bool)\n",
        "    a_mask = tf.ones_like(dst_tokens, dtype=tf.bool)\n",
        "    mask = tf.concat([q_mask, a_mask], axis=0)\n",
        "\n",
        "    # If the output tokens sequence is smaller than the target sequence size,\n",
        "    # then we pad it with pad tokens.\n",
        "    tokens = self._pad_up_to_max_len(tokens, self._tokenizer.pad_id)\n",
        "\n",
        "    # We don't want to perform the backward on the pad tokens.\n",
        "    mask = self._pad_up_to_max_len(mask, False)\n",
        "\n",
        "    return TrainingInput(input_tokens=tokens, target_mask=mask)\n",
        "\n",
        "\n",
        "  def get_train_dataset(self, batch_size: int, num_epochs: int):\n",
        "    \"\"\"Build the training dataset.\"\"\"\n",
        "\n",
        "    # Tokenize each sample\n",
        "    ds = self._base_data[DatasetSplit.TRAIN].map(\n",
        "        lambda x : (self._tokenize_source(x['src']),\n",
        "                    self._tokenize_destination(x['dst']))\n",
        "    )\n",
        "    print(ds)\n",
        "\n",
        "    # Convert them to training inputs\n",
        "    ds = ds.map(lambda x, y: self._to_training_input(x, y))\n",
        "\n",
        "    # Remove the samples which are too long\n",
        "    ds = ds.filter(lambda x: tf.shape(x.input_tokens)[0] <= self._max_seq_len)\n",
        "\n",
        "    # Shuffle the dataset\n",
        "    ds = ds.shuffle(buffer_size=self.BUFFER_SIZE_SHUFFLE)\n",
        "\n",
        "    # Repeat if necessary\n",
        "    ds = ds.repeat(num_epochs)\n",
        "\n",
        "    # Build batches\n",
        "    ds = ds.batch(batch_size, drop_remainder=True)\n",
        "    return ds\n",
        "\n",
        "  def get_validation_dataset(self, batch_size: int):\n",
        "    \"\"\"Build the validation dataset.\"\"\"\n",
        "\n",
        "    # Same as the training dataset, but no shuffling and no repetition\n",
        "    ds = self._base_data[DatasetSplit.VALIDATION].map(\n",
        "        lambda x : (self._tokenize_source(x['src']),\n",
        "                    self._tokenize_destination(x['dst']))\n",
        "    )\n",
        "    ds = ds.map(lambda x, y: self._to_training_input(x, y))\n",
        "    ds = ds.filter(lambda x: tf.shape(x.input_tokens)[0] <= self._max_seq_len)\n",
        "    ds = ds.batch(batch_size, drop_remainder=True)\n",
        "    return ds"
      ]
    },
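    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the masking and padding logic of `_to_training_input` concrete, here is a minimal NumPy sketch on toy data. The token ids, `PAD_ID` and `MAX_SEQ_LEN` are made-up values for illustration (assumptions), not output of the real tokenizer. Note that both the source and the destination start with a *beginning of sequence* token, because `tokenize` always prepends one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# Toy values chosen for illustration only (assumptions, not real tokenizer output).\n",
        "PAD_ID = 0\n",
        "MAX_SEQ_LEN = 10\n",
        "src_tokens = np.array([2, 11, 12, 13])  # BOS + prompt tokens, no EOS\n",
        "dst_tokens = np.array([2, 21, 22, 1])   # BOS + answer tokens + EOS\n",
        "\n",
        "# Concatenate source and destination, as in `_to_training_input`.\n",
        "tokens = np.concatenate([src_tokens, dst_tokens])\n",
        "\n",
        "# Only the destination tokens contribute to the loss.\n",
        "mask = np.concatenate([\n",
        "    np.zeros_like(src_tokens, dtype=bool),\n",
        "    np.ones_like(dst_tokens, dtype=bool),\n",
        "])\n",
        "\n",
        "# Pad both arrays on the right up to MAX_SEQ_LEN.\n",
        "to_pad = max(MAX_SEQ_LEN - tokens.shape[0], 0)\n",
        "tokens = np.pad(tokens, (0, to_pad), constant_values=PAD_ID)\n",
        "mask = np.pad(mask, (0, to_pad), constant_values=False)\n",
        "\n",
        "print('tokens:', tokens)\n",
        "print('mask:  ', mask.astype(int))"
      ]
    },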
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m-BHqBGBVlei"
      },
      "source": [
        "# backup dataset class"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "daHyZFztVkkE"
      },
      "outputs": [],
      "source": [
        "class MTNTDatasetBuilder:\n",
        "  \"\"\"Data loader for the MTNT dataset.\"\"\"\n",
        "\n",
        "  N_ITEMS = {DatasetSplit.TRAIN: 35_692, DatasetSplit.VALIDATION: 811}\n",
        "\n",
        "  BUFFER_SIZE_SHUFFLE = 10_000\n",
        "  TRANSLATION_PREFIX = 'Translate this into French:\\n'\n",
        "  TRANSLATION_SUFFIX = '\\n'\n",
        "\n",
        "  def __init__(self,\n",
        "               tokenizer : GriffinTokenizer,\n",
        "               max_seq_len: int):\n",
        "    \"\"\"Constructor.\n",
        "\n",
        "    Args:\n",
        "      tokenizer: Gemma tokenizer to use.\n",
        "      max_seq_len: size of each sequence in a given batch.\n",
        "    \"\"\"\n",
        "    self._tokenizer = tokenizer\n",
        "    self._base_data = {\n",
        "        DatasetSplit.TRAIN: tfds.load(\"mtnt/en-fr\",split=\"train\"),\n",
        "        DatasetSplit.VALIDATION: tfds.load(\"mtnt/en-fr\",split=\"valid\"),\n",
        "    }\n",
        "    self._max_seq_len = max_seq_len\n",
        "\n",
        "  def _tokenize_source(self, example: tf.Tensor):\n",
        "    \"\"\"Tokenization function for the source.\"\"\"\n",
        "    return self._tokenizer.tokenize_tf_op(\n",
        "        example, prefix=self.TRANSLATION_PREFIX, suffix=self.TRANSLATION_SUFFIX,\n",
        "        add_eos=False\n",
        "    )\n",
        "\n",
        "  def _tokenize_destination(self, example: tf.Tensor):\n",
        "    \"\"\"Tokenization function for the French translation.\"\"\"\n",
        "    return self._tokenizer.tokenize_tf_op(example, add_eos=True)\n",
        "\n",
        "  def _pad_up_to_max_len(self,\n",
        "                         input_tensor: tf.Tensor,\n",
        "                         pad_value: int | bool,\n",
        "                         ) -> tf.Tensor:\n",
        "    \"\"\"Pad the given tensor up to sequence length of a batch.\"\"\"\n",
        "    seq_len = tf.shape(input_tensor)[0]\n",
        "    to_pad = tf.maximum(self._max_seq_len - seq_len, 0)\n",
        "    return tf.pad(\n",
        "        input_tensor, [[0, to_pad]], mode='CONSTANT', constant_values=pad_value,\n",
        "    )\n",
        "\n",
        "  def _to_training_input(\n",
        "      self,\n",
        "      src_tokens: jax.Array,\n",
        "      dst_tokens: jax.Array,\n",
        "  ) -> TrainingInput:\n",
        "    \"\"\"Build a training input from a tuple of source and destination tokens.\"\"\"\n",
        "\n",
        "    # The input sequence fed to the model is simply the concatenation of the\n",
        "    # source and the destination.\n",
        "    tokens = tf.concat([src_tokens, dst_tokens], axis=0)\n",
        "\n",
        "    # We want to prevent the model from updating based on the source (input)\n",
        "    # tokens. To achieve this, we add a target mask to each input.\n",
        "    q_mask = tf.zeros_like(src_tokens, dtype=tf.bool)\n",
        "    a_mask = tf.ones_like(dst_tokens, dtype=tf.bool)\n",
        "    mask = tf.concat([q_mask, a_mask], axis=0)\n",
        "\n",
        "    # If the output tokens sequence is smaller than the target sequence size,\n",
        "    # then we pad it with pad tokens.\n",
        "    tokens = self._pad_up_to_max_len(tokens, self._tokenizer.pad_id)\n",
        "\n",
        "    # We don't want to perform the backward on the pad tokens.\n",
        "    mask = self._pad_up_to_max_len(mask, False)\n",
        "\n",
        "    return TrainingInput(input_tokens=tokens, target_mask=mask)\n",
        "\n",
        "\n",
        "  def get_train_dataset(self, batch_size: int, num_epochs: int):\n",
        "    \"\"\"Build the training dataset.\"\"\"\n",
        "\n",
        "    # Tokenize each sample\n",
        "    ds = self._base_data[DatasetSplit.TRAIN].map(\n",
        "        lambda x : (self._tokenize_source(x['src']),\n",
        "                    self._tokenize_destination(x['dst']))\n",
        "    )\n",
        "\n",
        "    # Convert them to training inputs\n",
        "    ds = ds.map(lambda x, y: self._to_training_input(x, y))\n",
        "\n",
        "    # Remove the samples which are too long\n",
        "    ds = ds.filter(lambda x: tf.shape(x.input_tokens)[0] <= self._max_seq_len)\n",
        "\n",
        "    # Shuffle the dataset\n",
        "    ds = ds.shuffle(buffer_size=self.BUFFER_SIZE_SHUFFLE)\n",
        "\n",
        "    # Repeat if necessary\n",
        "    ds = ds.repeat(num_epochs)\n",
        "\n",
        "    # Build batches\n",
        "    ds = ds.batch(batch_size, drop_remainder=True)\n",
        "    return ds\n",
        "\n",
        "  def get_validation_dataset(self, batch_size: int):\n",
        "    \"\"\"Build the validation dataset.\"\"\"\n",
        "\n",
        "    # Same as the training dataset, but no shuffling and no repetition\n",
        "    ds = self._base_data[DatasetSplit.VALIDATION].map(\n",
        "        lambda x : (self._tokenize_source(x['src']),\n",
        "                    self._tokenize_destination(x['dst']))\n",
        "    )\n",
        "    ds = ds.map(lambda x, y: self._to_training_input(x, y))\n",
        "    ds = ds.filter(lambda x: tf.shape(x.input_tokens)[0] <= self._max_seq_len)\n",
        "    ds = ds.batch(batch_size, drop_remainder=True)\n",
        "    return ds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WsOYxL8XXSqf"
      },
      "source": [
        "# Try"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_Sq9uC15EcoZ"
      },
      "source": [
        "Let's give it a try."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "bYeduOaNEcoZ"
      },
      "outputs": [],
      "source": [
        "dataset_builder = MyDatasetBuilder(tokenizer, max_seq_len=4000)\n",
        "ds = dataset_builder.get_train_dataset(3, 1)\n",
        "ds = ds.take(2)\n",
        "ds = ds.as_numpy_iterator()\n",
        "for idx, example in enumerate(ds):\n",
        "  print(f'Example {idx}:')\n",
        "  for key, val in example.items():\n",
        "    print(f'{key}: {val}')\n",
        "  print()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_VsT2o6JEcoZ"
      },
      "source": [
        "## Fine tuning Griffin\n",
        "\n",
        "### Getting started\n",
        "\n",
        "First let's load the model. Use the `griffin_lib.GriffinConfig.from_flax_params_or_variables` function to automatically load the correct configuration from a checkpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "VDlfziQVEcoZ"
      },
      "outputs": [],
      "source": [
        "# Load parameters\n",
        "params =  recurrentgemma.load_parameters(ckpt_path, \"single_device\")\n",
        "config = recurrentgemma.GriffinConfig.from_flax_params_or_variables(params)\n",
        "model = recurrentgemma.Griffin(config)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cGbfx6XVEcoZ"
      },
      "source": [
        "Can our model translate French ? Well let's try it out !"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "jWr6Sea_EcoZ"
      },
      "outputs": [],
      "source": [
        "sampler = recurrentgemma.Sampler(model=model, vocab=vocab, params=params)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "S6937NTjEcoZ"
      },
      "outputs": [],
      "source": [
        "output = sampler(\n",
        "  [\"Develop a Python code snippet that generates an abbreviated version of a given full name.\\nname = 'John Smith'\"],\n",
        "  # number of steps performed when generating\n",
        "  total_generation_steps=300,\n",
        ")\n",
        "print(output.text[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0Z0CXW4REcoZ"
      },
      "source": [
        "As expected, it didn't work. Let's see if we can get better results by fine-tuning."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gxf6gVGCEcoZ"
      },
      "source": [
        "### Model forward and loss function\n",
        "\n",
        "The `Griffin` class inherits from [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/guides/flax_fundamentals/flax_basics.html). It offers two essential methods:\n",
        "\n",
        "- `init`: Initializes the model's parameters.\n",
        "\n",
        "- `apply`: Executes the model's `__call__` function using a given set of parameters.\n",
        "\n",
        "Since are working with pre-trained weights, we won't use the `init` function.\n",
        "\n",
        "With it we can now build the `forward_function` which performs the forward pass and loss computation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "iEcV0XEEEcoZ"
      },
      "outputs": [],
      "source": [
        "def forward_and_loss_fn(\n",
        "    params,\n",
        "    *,\n",
        "    model: recurrentgemma.Griffin,\n",
        "    input_tokens: jax.Array,            # Shape [B, L]\n",
        "    input_mask: jax.Array,              # Shape [B, L]\n",
        "    positions: jax.Array,               # Shape [B, L]\n",
        ") -> jax.Array:\n",
        "  \"\"\"Forward pass and loss function.\n",
        "\n",
        "  Args:\n",
        "    params: model's input parameters.\n",
        "    model: Griffin model to call.\n",
        "    input_tokens: input tokens sequence, shape [B, L].\n",
        "    input_mask: tokens to ignore when computing the loss, shape [B, L].\n",
        "    positions: relative position of each token, shape [B, L].\n",
        "\n",
        "  Returns:\n",
        "    Softmax cross-entropy loss for the next-token prediction task.\n",
        "  \"\"\"\n",
        "  batch_size = input_tokens.shape[0]\n",
        "  # Foward pass on the input data.\n",
        "  # No attention cache is needed here.\n",
        "  # Exclude the last step as it does not appear in the targets.\n",
        "  logits, _ = model.apply(\n",
        "        {\"params\": params},\n",
        "        tokens=input_tokens[:, :-1],\n",
        "        segment_pos=positions[:, :-1],\n",
        "        cache=None,\n",
        "    )\n",
        "\n",
        "  # Similarly, the first token cannot be predicteds.\n",
        "  target_tokens = input_tokens[:, 1:]\n",
        "  target_mask = input_mask[:, 1:]\n",
        "\n",
        "  # Convert the target labels into one-hot encoded vectors.\n",
        "  one_hot = jax.nn.one_hot(target_tokens, logits.shape[-1])\n",
        "\n",
        "  # Don't update on unwanted tokens.\n",
        "  one_hot = one_hot * target_mask.astype(one_hot.dtype)[...,None]\n",
        "\n",
        "  # Normalisation factor.\n",
        "  norm_factor = batch_size * (jnp.sum(target_mask) + 1e-8)\n",
        "\n",
        "  # Return the nll loss.\n",
        "  return -jnp.sum(jax.nn.log_softmax(logits) * one_hot) / norm_factor"
      ]
    },
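    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see the target shift and the masking in isolation, here is a minimal sketch on toy data. The token ids, the vocabulary size and the uniform `logits` (a stand-in for the model output) are assumptions made for illustration. The sketch normalises by the number of unmasked target tokens only, which coincides with the `norm_factor` above for a batch size of 1. With uniform logits over 5 classes, the per-token loss should be close to `log(5)`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import jax\n",
        "import jax.numpy as jnp\n",
        "\n",
        "# Toy batch: one sequence of 4 tokens and a vocabulary of 5 (hypothetical values).\n",
        "toy_tokens = jnp.array([[2, 3, 4, 1]])\n",
        "toy_mask = jnp.array([[False, False, True, True]])\n",
        "\n",
        "# Stand-in for the model output on toy_tokens[:, :-1]: uniform logits, shape [1, 3, 5].\n",
        "toy_logits = jnp.zeros((1, 3, 5))\n",
        "\n",
        "# Targets are the inputs shifted by one position.\n",
        "toy_targets = toy_tokens[:, 1:]\n",
        "toy_target_mask = toy_mask[:, 1:]\n",
        "\n",
        "# Zero out the one-hot rows of the tokens that must not contribute to the loss.\n",
        "toy_one_hot = jax.nn.one_hot(toy_targets, toy_logits.shape[-1])\n",
        "toy_one_hot = toy_one_hot * toy_target_mask.astype(toy_one_hot.dtype)[..., None]\n",
        "\n",
        "# Average negative log-likelihood over the unmasked targets.\n",
        "toy_nll = -jnp.sum(jax.nn.log_softmax(toy_logits) * toy_one_hot)\n",
        "toy_loss = toy_nll / jnp.sum(toy_target_mask)\n",
        "print(toy_loss, jnp.log(5.0))"
      ]
    },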
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xbxYMMWLEcoZ"
      },
      "source": [
        "We can now build the train_step function which performs the backward pass and updates the model's parameters accordingly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "cPSfp7ZUEcoZ"
      },
      "outputs": [],
      "source": [
        "Params = Mapping[str, Any]\n",
        "\n",
        "def get_positions(example: jax.Array, pad_id : int) -> jax.Array:\n",
        "  \"\"\"Builds the position vector from the given tokens.\"\"\"\n",
        "  pad_mask = example != pad_id\n",
        "  positions = jnp.cumsum(pad_mask, axis=-1)\n",
        "  # Subtract one for all positions from the first valid one as they are\n",
        "  # 0-indexed\n",
        "  positions = positions - (positions >= 1)\n",
        "  return positions\n",
        "\n",
        "@functools.partial(\n",
        "    jax.jit,\n",
        "    static_argnames=['model', 'optimizer'],\n",
        "    donate_argnames=['params', 'opt_state'],\n",
        ")\n",
        "def train_step(\n",
        "    model: recurrentgemma.Griffin,\n",
        "    params: Params,\n",
        "    optimizer: optax.GradientTransformation,\n",
        "    opt_state: optax.OptState,\n",
        "    pad_id: int,\n",
        "    example: TrainingInput,\n",
        ") -> tuple[jax.Array, Params, optax.OptState]:\n",
        "  \"\"\"Train step.\n",
        "\n",
        "  Args:\n",
        "    model: Griffin model.\n",
        "    params: model's input parameters.\n",
        "    optimizer: optax optimizer to use.\n",
        "    opt_state: input optimizer's state.\n",
        "    pad_id: id of the pad token.\n",
        "    example: input batch.\n",
        "\n",
        "  Returns:\n",
        "    Training loss, updated parameters, updated optimizer state.\n",
        "  \"\"\"\n",
        "\n",
        "  positions = get_positions(example.input_tokens, pad_id)\n",
        "\n",
        "  # Forward and backward passes\n",
        "  train_loss, grads = jax.value_and_grad(forward_and_loss_fn)(\n",
        "      params,\n",
        "      model=model,\n",
        "      input_tokens=example.input_tokens,\n",
        "      input_mask=example.target_mask,\n",
        "      positions=positions,\n",
        "  )\n",
        "  # Update the parameters\n",
        "  updates, opt_state = optimizer.update(grads, opt_state, params)\n",
        "  params = optax.apply_updates(params, updates)\n",
        "\n",
        "  return train_loss, params, opt_state"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R2QXp116EcoZ"
      },
      "source": [
        "Similarly, we build a `validation_step` function without backward pass."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "yU4oR92YEcoa"
      },
      "outputs": [],
      "source": [
        "@functools.partial(jax.jit, static_argnames=['model'])\n",
        "def validation_step(\n",
        "    model: recurrentgemma.Griffin,\n",
        "    params: Params,\n",
        "    pad_id: int,\n",
        "    example: TrainingInput,\n",
        ") -> jax.Array:\n",
        "  return forward_and_loss_fn(\n",
        "      params,\n",
        "      model=model,\n",
        "      input_tokens=example.input_tokens,\n",
        "      input_mask=example.target_mask,\n",
        "      positions=get_positions(example.input_tokens, pad_id),\n",
        "  )"
      ]
    },
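    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick illustration of what `get_positions` computes, the sketch below repeats its three lines on a toy padded sequence. The token ids and `TOY_PAD_ID` are assumptions for this example. Valid tokens get 0-indexed positions, and the padding tokens repeat the last valid position; those tokens are excluded from the loss by the target mask anyway."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import jax.numpy as jnp\n",
        "\n",
        "TOY_PAD_ID = 0  # assumed pad id for this toy example\n",
        "toy_example = jnp.array([[2, 5, 7, TOY_PAD_ID, TOY_PAD_ID]])\n",
        "\n",
        "# Same computation as in `get_positions`.\n",
        "toy_pad_mask = toy_example != TOY_PAD_ID\n",
        "toy_positions = jnp.cumsum(toy_pad_mask, axis=-1)\n",
        "toy_positions = toy_positions - (toy_positions >= 1)\n",
        "\n",
        "print(toy_positions)  # [[0 1 2 2 2]]"
      ]
    },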
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6g6LFWJbEcoa"
      },
      "source": [
        "And now the training loop itself."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "xT4bAqNLEcoa"
      },
      "outputs": [],
      "source": [
        "def train_loop(\n",
        "    model: recurrentgemma.Griffin,\n",
        "    params: Params,\n",
        "    optimizer: optax.GradientTransformation,\n",
        "    train_ds: Iterator[TrainingInput],\n",
        "    validation_ds: Iterator[TrainingInput],\n",
        "    num_steps: int | None = None,\n",
        "    eval_every_n: int = 20,\n",
        "):\n",
        "  opt_state = jax.jit(optimizer.init)(params)\n",
        "\n",
        "  step_counter = 0\n",
        "  avg_loss=0\n",
        "\n",
        "  # A first round of validation loss\n",
        "  n_steps_eval = 0\n",
        "  eval_loss = 0\n",
        "  for val_example in validation_ds.as_numpy_iterator():\n",
        "    eval_loss += validation_step(\n",
        "        model, params, dataset_builder._tokenizer.pad_id, val_example\n",
        "    )\n",
        "    n_steps_eval += 1\n",
        "  print(f\"Start, validation loss: {eval_loss/n_steps_eval}\")\n",
        "\n",
        "  for train_example in train_ds:\n",
        "    train_loss, params, opt_state = train_step(\n",
        "        model=model,\n",
        "        params=params,\n",
        "        optimizer=optimizer,\n",
        "        opt_state=opt_state,\n",
        "        pad_id=dataset_builder._tokenizer.pad_id,\n",
        "        example=train_example,\n",
        "    )\n",
        "\n",
        "    step_counter += 1\n",
        "    avg_loss += train_loss\n",
        "    if step_counter % eval_every_n == 0:\n",
        "      eval_loss = 0\n",
        "\n",
        "      n_steps_eval = 0\n",
        "      val_iterator = validation_ds.as_numpy_iterator()\n",
        "      for val_example in val_iterator:\n",
        "        eval_loss += validation_step(\n",
        "            model,\n",
        "            params,\n",
        "            dataset_builder._tokenizer.pad_id,\n",
        "            val_example,\n",
        "        )\n",
        "        n_steps_eval +=1\n",
        "      avg_loss /= eval_every_n\n",
        "      eval_loss /= n_steps_eval\n",
        "      print(f\"STEP {step_counter} training loss: {avg_loss} - eval loss: {eval_loss}\")\n",
        "      avg_loss=0\n",
        "    if num_steps is not None and step_counter > num_steps:\n",
        "      break\n",
        "  return params"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hJAuU6P1dGCl"
      },
      "source": [
        "Here you have to choose an optimizer. For devices with smaller memory (like the T4 GPU) we suggest to use SGD as it has a much lower memory footprint. To achieve best finetuning performance we suggest to try Adam-W. We have provided optimal hyper parameters for each optimizer for the particular task in this notebook for the '2b-it' checkpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oMufclhfc-t4"
      },
      "outputs": [],
      "source": [
        "def griffin_weight_decay_mask(params_like: optax.Params) -> Any:\n",
        "  # Don't put weight decay on the RGLRU, the embeddings and any biases\n",
        "  def enable_weight_decay(path: list[Any], _: Any) -> bool:\n",
        "    # Parameters in the LRU and embedder\n",
        "    path = [dict_key.key for dict_key in path]\n",
        "    if 'rg_lru' in path or 'embedder' in path:\n",
        "      return False\n",
        "    # All biases and scales\n",
        "    if path[-1] in ('b', 'scale'):\n",
        "      return False\n",
        "    return True\n",
        "\n",
        "  return jax.tree_util.tree_map_with_path(enable_weight_decay, params_like)\n",
        "\n",
        "optimizer_choice = \"adamw\" #@param [\"sgd\", \"adamw\"]\n",
        "\n",
        "if optimizer_choice == \"sgd\":\n",
        "  optimizer = optax.sgd(learning_rate=1e-3)\n",
        "  num_steps = 300\n",
        "elif optimizer_choice == \"adamw\":\n",
        "  optimizer = optax.adamw(\n",
        "        learning_rate=1e-4,\n",
        "        b2=0.96,\n",
        "        eps=1e-8,\n",
        "        weight_decay=0.1,\n",
        "        mask=griffin_weight_decay_mask,\n",
        "    )\n",
        "  num_steps = 100\n",
        "  pass\n",
        "else:\n",
        "  raise ValueError(f\"Unknown optimizer: {optimizer_choice}\")"
      ]
    },
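    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To check which parameters a mask of this kind selects, the sketch below applies the same rules to a small, hypothetical parameter tree. The key names in `toy_params` only mimic the names the mask looks for; they are not the real Griffin parameter names. Only `block/mlp/w` should be marked `True`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import jax\n",
        "import jax.numpy as jnp\n",
        "\n",
        "# Hypothetical parameter tree, for illustration only.\n",
        "toy_params = {\n",
        "    'embedder': {'input_embedding': jnp.zeros((4, 2))},\n",
        "    'block': {\n",
        "        'rg_lru': {'a_param': jnp.zeros((2,))},\n",
        "        'mlp': {'w': jnp.zeros((2, 2)), 'b': jnp.zeros((2,))},\n",
        "    },\n",
        "}\n",
        "\n",
        "def toy_enable_weight_decay(path, _):\n",
        "  # Each path entry of a dict-based tree is a DictKey exposing `.key`.\n",
        "  keys = [dict_key.key for dict_key in path]\n",
        "  if 'rg_lru' in keys or 'embedder' in keys:\n",
        "    return False\n",
        "  return keys[-1] not in ('b', 'scale')\n",
        "\n",
        "toy_mask = jax.tree_util.tree_map_with_path(toy_enable_weight_decay, toy_params)\n",
        "print(toy_mask)"
      ]
    },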
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3tSwzfRdfJ_W"
      },
      "source": [
        "Finally we prepare the training and validation datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0KFz-9OcfM9-"
      },
      "outputs": [],
      "source": [
        "# Small seq size so that everything fits in memory\n",
        "num_epochs = 1 #@param {type: \"integer\"}\n",
        "batch_size = 1 #@param {type: \"integer\"}\n",
        "sequence_length = 4000 #@param {type: \"integer\"}\n",
        "\n",
        "# Make the dataset builder\n",
        "tokenizer = GriffinTokenizer(vocab)\n",
        "dataset_builder= MTNTDatasetBuilder(tokenizer, sequence_length + 1)\n",
        "\n",
        "# Build the training dataset\n",
        "train_ds = dataset_builder.get_train_dataset(\n",
        "    batch_size=batch_size,\n",
        "    num_epochs=num_epochs,\n",
        ").as_numpy_iterator()\n",
        "\n",
        "# Build the validation dataset, with a limited number of samples for this demo\n",
        "validation_ds = dataset_builder.get_validation_dataset(\n",
        "    batch_size=batch_size,\n",
        ").take(50)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "muwkf_ZgEcoa"
      },
      "source": [
        "We can now fine-tune our model on a limited number of steps."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vyuWnFY5wSlW"
      },
      "outputs": [],
      "source": [
        "trained_params = train_loop(\n",
        "    model=model,\n",
        "    params=params,\n",
        "    optimizer=optimizer,\n",
        "    train_ds=train_ds,\n",
        "    validation_ds=validation_ds,\n",
        "    num_steps=num_steps,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "abChlybFEcod"
      },
      "source": [
        "Both the training loss and the validation's are going down. But is it working ?\n",
        "\n",
        "Let's try again with our previous example. To ensure our input matches the training format, remember to use the prefix 'Translate this into French:\\n'  and a newline character at the end. This signals the model to begin translation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "S5F3fk22Ecod"
      },
      "outputs": [],
      "source": [
        "sampler.params = trained_params\n",
        "output = sampler(\n",
        "    [\"Translate this into French:\\nHello, my name is Morgane.\\n\"],\n",
        "    total_generation_steps=30,\n",
        ")\n",
        "print(output.text[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FdSF-xoChOPD"
      },
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [
        "iLafhtv3Rg5F",
        "m-BHqBGBVlei"
      ],
      "gpuType": "A100",
      "private_outputs": true,
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}