{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Welcome!\n",
    "to the repo for\n",
    "\n",
    "*Learning the Legibility of Visual Text Perturbations* (EACL 2023)\n",
    "\n",
    "by Dev Seth, Rickard Stureborg, Danish Pruthi and Bhuwan Dhingra"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A `LEGIT` Introduction\n",
    "This notebook provides a helpful starting point to interact with the datasets and models presented in the Learning Legibility paper.\n",
    "\n",
    "All assets are hosted on the HuggingFace Hub and can be used with the `transformers` and `datasets` libraries: \n",
    "  - TrOCR-MT Model: https://huggingface.co/dvsth/LEGIT-TrOCR-MT \n",
    "  - LEGIT Dataset: https://huggingface.co/datasets/dvsth/LEGIT\n",
    "  - Perturbed Jigsaw Dataset: https://huggingface.co/datasets/dvsth/LEGIT-VIPER-Jigsaw-Toxic-Comment-Perturbed"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**For an interactive preview of the perturbation process and legibility assessment model, run `demo.py` using the command `python demo.py` (will open a browser-based interface). The demo allows you to perturb a word with your chosen attack parameters, then see the model's legibility estimate for the generated perturbations.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# external imports -- use pip or conda to install these packages\n",
    "import torch\n",
    "from transformers import TrOCRProcessor, AutoModel, TrainingArguments\n",
    "from datasets import load_dataset\n",
    "\n",
    "# local imports\n",
    "from classes.LegibilityModel import LegibilityModel\n",
    "from classes.Trainer import MultiTaskTrainer\n",
    "from classes.Metrics import binary_classification_metric, ranking_metric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading the Model and Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the model schema and pretrained weights\n",
    "# (this may take some time to download)\n",
    "model = AutoModel.from_pretrained(\"dvsth/LEGIT-TrOCR-MT\", revision='main', trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interactive dataset preview available [here](https://huggingface.co/datasets/dvsth/LEGIT/viewer/dvsth--LEGIT/test)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using custom data configuration dvsth--LEGIT-d84a4d72774d3652\n",
      "Found cached dataset parquet (/Users/dvsth/.cache/huggingface/datasets/dvsth___parquet/dvsth--LEGIT-d84a4d72774d3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "22f8a468229a4760bd2829ef894a5472",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "dataset = load_dataset('dvsth/LEGIT').with_format('torch')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training/Eval Loop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Trainer setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.\n"
     ]
    }
   ],
   "source": [
    "# preprocessor provides image normalization and resizing\n",
    "preprocessor = TrOCRProcessor.from_pretrained(\n",
    "    \"microsoft/trocr-base-handwritten\")\n",
    "\n",
    "# apply preprocessing batch-wise\n",
    "def collate_fn(data):\n",
    "    return {\n",
    "        'choice': torch.tensor([d['choice'].item() for d in data]),\n",
    "        'img0': preprocessor([d['img0'] for d in data], return_tensors='pt')['pixel_values'],\n",
    "        'img1': preprocessor([d['img1'] for d in data], return_tensors='pt')['pixel_values']\n",
    "    }\n",
    "\n",
    "\n",
    "train_args = TrainingArguments(\n",
    "    output_dir=f'runs',             # change this to a unique path for each run, e.g. f'runs/{run_id}'\n",
    "    overwrite_output_dir=True,\n",
    "    num_train_epochs=5,             # we found 3 epochs to be sufficient for convergence on the base models\n",
    "    per_device_train_batch_size=26, # fits on 1 x NVIDIA A6000, 48GB VRAM\n",
    "    per_device_eval_batch_size=26,  # can be increased to 32\n",
    "    gradient_accumulation_steps=2,  # increase this to fit on a smaller GPU\n",
    "    warmup_steps=0,             \n",
    "    weight_decay=0.0,\n",
    "    learning_rate=1e-5,             # we found this to be the best initial learning rate for the base models\n",
    "    save_strategy=\"steps\",\n",
    "    save_steps=200,\n",
    "    eval_steps=200,\n",
    "    evaluation_strategy=\"steps\",\n",
    "    logging_strategy='steps',\n",
    "    logging_steps=50,\n",
    "    fp16=False,                     \n",
    "    load_best_model_at_end=True,    # load the best model at the end of training based on validation F1\n",
    "    metric_for_best_model='f1_score')\n",
    "\n",
    "trainer = MultiTaskTrainer(\n",
    "    model=model,\n",
    "    compute_metrics=binary_classification_metric, # check out metrics.py for a list of metrics\n",
    "    args=train_args,\n",
    "    data_collator=collate_fn,\n",
    "    train_dataset=dataset['train'],\n",
    "    eval_dataset=dataset['valid'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Generate predictions and compute metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parameter 'indices'=range(0, 100) of the transform datasets.arrow_dataset.Dataset.select couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n",
      "The following columns in the test set don't have a corresponding argument in `LegibilityModel.forward` and have been ignored: k, word1, word0, n, model1, word, n1, k1, model0. If k, word1, word0, n, model1, word, n1, k1, model0 are not expected by `LegibilityModel.forward`,  you can safely ignore this message.\n",
      "***** Running Prediction *****\n",
      "  Num examples = 100\n",
      "  Batch size = 26\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1e9f7a59e7624c129d2dae1c85c2d11a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/4 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'test_loss': 0.5344929695129395, 'test_precision': 0.9479166567925349, 'test_recall': 0.8921568539984622, 'test_accuracy': 0.8787878721303949, 'test_f1_score': 0.9191914103665608, 'test_runtime': 47.6671, 'test_samples_per_second': 2.098, 'test_steps_per_second': 0.084}\n"
     ]
    }
   ],
   "source": [
    "predictions = trainer.predict(dataset['test'].select(range(100))) # takes ~1-2 minutes on a laptop CPU\n",
    "print(predictions.metrics)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "a9a33fd02dcd74fd53701f10c0433ded41be0a0f53c9699722a73f690e69c2bc"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}