{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Welcome!\n", "to the repo for\n", "\n", "*Learning the Legibility of Visual Text Perturbations* (EACL 2023)\n", "\n", "by Dev Seth, Rickard Stureborg, Danish Pruthi and Bhuwan Dhingra" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A `LEGIT` Introduction\n", "This notebook is a starting point for working with the datasets and models presented in the Learning Legibility paper.\n", "\n", "All assets are hosted on the Hugging Face Hub and can be used with the `transformers` and `datasets` libraries:\n", " - TrOCR-MT Model: https://huggingface.co/dvsth/LEGIT-TrOCR-MT\n", " - LEGIT Dataset: https://huggingface.co/datasets/dvsth/LEGIT\n", " - Perturbed Jigsaw Dataset: https://huggingface.co/datasets/dvsth/LEGIT-VIPER-Jigsaw-Toxic-Comment-Perturbed" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**For an interactive preview of the perturbation process and legibility assessment model, run `demo.py` using the command `python demo.py` (this opens a browser-based interface). 
The demo allows you to perturb a word with your chosen attack parameters, then see the model's legibility estimate for the generated perturbations.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# external imports -- use pip or conda to install these packages\n", "import torch\n", "from transformers import TrOCRProcessor, AutoModel, TrainingArguments\n", "from datasets import load_dataset\n", "\n", "# local imports\n", "from classes.LegibilityModel import LegibilityModel\n", "from classes.Trainer import MultiTaskTrainer\n", "from classes.Metrics import binary_classification_metric, ranking_metric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading the Model and Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# load the model schema and pretrained weights\n", "# (this may take some time to download)\n", "model = AutoModel.from_pretrained(\"dvsth/LEGIT-TrOCR-MT\", revision='main', trust_remote_code=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interactive dataset preview available [here](https://huggingface.co/datasets/dvsth/LEGIT/viewer/dvsth--LEGIT/test)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using custom data configuration dvsth--LEGIT-d84a4d72774d3652\n", "Found cached dataset parquet (/Users/dvsth/.cache/huggingface/datasets/dvsth___parquet/dvsth--LEGIT-d84a4d72774d3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "22f8a468229a4760bd2829ef894a5472", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00