{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP-Powered GIF Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the [Tumblr GIF Description Dataset](http://raingo.github.io/TGIF-Release/), which contains over 100K animated GIFs and 120K sentences describing their visual content. By combining this data with a *vector database* and a *retriever*, we can build an NLP-powered GIF search tool.\n", "\n", "A few packages must be installed for this notebook to run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pip install -U pandas pinecone-client sentence-transformers tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also configure the notebook to display the result of every expression in a cell, not just the last one, which lets us display the GIF images we will be working with." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "r8KN-iWWdwby" }, "outputs": [], "source": [ "from IPython.display import HTML\n", "from IPython.core.interactiveshell import InteractiveShell\n", "# Show the output of every top-level expression in a cell, not only the last one\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": { "id": "KFIZrga-6Jq_" }, "source": [ "## Download and Extract Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's download and extract the dataset, which is available [here](https://github.com/raingo/TGIF-Release) on GitHub. We can download it directly with the link below, either from the notebook or by opening the link in a browser." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZD4gusO9YB1-", "outputId": "2d69fa61-f67a-45c8-ecc4-4d1c9b06f7cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-08-13 16:40:35-- https://github.com/raingo/TGIF-Release/archive/master.zip\n", "Resolving github.com (github.com)... 
140.82.114.4\n", "Connecting to github.com (github.com)|140.82.114.4|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://codeload.github.com/raingo/TGIF-Release/zip/refs/heads/master [following]\n", "--2022-08-13 16:40:35-- https://codeload.github.com/raingo/TGIF-Release/zip/refs/heads/master\n", "Resolving codeload.github.com (codeload.github.com)... 140.82.114.10\n", "Connecting to codeload.github.com (codeload.github.com)|140.82.114.10|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: unspecified [application/zip]\n", "Saving to: ‘master.zip’\n", "\n", "master.zip [ <=> ] 11.82M 6.59MB/s in 1.8s \n", "\n", "2022-08-13 16:40:37 (6.59 MB/s) - ‘master.zip’ saved [12396861]\n", "\n" ] } ], "source": [ "# Use wget to download the master.zip file which contains the dataset\n", "!wget https://github.com/raingo/TGIF-Release/archive/master.zip" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "qLvXp0RtYTTz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: master.zip\n", "3e54d2f71418d8a2e9f5f61aa5be0edb9c0ac2b8\n", " creating: TGIF-Release-master/\n", " inflating: TGIF-Release-master/.gitignore \n", " inflating: TGIF-Release-master/.gitmodules \n", " inflating: TGIF-Release-master/LICENSE \n", " inflating: TGIF-Release-master/README.md \n", " creating: TGIF-Release-master/code/\n", " inflating: TGIF-Release-master/code/README.md \n", " creating: TGIF-Release-master/code/crowdflower/\n", " extracting: TGIF-Release-master/code/crowdflower/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/.gitmodules \n", " inflating: TGIF-Release-master/code/crowdflower/README.md \n", " creating: TGIF-Release-master/code/crowdflower/back-end/\n", " inflating: TGIF-Release-master/code/crowdflower/back-end/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/deploy.sh \n", " inflating: 
TGIF-Release-master/code/crowdflower/back-end/entity_extract.py \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/eval.py \n", " creating: TGIF-Release-master/code/crowdflower/back-end/logs/\n", " extracting: TGIF-Release-master/code/crowdflower/back-end/logs/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/logs/logging.conf \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/requirements.txt \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/routes.py \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/start_nlp_sever.sh \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/swear-words \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/test-data.sorted \n", " creating: TGIF-Release-master/code/crowdflower/front-end/\n", " extracting: TGIF-Release-master/code/crowdflower/front-end/.gitignore \n", " creating: TGIF-Release-master/code/crowdflower/front-end/data/\n", " extracting: TGIF-Release-master/code/crowdflower/front-end/data/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/gen_test_cases.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/notify.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/parse-res.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/pipeline.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-judgments.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-rest.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-test.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/set-diff.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/shuffle-test.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/update-test.py \n", " creating: TGIF-Release-master/code/crowdflower/front-end/layout/\n", " extracting: 
TGIF-Release-master/code/crowdflower/front-end/layout/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/forgive.js \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/instructions.md \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/main.html \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/main.js \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/view.css \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/\n", " extracting: TGIF-Release-master/code/crowdflower/table3-rating/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/requirements.txt \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/routes.py \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/static/\n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/static/view.css \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/templates/\n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/templates/_formhelper.html \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/templates/submit.html \n", " creating: TGIF-Release-master/code/gif2txt-lstm/\n", " inflating: TGIF-Release-master/code/gif2txt-lstm/README.md \n", " inflating: TGIF-Release-master/code/gif2txt-lstm/caffe-rnn.patch \n", " creating: TGIF-Release-master/code/gif2txt-lstm/models/\n", " inflating: TGIF-Release-master/code/gif2txt-lstm/models/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/\n", " extracting: TGIF-Release-master/code/gifs-filter/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/.gitmodules \n", " inflating: TGIF-Release-master/code/gifs-filter/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/adult-filter/\n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/README.md \n", " inflating: 
TGIF-Release-master/code/gifs-filter/adult-filter/filter.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/gen-raw.sh \n", " creating: TGIF-Release-master/code/gifs-filter/adult-filter/keywords/\n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/keywords/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/parse-tsv.py \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/cluster-by-tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/dump-tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter-images.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter-text.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter_tags.py \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/giftypes/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/giftypes/c3d-models-rfc.pkl \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/no-motion/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/no-motion/c3d-models-rfc.pkl \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/predict.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/rank_tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/setdiff.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/setinter.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/tag_rules \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/train.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d.patch \n", " creating: TGIF-Release-master/code/gifs-filter/c3d/\n", " extracting: TGIF-Release-master/code/gifs-filter/c3d/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/README.md 
\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/agg_feat.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/build_deploy.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/c3d.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/deploy.prototxt.in \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/\n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/agg-hash.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/build.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/cluster-pairs.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dedup-v2.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dedup.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dump-nd.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/extract_hash.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/extract_mhhash.cpp \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/filter-cluster.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/match-hash.sh \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/mih/\n", " extracting: TGIF-Release-master/code/gifs-filter/dedup/mih/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/pHash-0.9.6/\n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/pHash-0.9.6/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/store-hash.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/email_notify.py \n", " inflating: TGIF-Release-master/code/gifs-filter/full.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/gen_set.py \n", " inflating: TGIF-Release-master/code/gifs-filter/monitor-api.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/monitor.sh \n", " inflating: 
TGIF-Release-master/code/gifs-filter/pipeline.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/prepare-data.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/requirements.txt \n", " inflating: TGIF-Release-master/code/gifs-filter/review-CF.py \n", " inflating: TGIF-Release-master/code/gifs-filter/split-batches.sh \n", " creating: TGIF-Release-master/code/gifs-filter/test/\n", " extracting: TGIF-Release-master/code/gifs-filter/test/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/test/gif.urls \n", " creating: TGIF-Release-master/code/gifs-filter/text-score/\n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/Makefile \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/debug.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/filter.hpp \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/group_area.hpp \n", " creating: TGIF-Release-master/code/gifs-filter/text-score/test/\n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/benchmark.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/neg.urls \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/pos.urls \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/text-score.cpp \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/text-score.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/textdetection.cpp \n", " creating: TGIF-Release-master/data/\n", " extracting: TGIF-Release-master/data/.gitignore \n", " creating: TGIF-Release-master/data/GIF2Movie/\n", " inflating: TGIF-Release-master/data/GIF2Movie/M-VAD.tsv \n", " inflating: TGIF-Release-master/data/GIF2Movie/MPII-MD.tsv \n", " inflating: 
TGIF-Release-master/data/README.md \n", " creating: TGIF-Release-master/data/coco-caption/\n", " inflating: TGIF-Release-master/data/eval.py \n", " inflating: TGIF-Release-master/data/results-lstm-cnn-finetune-cvpr16.tsv \n", " creating: TGIF-Release-master/data/splits/\n", " extracting: TGIF-Release-master/data/splits/.gitignore \n", " inflating: TGIF-Release-master/data/splits/test.txt \n", " inflating: TGIF-Release-master/data/splits/train.txt \n", " inflating: TGIF-Release-master/data/splits/val.txt \n", " inflating: TGIF-Release-master/data/tgif-v1.0.tsv \n", " creating: TGIF-Release-master/docs/\n", " creating: TGIF-Release-master/docs/_includes/\n", " inflating: TGIF-Release-master/docs/_includes/authors.html \n", " inflating: TGIF-Release-master/docs/_includes/download.html \n", " inflating: TGIF-Release-master/docs/_includes/examples.html \n", " inflating: TGIF-Release-master/docs/_includes/footer.html \n", " inflating: TGIF-Release-master/docs/_includes/head.html \n", " inflating: TGIF-Release-master/docs/_includes/header.html \n", " inflating: TGIF-Release-master/docs/_includes/nav.html \n", " inflating: TGIF-Release-master/docs/_includes/overview.html \n", " creating: TGIF-Release-master/docs/_layouts/\n", " inflating: TGIF-Release-master/docs/_layouts/default.html \n", " creating: TGIF-Release-master/docs/css/\n", " inflating: TGIF-Release-master/docs/css/main.scss \n", " extracting: TGIF-Release-master/docs/index.html \n", " creating: TGIF-Release-master/docs/js/\n", " inflating: TGIF-Release-master/docs/js/main.js \n" ] } ], "source": [ "# Use unzip to extract the master.zip file\n", "!unzip master.zip" ] }, { "cell_type": "markdown", "metadata": { "id": "7agJKFkZ6UGB" }, "source": [ "## Explore the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's explore the downloaded files. The data we want is in the *tgif-v1.0.tsv* file in the *data* folder. We can use the *pandas* library to open the file. We need to set the delimiter to `\\t` because the file contains tab-separated values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "1rwBQ3I2Ye7c" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "K8RvBSYSbvUb" }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>url</th><th>description</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>https://38.media.tumblr.com/9f6c25cc350f12aa74...</td><td>a man is glaring, and someone with sunglasses ...</td></tr>\n",
"<tr><th>1</th><td>https://38.media.tumblr.com/9ead028ef62004ef6a...</td><td>a cat tries to catch a mouse on a tablet</td></tr>\n",
"<tr><th>2</th><td>https://38.media.tumblr.com/9f43dc410be85b1159...</td><td>a man dressed in red is dancing.</td></tr>\n",
"<tr><th>3</th><td>https://38.media.tumblr.com/9f659499c8754e40cf...</td><td>an animal comes close to another in the jungle</td></tr>\n",
"<tr><th>4</th><td>https://38.media.tumblr.com/9ed1c99afa7d714118...</td><td>a man in a hat adjusts his tie and makes a wei...</td></tr>\n",
"</tbody>\n", "</table>\n", "