{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP-Powered GIF Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the [Tumblr GIF Description Dataset](http://raingo.github.io/TGIF-Release/), which contains over 100K animated GIFs and 120K sentences describing their visual content. Using this data with a *vector database* and a *retriever*, we can build an NLP-powered GIF search tool.\n", "\n", "There are a few packages that must be installed for this notebook to run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U pandas pinecone-client sentence-transformers tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We must also set the following notebook option so that every expression in a cell (not just the last one) is displayed. We rely on this to display the GIF images we will be working with." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "r8KN-iWWdwby" }, "outputs": [], "source": [ "from IPython.display import HTML\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": { "id": "KFIZrga-6Jq_" }, "source": [ "## Download and Extract Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First let's download and extract the dataset. The dataset is available [here](https://github.com/raingo/TGIF-Release) on GitHub. We can download it directly with the command below, or open the same link in a browser to download the files." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZD4gusO9YB1-", "outputId": "2d69fa61-f67a-45c8-ecc4-4d1c9b06f7cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-08-13 16:40:35-- https://github.com/raingo/TGIF-Release/archive/master.zip\n", "Resolving github.com (github.com)... 
140.82.114.4\n", "Connecting to github.com (github.com)|140.82.114.4|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://codeload.github.com/raingo/TGIF-Release/zip/refs/heads/master [following]\n", "--2022-08-13 16:40:35-- https://codeload.github.com/raingo/TGIF-Release/zip/refs/heads/master\n", "Resolving codeload.github.com (codeload.github.com)... 140.82.114.10\n", "Connecting to codeload.github.com (codeload.github.com)|140.82.114.10|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: unspecified [application/zip]\n", "Saving to: ‘master.zip’\n", "\n", "master.zip [ <=> ] 11.82M 6.59MB/s in 1.8s \n", "\n", "2022-08-13 16:40:37 (6.59 MB/s) - ‘master.zip’ saved [12396861]\n", "\n" ] } ], "source": [ "# Use wget to download the master.zip file which contains the dataset\n", "!wget https://github.com/raingo/TGIF-Release/archive/master.zip" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "qLvXp0RtYTTz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: master.zip\n", "3e54d2f71418d8a2e9f5f61aa5be0edb9c0ac2b8\n", " creating: TGIF-Release-master/\n", " inflating: TGIF-Release-master/.gitignore \n", " inflating: TGIF-Release-master/.gitmodules \n", " inflating: TGIF-Release-master/LICENSE \n", " inflating: TGIF-Release-master/README.md \n", " creating: TGIF-Release-master/code/\n", " inflating: TGIF-Release-master/code/README.md \n", " creating: TGIF-Release-master/code/crowdflower/\n", " extracting: TGIF-Release-master/code/crowdflower/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/.gitmodules \n", " inflating: TGIF-Release-master/code/crowdflower/README.md \n", " creating: TGIF-Release-master/code/crowdflower/back-end/\n", " inflating: TGIF-Release-master/code/crowdflower/back-end/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/deploy.sh \n", " inflating: 
TGIF-Release-master/code/crowdflower/back-end/entity_extract.py \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/eval.py \n", " creating: TGIF-Release-master/code/crowdflower/back-end/logs/\n", " extracting: TGIF-Release-master/code/crowdflower/back-end/logs/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/logs/logging.conf \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/requirements.txt \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/routes.py \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/start_nlp_sever.sh \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/swear-words \n", " inflating: TGIF-Release-master/code/crowdflower/back-end/test-data.sorted \n", " creating: TGIF-Release-master/code/crowdflower/front-end/\n", " extracting: TGIF-Release-master/code/crowdflower/front-end/.gitignore \n", " creating: TGIF-Release-master/code/crowdflower/front-end/data/\n", " extracting: TGIF-Release-master/code/crowdflower/front-end/data/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/gen_test_cases.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/notify.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/parse-res.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/pipeline.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-judgments.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-rest.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/review-test.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/set-diff.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/shuffle-test.py \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/data/update-test.py \n", " creating: TGIF-Release-master/code/crowdflower/front-end/layout/\n", " extracting: 
TGIF-Release-master/code/crowdflower/front-end/layout/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/forgive.js \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/instructions.md \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/main.html \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/main.js \n", " inflating: TGIF-Release-master/code/crowdflower/front-end/layout/view.css \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/\n", " extracting: TGIF-Release-master/code/crowdflower/table3-rating/.gitignore \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/requirements.txt \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/routes.py \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/static/\n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/static/view.css \n", " creating: TGIF-Release-master/code/crowdflower/table3-rating/templates/\n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/templates/_formhelper.html \n", " inflating: TGIF-Release-master/code/crowdflower/table3-rating/templates/submit.html \n", " creating: TGIF-Release-master/code/gif2txt-lstm/\n", " inflating: TGIF-Release-master/code/gif2txt-lstm/README.md \n", " inflating: TGIF-Release-master/code/gif2txt-lstm/caffe-rnn.patch \n", " creating: TGIF-Release-master/code/gif2txt-lstm/models/\n", " inflating: TGIF-Release-master/code/gif2txt-lstm/models/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/\n", " extracting: TGIF-Release-master/code/gifs-filter/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/.gitmodules \n", " inflating: TGIF-Release-master/code/gifs-filter/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/adult-filter/\n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/README.md \n", " inflating: 
TGIF-Release-master/code/gifs-filter/adult-filter/filter.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/gen-raw.sh \n", " creating: TGIF-Release-master/code/gifs-filter/adult-filter/keywords/\n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/keywords/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/adult-filter/parse-tsv.py \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/cluster-by-tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/dump-tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter-images.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter-text.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/filter_tags.py \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/giftypes/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/giftypes/c3d-models-rfc.pkl \n", " creating: TGIF-Release-master/code/gifs-filter/c3d-models/no-motion/\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/no-motion/c3d-models-rfc.pkl \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/predict.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/rank_tags.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/setdiff.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/setinter.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/tag_rules \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d-models/train.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d.patch \n", " creating: TGIF-Release-master/code/gifs-filter/c3d/\n", " extracting: TGIF-Release-master/code/gifs-filter/c3d/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/README.md 
\n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/agg_feat.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/build_deploy.py \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/c3d.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/c3d/deploy.prototxt.in \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/\n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/agg-hash.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/build.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/cluster-pairs.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dedup-v2.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dedup.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/dump-nd.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/extract_hash.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/extract_mhhash.cpp \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/filter-cluster.py \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/match-hash.sh \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/mih/\n", " extracting: TGIF-Release-master/code/gifs-filter/dedup/mih/README.md \n", " creating: TGIF-Release-master/code/gifs-filter/dedup/pHash-0.9.6/\n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/pHash-0.9.6/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/dedup/store-hash.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/email_notify.py \n", " inflating: TGIF-Release-master/code/gifs-filter/full.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/gen_set.py \n", " inflating: TGIF-Release-master/code/gifs-filter/monitor-api.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/monitor.sh \n", " inflating: 
TGIF-Release-master/code/gifs-filter/pipeline.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/prepare-data.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/requirements.txt \n", " inflating: TGIF-Release-master/code/gifs-filter/review-CF.py \n", " inflating: TGIF-Release-master/code/gifs-filter/split-batches.sh \n", " creating: TGIF-Release-master/code/gifs-filter/test/\n", " extracting: TGIF-Release-master/code/gifs-filter/test/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/test/gif.urls \n", " creating: TGIF-Release-master/code/gifs-filter/text-score/\n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/.gitignore \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/Makefile \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/debug.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/filter.hpp \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/group_area.hpp \n", " creating: TGIF-Release-master/code/gifs-filter/text-score/test/\n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/README.md \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/benchmark.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/neg.urls \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/test/pos.urls \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/text-score.cpp \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/text-score.sh \n", " inflating: TGIF-Release-master/code/gifs-filter/text-score/textdetection.cpp \n", " creating: TGIF-Release-master/data/\n", " extracting: TGIF-Release-master/data/.gitignore \n", " creating: TGIF-Release-master/data/GIF2Movie/\n", " inflating: TGIF-Release-master/data/GIF2Movie/M-VAD.tsv \n", " inflating: TGIF-Release-master/data/GIF2Movie/MPII-MD.tsv \n", " inflating: 
TGIF-Release-master/data/README.md \n", " creating: TGIF-Release-master/data/coco-caption/\n", " inflating: TGIF-Release-master/data/eval.py \n", " inflating: TGIF-Release-master/data/results-lstm-cnn-finetune-cvpr16.tsv \n", " creating: TGIF-Release-master/data/splits/\n", " extracting: TGIF-Release-master/data/splits/.gitignore \n", " inflating: TGIF-Release-master/data/splits/test.txt \n", " inflating: TGIF-Release-master/data/splits/train.txt \n", " inflating: TGIF-Release-master/data/splits/val.txt \n", " inflating: TGIF-Release-master/data/tgif-v1.0.tsv \n", " creating: TGIF-Release-master/docs/\n", " creating: TGIF-Release-master/docs/_includes/\n", " inflating: TGIF-Release-master/docs/_includes/authors.html \n", " inflating: TGIF-Release-master/docs/_includes/download.html \n", " inflating: TGIF-Release-master/docs/_includes/examples.html \n", " inflating: TGIF-Release-master/docs/_includes/footer.html \n", " inflating: TGIF-Release-master/docs/_includes/head.html \n", " inflating: TGIF-Release-master/docs/_includes/header.html \n", " inflating: TGIF-Release-master/docs/_includes/nav.html \n", " inflating: TGIF-Release-master/docs/_includes/overview.html \n", " creating: TGIF-Release-master/docs/_layouts/\n", " inflating: TGIF-Release-master/docs/_layouts/default.html \n", " creating: TGIF-Release-master/docs/css/\n", " inflating: TGIF-Release-master/docs/css/main.scss \n", " extracting: TGIF-Release-master/docs/index.html \n", " creating: TGIF-Release-master/docs/js/\n", " inflating: TGIF-Release-master/docs/js/main.js \n" ] } ], "source": [ "# Use unzip to extract the master.zip file\n", "!unzip master.zip" ] }, { "cell_type": "markdown", "metadata": { "id": "7agJKFkZ6UGB" }, "source": [ "## Explore the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's explore the downloaded files. The data we want is in the *tgif-v1.0.tsv* file in the *data* folder. We can use the *pandas* library to open the file. 
We need to set the delimiter to `\\t` because the file contains tab-separated values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "1rwBQ3I2Ye7c" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "K8RvBSYSbvUb" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>url</th><th>description</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>https://38.media.tumblr.com/9f6c25cc350f12aa74...</td><td>a man is glaring, and someone with sunglasses ...</td></tr>\n", "    <tr><th>1</th><td>https://38.media.tumblr.com/9ead028ef62004ef6a...</td><td>a cat tries to catch a mouse on a tablet</td></tr>\n", "    <tr><th>2</th><td>https://38.media.tumblr.com/9f43dc410be85b1159...</td><td>a man dressed in red is dancing.</td></tr>\n", "    <tr><th>3</th><td>https://38.media.tumblr.com/9f659499c8754e40cf...</td><td>an animal comes close to another in the jungle</td></tr>\n", "    <tr><th>4</th><td>https://38.media.tumblr.com/9ed1c99afa7d714118...</td><td>a man in a hat adjusts his tie and makes a wei...</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " url \\\n", "0 https://38.media.tumblr.com/9f6c25cc350f12aa74... \n", "1 https://38.media.tumblr.com/9ead028ef62004ef6a... \n", "2 https://38.media.tumblr.com/9f43dc410be85b1159... \n", "3 https://38.media.tumblr.com/9f659499c8754e40cf... \n", "4 https://38.media.tumblr.com/9ed1c99afa7d714118... \n", "\n", " description \n", "0 a man is glaring, and someone with sunglasses ... \n", "1 a cat tries to catch a mouse on a tablet \n", "2 a man dressed in red is dancing. \n", "3 an animal comes close to another in the jungle \n", "4 a man in a hat adjusts his tie and makes a wei... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the dataset into a pandas dataframe\n", "df = pd.read_csv(\n", "    \"./TGIF-Release-master/data/tgif-v1.0.tsv\",\n", "    delimiter=\"\\t\",\n", "    names=['url', 'description']\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note that the dataset does not contain the actual GIF files, only URLs we can use to access them. This is convenient because we do not need to download or store the GIFs; we can load the required GIFs directly from their URLs when displaying the search results.*\n", "\n", "Some GIF URLs appear more than once in the dataset, each time with a different description." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "125782" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "102068" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of *unique* GIFs in the dataset\n", "len(df[\"url\"].unique())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "https://38.media.tumblr.com/ddbfe51aff57fd8446f49546bc027bd7/tumblr_nowv0v6oWj1uwbrato1_500.gif 4\n", "https://33.media.tumblr.com/46c873a60bb8bd97bdc253b826d1d7a1/tumblr_nh7vnlXEvL1u6fg3no1_500.gif 4\n", "https://38.media.tumblr.com/b544f3c87cbf26462dc267740bb1c842/tumblr_n98uooxl0K1thiyb6o1_250.gif 4\n", "https://33.media.tumblr.com/88235b43b48e9823eeb3e7890f3d46ef/tumblr_nkg5leY4e21sof15vo1_500.gif 4\n", "https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif 4\n", "Name: url, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dupes = df['url'].value_counts().sort_values(ascending=False)\n", "dupes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at one of these duplicated URLs and its descriptions." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "two girls are singing music pop in a concert\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "a woman sings sang girl on a stage singing\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "two girls on a stage sing into microphones.\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "two girls dressed in black are singing.\n" ] } ], "source": [ "dupe_url = \"https://33.media.tumblr.com/88235b43b48e9823eeb3e7890f3d46ef/tumblr_nkg5leY4e21sof15vo1_500.gif\"\n", "dupe_df = df[df['url'] == dupe_url]\n", "\n", "# let's take a look at this GIF and its multiple descriptions\n", "for _, gif in dupe_df.iterrows():\n", "    # render the GIF inline from its URL\n", "    HTML(f\"<img src='{gif['url']}'>\")\n", "    print(gif[\"description\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no reason to remove these duplicates: as shown here, every description is accurate. You can spot-check a few of the other URLs, but they all seem to follow the same pattern, with several *accurate* descriptions for a single GIF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That leaves us with 125,782 descriptions for 102,068 GIFs. We will use these descriptions to create *context* vectors that will be indexed in a vector database to create our GIF search tool. Let's take a look at a few more examples of GIFs and their descriptions." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 577 }, "id": "m0_jfDW6hl4C", "outputId": "bcfb0ae3-4c44-4354-e42d-93a3ee35ff2d" }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "a man is glaring, and someone with sunglasses appears.\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "a cat tries to catch a mouse on a tablet\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "a man dressed in red is dancing.\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "an animal comes close to another in the jungle\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "a man in a hat adjusts his tie and makes a weird face.\n" ] } ], "source": [ "for _, gif in df[:5].iterrows():\n", "    # render the GIF inline from its URL\n", "    HTML(f\"<img src='{gif['url']}'>\")\n", "    print(gif[\"description\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that each description accurately describes what is happening in its GIF, so we can use these descriptions to search through our GIFs." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this data, we can build the GIF search tool with just *two* components:\n", "\n", "* a **retriever** to embed GIF descriptions and queries\n", "* a **vector database** to store GIF description embeddings and retrieve relevant GIFs" ] }, { "cell_type": "markdown", "metadata": { "id": "zrKIRGeo6ehR" }, "source": [ "## Initialize Pinecone Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vector database stores vector representations of our GIF descriptions, which we can retrieve using another vector (the query vector). We will use the Pinecone vector database, a fully managed vector database that can store and search through billions of records in milliseconds. You could build this tool with another vector search solution, such as the FAISS library, but you may need to manage the infrastructure yourself.\n", "\n", "To initialize the database, we sign up for a [free Pinecone API key](https://app.pinecone.io/) and `pip install pinecone-client`. Once ready, we initialize our index with:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "Ngbs8wQQoePL" }, "outputs": [], "source": [ "import pinecone\n", "\n", "# Connect to pinecone environment\n", "pinecone.init(\n", "    api_key=\"YOUR_API_KEY\",  # replace with your own Pinecone API key\n", "    environment=\"us-west1-gcp\"\n", ")\n", "\n", "index_name = 'gif-search'\n", "\n", "# check if the gif-search index exists\n", "if index_name not in pinecone.list_indexes():\n", "    # create the index if it does not exist\n", "    pinecone.create_index(\n", "        index_name,\n", "        dimension=384,\n", "        metric=\"cosine\"\n", "    )\n", "\n", "# Connect to the gif-search index we created\n", "index = pinecone.Index(index_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we specify the name of the index where we will store our GIF descriptions and their URLs, the similarity metric, and the embedding dimension of the vectors. The similarity metric and embedding dimension can change depending on the embedding model used. 
Many retrievers use the \"cosine\" metric and 768-dimensional embeddings; here the index is created with `dimension=384`, which must match the output dimension of the retriever model we use." ] }, { "cell_type": "markdown", "metadata": { "id": "D5mGU3ub6kkb" }, "source": [ "## Initialize Retriever" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to initialize our retriever. The retriever will mainly do two things:\n", "\n", "1.\tGenerate embeddings for all the GIF descriptions (context vectors/embeddings)\n", "2.\tGenerate embeddings for the query (query vector/embedding)\n", "\n", "The retriever generates embeddings such that queries and GIF descriptions with similar meanings lie close together in the same vector space. We can then use cosine similarity to measure how close the query and context embeddings are and find the GIFs most relevant to our query.\n", "\n", "We will use a `SentenceTransformer` model based on Microsoft's MPNet as our retriever. This model performs well out-of-the-box when searching based on generic semantic similarity." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "jtqu5O9Y6q8x" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-08-13 16:42:37.258365: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0\n" ] } ], "source": [ "from sentence_transformers import SentenceTransformer" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 465, "referenced_widgets": [ "3bbdb7ac9ddb4e61acbf10e0e322b464", "425825b26a384e158608f34c327e7be7", "9df1cc108ad74d84a52b25ca0e835197", "d9e3f2ddbf5e47ebb86fb20c39079354", "fe02dbbb561147198c3278388cb40d04", "02851f413fa34d12b412e036d785d38e", "1215630e8b53423390353cd56a044c6e", "beeffeb083574f0a8c702b6035474073", "a40d7b1c14944323937291f6a834061b", "2063122bcf15496987749c8cca733a8b", "02796d5819bd4b77b56c6f9cab93c908", "b69d8608edce436c826f87e269a3b1ec", "5e30b575e33843e9b68bc34a84c6e6b2", 
"a6941924044b4956afa6c9b458141007", "a2d42ce089df49b896ba9223f6df2ac2", "b329d7774a1a4817a4e9cea7adfb288d", "4bfd979f8e1a4bd889b846d6affb6d3f", "d2088a6432ed4403b9602c5fecb09246", "071f6d3e71694ddab08360cabf5e7ecd", "f8c2899ce8264b7fa7d5552783c71ad3", "850040d37e2b45ecaa53c82958dd9d6a", "08779212fc3c46629ea7b5ae778b282b", "97b11b05410f4b0682a223efd6eeb570", "b5732b4221f448a8b8ae2363849214eb", "c7c897d4905f4af28a4e0e8e72897f74", "9066e6a9f5384ceaba6d58353c3e0a6a", "5085c2de59b34ffe8902285e42f32401", "8387420e74e94768893a7415243d4a12", "b3fa9e20d0524f0bbbe65575ccd75126", "24d25ddd24084a49b47f02138db4b592", "b4e62b032b884838b46bcc6b62c0c3ca", "a54f0fc3dc184e20bf7660ddbf890a76", "b66d0aedfb414d4c8cac45e0f8d157ac", "d6e028d4612241fba005c9dd3fe12d30", "a765f4490ae44f959bc594bb3b3b719a", "0496b4ff5eec45f08c5a1fa1ef106777", "976ba55ccfc0419d9b357d83eb4e4ab9", "de5a4a6c3495413fa200ab92d876f509", "4fb7a30442f94068bdb7c9e434cd03b6", "242c1203c62b42b9ae32af653267f8c7", "780b68a97f184b8cb7a04e3412d050ab", "f6685ce8e5a843f0b856c64a5b71c7ee", "6d8f66edc20543b9baae3b474201861e", "7d4faf89e1a34c3592a8d8b5c332cf9a", "aaff54fb4ed64fbba356616ebad675c3", "1b85eb15f8584fd5afceddd1ea3ea2e0", "ba4b82c10e804d3ea76592f2d2a3832c", "6d2b705a90a447579864e91618f70676", "a0d67c99dee248f280cbe34fc3de48a6", "42ca2987c03f46e9985a98b4a137f95a", "3d044c94cbd741adbf826306bd882971", "9a612735a6b94182bafe9a0b42648455", "3965cf39aef641dbb2d0b2a362e1c8c2", "369f1ee2f1844327b3074a31c0e12519", "0e1839f9d1fc43b18f84d0f554a54cc2", "272faddd67724961a8bdcccba6b196b0", "922e13c3b6944e65a04d96147dd3c9ad", "aac3d96ceb444486b6f7f760b43ffc79", "04a7db113b11418ab275879f0f8bf162", "fd337d66d3a14867ae76908cd58e313b", "462563ff268e47d5a8c9efffe09604ed", "abbefd5572204fb4b3c0bf7c5597eb8e", "69e75d6d62494209b2ee828d1710a59e", "45240e5ad3a6426bbfc5e3c1880ac781", "67dfd1d8ff4043239f31bb3a53ce4ccc", "6aa46bc05e6b4a139009a4124b0f80ec", "9ab75ac0bf404362ab25e898b172110c", "005d8360dc824d1f869d0eca3b3ec9fd", 
"aa556238d9174c8e83187d079f229b97", "65e7a017905442ae98c6415bcdfe0bb4", "97302900f6604d7991df8c9854d95728", "0ebe49597de647a5a4da64d03c471b76", "40e00bdafe664e6082d5fc4392cbc4bc", "32719d55aa0e49d7bb65e7755fc2f572", "24eb84536c324805bf159da81d8ad509", "6565c0393fee4725abb04e245a9a6fb0", "2fe6b2e1c56b48fe9b056ee8d0d3697c", "e6e9f5860acc4fdda68eafe6abbc99db", "7f84fd3a40bb40dbb742101403e671c9", "48e79feebdc043adae21f571319493db", "2d07707632dc4c60bca7afbbeb928ca6", "71600400fa14406bb81e2e410d116681", "2540fac951cd4883833b586558c9081e", "759f82584cc84eedbbf8fb6142b0d657", "693abe44ba9a463ca9d08e1b6c54673b", "024be013b90d4a309b2f54d979f1269a", "0097b88603d8424f8fc1696b17bd165e", "a1d9893a7c7345a49cd336610e0ba3f5", "fefd20a4306b4dad840a79038db4c0a3", "c985a33ed0064c589e78a72080c382e7", "4e37e65238634a97a91e8449e24a99c2", "0bb8029a6b2e49de98586fa091033387", "fd631c1a5515475a873dceb33e21ac7f", "2b86f4f2dbf34220ab607cb010ad3ef7", "8da6bd57b0f44a65be01086f33313f2a", "24f2ae769e2c4d829dffa14ca8928d3b", "1e64de324773474d994258dbe9ed736f", "c7f564bd62324a048511ff82b8d4c369", "71e6d768200b48ddab9375d0266c6756", "fbd414ab7af64c2f840d19d1315085c4", "f2983bd01cab40469306b9bdea3b3b19", "825ae4cc8891473f8feb1b2820b0c0aa", "8d8f963d9b6c409d8c21e307958013d8", "8f2d54ebde5341808e65a705f15ba036", "1f7356b83d474a0f8ed49a4a5fae5d22", "55e099fba0fb41f999f690884f80843a", "2f0edbcbec1d44378f4bc17535a87b76", "6c709e1a7d174633b603df7bf0804756", "8b88e836a5884b3db7748d5971463e62", "9cb993dbaa5746249fed2df7d8db677c", "87b1f938065644cba745c0b0952526b5", "9d46acccbf1347c1bfd7b61e84591812", "8f240b5dc5274d63bb23ba9979d47163", "783b444819ab4454bceb16578c71df39", "b1c876108e6b4150bf3638fad1a56157", "6afdd5eede004470a8c7eb8709f6182b", "f4ae6887074f443299e95e6cebf971ce", "b3fbdd74fd764c65a25817690c961de6", "d141eee9a0f14702a82cfe1b194f9d89", "eeffd24564624122afc795be6bd1131b", "8075cad347314feabdb19d6902752b27", "b3c2f3d5b805460496a74f2009aaf81c", "4247478ac3b742a7a9407bc69ed179e6", 
"d3abd49739b2493b8d52a062e90ae3d6", "af1cdb5fdb31467aa54bc5b9c1ec255d", "4124c49ef7d9497187b8e29af694d54f", "1e7f1c9aa28e4dcd80f6b9c6b9c06fd6", "73c159f67bec448aa102470f1abd0ca7", "393b7df64cbd4072acf090f3dee2108c", "f2c52012dfd84e39851e297d56669213", "665cc77211d0466cbb2d4a2a95547547", "70ca64c48bad4c42ba0208eecfc3e40e", "176a01c7622143c38482371fe2f6cfaf", "3bf92cfae4b547ce9f91e6b4dd94f25a", "493c065a83ad48fc921b5b11c59c8fed", "c0b2ebd2a44845e885863ee18a4e121e", "c070c05c63ef48908126fac009b33ea4", "33c35634d79f4848a32e6bd46cb1a75b", "c0caab84bfcb4bb69ed70c73e5d20b59", "95fa10da3a1742ea9fb3ae3bf948618a", "00ba06783c7b484f91aceefd8be27782", "8b9224015a1a466f982f6d51e1a8591d", "e71eff0e7d2d4e6a9949c29395ac947d", "a05a15c5b5a34a5cbc7a0647a01826fb", "b4f4e484c3c94424a37f6746a590fffd", "e993d4821d984becbb06458f62252f1d", "93ba13309f954c669989147c0823c06b", "4c5388f0293f4bf384332ba0bc775b1c", "4d2d0d8ce7594980aef48885259dafef", "da474d5e97ef4947936ca0dd34a30126", "d0f5166f81f141a88d282b4eb322ec4c", "98bd1a3b82e74f8ab86645815d8ca526", "8a06eaff96f94f2fb8f95fc92ef87651", "123ad9cff4a74609a452474d73c6738c" ] }, "id": "UB0rVxmppnkm", "outputId": "cd8ed4e8-69a0-4ce9-c974-cda0dca998a8", "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d45cf2e391ff4550bc6210a4145de5c7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/1.18k [00:00\n", " \n", " \n", " ''')\n", " return HTML(data=f'''\n", "
\n", " {''.join(figures)}\n", "
\n", " ''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin testing some queries." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"a dog being confused\")\n", "display_gif(gifs)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"animals being cute\")\n", "display_gif(gifs)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"people being angry\")\n", "display_gif(gifs)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"a man dancing\")\n", "display_gif(gifs)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 122 }, "id": "DGHLvLLQBizb", "outputId": "e92aa493-d3a4-4f76-f1b8-41313acd8100" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"a woman dancing\")\n", "display_gif(gifs)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 122 }, "id": "4VTSoj8xL_u_", "outputId": "1c942c5e-7a21-4789-8ff4-ccbb3ffd0eb7" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"an animal dancing\")\n", "display_gif(gifs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's describe the third GIF above, the ginger dog dancing on its hind legs, and search with that description." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gifs = search_gif(\"a fluffy dog being cute and dancing like a person\")\n", "display_gif(gifs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These look like pretty good results.\n", "\n", "---" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "gif_search.ipynb", "provenance": [] }, "environment": { "kernel": "python3", "name": "common-cu110.m91", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cu110:m91" }, "interpreter": { "hash": "b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "widgets": {} }, "nbformat": 4, "nbformat_minor": 4 }