{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-Jv7Y4hXwt0j" }, "source": [ "# Question duplicates\n", "\n", "We will explore Siamese networks applied to natural language processing. We will also revisit the fundamentals of TensorFlow and use it to implement a more complicated structure. By completing this assignment, we will learn how to implement models with different architectures. \n", "\n", "\n", "## Outline\n", "\n", "- [Overview](#0)\n", "- [Part 1: Importing the Data](#1)\n", "    - [1.1 Loading in the data](#1.1)\n", "    - [1.2 Learn question encoding](#1.2)\n", "- [Part 2: Defining the Siamese model](#2)\n", "    - [2.1 Understanding the Siamese Network](#2.1)\n", "    - [Exercise 01](#ex01)\n", "    - [2.2 Hard Negative Mining](#2.2)\n", "    - [Exercise 02](#ex02)\n", "- [Part 3: Training](#3)\n", "    - [3.1 Training the model](#3.1)\n", "    - [Exercise 03](#ex03)\n", "- [Part 4: Evaluation](#4)\n", "    - [4.1 Evaluating your Siamese network](#4.1)\n", "    - [4.2 Classify](#4.2)\n", "    - [Exercise 04](#ex04)\n", "- [Part 5: Testing with your own questions](#5)\n", "    - [Exercise 05](#ex05)\n", "- [On Siamese networks](#6)\n", "\n", "\n", "### Overview\n", "In particular, in this assignment you will: \n", "\n", "- Learn about Siamese networks\n", "- Understand how the triplet loss works\n", "- Understand how to evaluate accuracy\n", "- Use cosine similarity between the model's output vectors\n", "- Use the data generator to get batches of questions\n", "- Predict using your own model\n", "\n", "By now, you should be familiar with TensorFlow and know how to use it to define your model. We will start this assignment by asking you to create a vocabulary, much as you did in the previous assignments. After this, you will build a classifier that will allow you to identify whether two questions are the same or not. 
\n", "\n", "\n", "\n", "\n", "Your model will take in the two questions, which will be transformed into tensors. Each tensor will then go through an embedding layer and, after that, an LSTM. Finally, you will compare the outputs of the two subnetworks using cosine similarity. \n", "\n", "Before taking a deep dive into the model, you will start by importing the data set and exploring it a bit.\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4sF9Hqzgwt0l" }, "source": [ "###### \n", "# Part 1: Importing the Data\n", "\n", "### 1.1 Loading in the data\n", "\n", "You will be using the 'Quora question answer' dataset to build a model that can identify similar questions. This is a useful task because you don't want to have several versions of the same question posted. When teaching, I often end up responding to similar questions on Piazza or on other community forums. This data set has already been labeled for you. Run the cell below to import some of the packages you will be using. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "deletable": false, "editable": false, "id": "zdACgs491cs2", "outputId": "b31042ef-845b-46b8-c783-185e96b135f7" }, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import random as rnd\n", "import tensorflow as tf\n" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "deletable": false, "editable": false }, "outputs": [], "source": [ "import w3_unittest" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3GYhQRMspitx" }, "source": [ "You will now load the data set. We have done some preprocessing for you. If you have taken the Deep Learning Specialization, this is a slightly different training method than the one you have seen there. If you have not, don't worry; we will explain everything. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 528 }, "colab_type": "code", "deletable": false, "editable": false, "id": "sXWBVGWnpity", "outputId": "afa90d4d-fed7-43b8-bcba-48c95d600ad5", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of question pairs: 404351\n" ] }, { "data": { "text/html": [ "
|  | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
Model: \"SiameseModel\"\n",
"
\n"
],
"text/plain": [
"\u001b[1mModel: \"SiameseModel\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n", "┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to ┃\n", "┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n", "│ input_1 │ (None, 1) │ 0 │ - │\n", "│ (InputLayer) │ │ │ │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ input_2 │ (None, 1) │ 0 │ - │\n", "│ (InputLayer) │ │ │ │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ sequential │ (None, 128) │ 4,768,256 │ input_1[0][0], │\n", "│ (Sequential) │ │ │ input_2[0][0] │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ conc_1_2 │ (None, 256) │ 0 │ sequential[0][0], │\n", "│ (Concatenate) │ │ │ sequential[1][0] │\n", "└─────────────────────┴───────────────────┴────────────┴───────────────────┘\n", "\n" ], "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mConnected to \u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n", "│ input_1 │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m1\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │ - │\n", "│ (\u001b[38;5;33mInputLayer\u001b[0m) │ │ │ │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ input_2 │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m1\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │ - │\n", "│ (\u001b[38;5;33mInputLayer\u001b[0m) │ │ │ │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ sequential │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m128\u001b[0m) │ \u001b[38;5;34m4,768,256\u001b[0m │ 
input_1[\u001b[38;5;34m0\u001b[0m][\u001b[38;5;34m0\u001b[0m], │\n", "│ (\u001b[38;5;33mSequential\u001b[0m) │ │ │ input_2[\u001b[38;5;34m0\u001b[0m][\u001b[38;5;34m0\u001b[0m] │\n", "├─────────────────────┼───────────────────┼────────────┼───────────────────┤\n", "│ conc_1_2 │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m256\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │ sequential[\u001b[38;5;34m0\u001b[0m][\u001b[38;5;34m0\u001b[0m], │\n", "│ (\u001b[38;5;33mConcatenate\u001b[0m) │ │ │ sequential[\u001b[38;5;34m1\u001b[0m][\u001b[38;5;34m0\u001b[0m] │\n", "└─────────────────────┴───────────────────┴────────────┴───────────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Total params: 4,768,256 (18.19 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Total params: \u001b[0m\u001b[38;5;34m4,768,256\u001b[0m (18.19 MB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Trainable params: 4,768,256 (18.19 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m4,768,256\u001b[0m (18.19 MB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Non-trainable params: 0 (0.00 B)\n", "\n" ], "text/plain": [ "\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Model: \"sequential\"\n",
"
\n"
],
"text/plain": [
"\u001b[1mModel: \"sequential\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n", "┃ Layer (type) ┃ Output Shape ┃ Param # ┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n", "│ text_vectorization │ (None, None) │ 0 │\n", "│ (TextVectorization) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ embedding (Embedding) │ (None, None, 128) │ 4,636,672 │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ LSTM (LSTM) │ (None, None, 128) │ 131,584 │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ mean (GlobalAveragePooling1D) │ (None, 128) │ 0 │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ out (Lambda) │ (None, 128) │ 0 │\n", "└─────────────────────────────────┴────────────────────────┴───────────────┘\n", "\n" ], "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n", "│ text_vectorization │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;45mNone\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n", "│ (\u001b[38;5;33mTextVectorization\u001b[0m) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ embedding (\u001b[38;5;33mEmbedding\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m128\u001b[0m) │ \u001b[38;5;34m4,636,672\u001b[0m │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ LSTM (\u001b[38;5;33mLSTM\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m128\u001b[0m) │ \u001b[38;5;34m131,584\u001b[0m 
│\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ mean (\u001b[38;5;33mGlobalAveragePooling1D\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m128\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ out (\u001b[38;5;33mLambda\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m128\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n", "└─────────────────────────────────┴────────────────────────┴───────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Total params: 4,768,256 (18.19 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Total params: \u001b[0m\u001b[38;5;34m4,768,256\u001b[0m (18.19 MB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Trainable params: 4,768,256 (18.19 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m4,768,256\u001b[0m (18.19 MB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Non-trainable params: 0 (0.00 B)\n", "\n" ], "text/plain": [ "\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# check your model\n", "model = Siamese(text_vectorization, vocab_size=text_vectorization.vocabulary_size())\n", "model.build(input_shape=None)\n", "model.summary()\n", "model.get_layer(name='sequential').summary()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LMK9zqhHpiuo" }, "source": [ "**Expected output:** \n", "\n", "\n", "\n", "```\n", "Model: \"SiameseModel\"\n", "__________________________________________________________________________________________________\n", " Layer (type) Output Shape Param # Connected to \n", "==================================================================================================\n", " input_1 (InputLayer) [(None, 1)] 0 [] \n", " \n", " input_2 (InputLayer) [(None, 1)] 0 [] \n", " \n", " sequential (Sequential) (None, 128) 4768256 ['input_1[0][0]', \n", " 'input_2[0][0]'] \n", " \n", " conc_1_2 (Concatenate) (None, 256) 0 ['sequential[0][0]', \n", " 'sequential[1][0]'] \n", " \n", "==================================================================================================\n", "Total params: 4768256 (18.19 MB)\n", "Trainable params: 4768256 (18.19 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "__________________________________________________________________________________________________\n", "Model: \"sequential\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " text_vectorization (TextVe (None, None) 0 \n", " ctorization) \n", " \n", " embedding (Embedding) (None, None, 128) 4636672 \n", " \n", " LSTM (LSTM) (None, None, 128) 131584 \n", " \n", " mean (GlobalAveragePooling (None, 128) 0 \n", " 1D) \n", " \n", " out (Lambda) (None, 128) 0 \n", 
" \n", "=================================================================\n", "Total params: 4768256 (18.19 MB)\n", "Trainable params: 4768256 (18.19 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n", "```\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also draw the model for a clearer view of your Siamese network" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "deletable": false, "editable": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "You must install pydot (`pip install pydot`) for `plot_model` to work.\n" ] } ], "source": [ "tf.keras.utils.plot_model(\n", " model,\n", " to_file=\"model.png\",\n", " show_shapes=True,\n", " show_dtype=True,\n", " show_layer_names=True,\n", " rankdir=\"TB\",\n", " expand_nested=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KVo1Gvripiuo" }, "source": [ "\n", "\n", "### 2.2 Hard Negative Mining\n", "\n", "\n", "You will now implement the `TripletLoss` with hard negative mining.