kusumakar commited on
Commit
2ef108a
1 Parent(s): f9e9301

Upload Word_Embeddings_with_Keras_sentiment_analysis.ipynb

Browse files
Word_Embeddings_with_Keras_sentiment_analysis.ipynb ADDED
@@ -0,0 +1 @@
 
 
1
+ {"cells":[{"cell_type":"markdown","metadata":{"id":"RuSGAe_9YCv6"},"source":["#### Keras Embedding Layer\n","Keras offers an Embedding layer that can be used for neural networks on text data.\n","\n","\n","\n","It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.\n","\n","Embedding layer can be used:\n","\n"," * Alone to learn a word embedding that can be saved and used in another model later.\n"," * As part of a deep learning model where the embedding is learned along with the model itself.\n"," * To load a pre-trained word embedding model, a type of transfer learning.\n","\n","\n","Keras __Embedding__ turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. This layer can only be used as the first layer in a model.\n","\n","\n","The Embedding layer is defined as the first hidden layer of a network. \n","\n","Imp Arguments:\n","\n"," input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1. e.g. if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.\n"," output_dim: int >= 0. Dimension of the dense embedding. It defines the size of the output vectors from this layer for each word.\n"," input_length: Length of input sequences. For example, if all of your input documents are comprised of 1000 words, this would be 1000."]},{"cell_type":"code","execution_count":62,"metadata":{"id":"UCV_d34xYCv-","executionInfo":{"status":"ok","timestamp":1667126163033,"user_tz":-330,"elapsed":829,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["import tensorflow as tf\n","import numpy as np"]},{"cell_type":"code","execution_count":63,"metadata":{"id":"Ph14WQFDYCv_","executionInfo":{"status":"ok","timestamp":1667126163573,"user_tz":-330,"elapsed":12,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["from numpy import zeros\n","from numpy import asarray\n","\n","from tensorflow.keras.preprocessing.text import Tokenizer\n","\n","from tensorflow.keras.preprocessing.sequence import pad_sequences\n","\n","from tensorflow.keras.models import Sequential\n","\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.layers import Flatten,Embedding\n"]},{"cell_type":"code","execution_count":64,"metadata":{"id":"MYD-VuuwYCwA","executionInfo":{"status":"ok","timestamp":1667126163574,"user_tz":-330,"elapsed":12,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["# fix random seed for reproducibility\n","np.random.seed(123)\n","tf.random.set_seed(123)"]},{"cell_type":"markdown","metadata":{"id":"M1Wp-9tcYCwA"},"source":["###### Data:\n","Have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem."]},{"cell_type":"code","execution_count":65,"metadata":{"id":"ki-E7eHgYCwB","executionInfo":{"status":"ok","timestamp":1667126163575,"user_tz":-330,"elapsed":11,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["# define documents\n","docs = ['Well done!',\n"," 'Good work',\n"," 'Great effort',\n"," 'nice work',\n"," 'Excellent!',\n"," 'Weak',\n"," 'Poor effort!',\n"," 'not good',\n"," 'poor work',\n"," 'Technique 1: The len() method to find the length of a list in Python. Python has got in-built method – len() to find the size of the list i.e. the length of the list. The len() method accepts an iterable as an argument and it counts and returns the number of elements present in the list.']\n","\n","# define class labels\n","# positive is 1 and negative is 0\n","labels = [1,1,1,1,1,0,0,0,0,0]"]},{"cell_type":"markdown","metadata":{"id":"vVOGFcMaYCwB"},"source":["Integer encode each document. This means that as input the Embedding layer will have sequences of integers. "]},{"cell_type":"markdown","metadata":{"id":"y9O2QeOXYCwC"},"source":["__Tokenizer__\n","\n"," Class for vectorizing texts, or/and turning texts into sequences (=list of word indexes, where the word of rank i in the dataset (starting at 1) has index i).\n","\n","__fit_on_texts(texts)__\n","\n"," Arguments: \n"," texts: list of texts to train on.\n"," \n","* __fit_on_texts( )__ - Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. \n"," \n","__word_index__ attribute: \n","\n"," Dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.\n"," \n","* https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do"]},{"cell_type":"code","execution_count":66,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"qGHpFIT6YCwD","outputId":"40c0ea3f-ee94-45d8-b7df-408fbbd0a0d8","executionInfo":{"status":"ok","timestamp":1667126163575,"user_tz":-330,"elapsed":10,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["OrderedDict([('well', 1), ('done', 1), ('good', 2), ('work', 3), ('great', 1), ('effort', 2), ('nice', 1), ('excellent', 1), ('weak', 1), ('poor', 2), ('not', 1), ('technique', 1), ('1', 1), ('the', 9), ('len', 3), ('method', 3), ('to', 2), ('find', 2), ('length', 2), ('of', 4), ('a', 1), ('list', 4), ('in', 3), ('python', 2), ('has', 1), ('got', 1), ('built', 1), ('–', 1), ('size', 1), ('i', 1), ('e', 1), ('accepts', 1), ('an', 2), ('iterable', 1), ('as', 1), ('argument', 1), ('and', 2), ('it', 1), ('counts', 1), ('returns', 1), ('number', 1), ('elements', 1), ('present', 1)])\n","{'the': 1, 'of': 2, 'list': 3, 'work': 4, 'len': 5, 'method': 6, 'in': 7, 'good': 8, 'effort': 9, 'poor': 10, 'to': 11, 'find': 12, 'length': 13, 'python': 14, 'an': 15, 'and': 16, 'well': 17, 'done': 18, 'great': 19, 'nice': 20, 'excellent': 21, 'weak': 22, 'not': 23, 'technique': 24, '1': 25, 'a': 26, 'has': 27, 'got': 28, 'built': 29, '–': 30, 'size': 31, 'i': 32, 'e': 33, 'accepts': 34, 'iterable': 35, 'as': 36, 'argument': 37, 'it': 38, 'counts': 39, 'returns': 40, 'number': 41, 'elements': 42, 'present': 43}\n","44\n"]}],"source":["# Prepare tokenizer\n","# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer\n","\n","t = Tokenizer()\n","\n","t.fit_on_texts(docs)\n","print(t.word_counts)\n","print(t.word_index)\n","\n","vocab_size = len(t.word_index) + 1\n","print (vocab_size)"]},{"cell_type":"markdown","metadata":{"id":"_6FQoQMMYCwE"},"source":["__texts_to_sequences(texts)__\n","\n"," Arguments:\n"," texts: list of texts to turn to sequences.\n"," Return: list of sequences (one per text input).\n","\n","* __texts_to_sequences( )__ - Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary."]},{"cell_type":"code","execution_count":67,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"R1caiv0rYCwE","outputId":"f23c8feb-db54-4ece-b9ff-ba493dbff285","executionInfo":{"status":"ok","timestamp":1667126175618,"user_tz":-330,"elapsed":681,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!', 'Weak', 'Poor effort!', 'not good', 'poor work', 'Technique 1: The len() method to find the length of a list in Python. Python has got in-built method – len() to find the size of the list i.e. the length of the list. The len() method accepts an iterable as an argument and it counts and returns the number of elements present in the list.']\n","[[17, 18], [8, 4], [19, 9], [20, 4], [21], [22], [10, 9], [23, 8], [10, 4], [24, 25, 1, 5, 6, 11, 12, 1, 13, 2, 26, 3, 7, 14, 14, 27, 28, 7, 29, 6, 30, 5, 11, 12, 1, 31, 2, 1, 3, 32, 33, 1, 13, 2, 1, 3, 1, 5, 6, 34, 15, 35, 36, 15, 37, 16, 38, 39, 16, 40, 1, 41, 2, 42, 43, 7, 1, 3]]\n"]}],"source":["# integer encode the documents\n","encoded_docs = t.texts_to_sequences(docs)\n","print(docs)\n","print(encoded_docs)"]},{"cell_type":"markdown","metadata":{"id":"H8sEb8fwYCwF"},"source":["The sequences have different lengths and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to have the length of 4. Again, we can do this with a built in Keras's pad_sequences() function."]},{"cell_type":"code","execution_count":68,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"_Mss1NmPYCwF","outputId":"51745edf-3009-4d45-f81b-71f4d5e6148c","executionInfo":{"status":"ok","timestamp":1667126185782,"user_tz":-330,"elapsed":499,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["[[17 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [ 8 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [19 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [20 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [10 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [23 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [10 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"," 0 0 0 0 0 0 0 0 0 0]\n"," [24 25 1 5 6 11 12 1 13 2 26 3 7 14 14 27 28 7 29 6 30 5 11 12\n"," 1 31 2 1 3 32 33 1 13 2 1 3 1 5 6 34 15 35 36 15 37 16 38 39\n"," 16 40 1 41 2 42 43 7 1 3]]\n"]}],"source":["from tensorflow.python.ops.math_ops import Any\n","# pad documents to a max length of 4 words this is auto i hAVE used\n","padded_docs = pad_sequences(encoded_docs,padding='post')\n","print(padded_docs)\n"]},{"cell_type":"code","source":["# finding the max input length\n","print(padded_docs.size)\n","print(len(padded_docs))\n","max_input_length =(padded_docs.size)/len(padded_docs)\n","print(max_input_length)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"D4l2D2-sctya","executionInfo":{"status":"ok","timestamp":1667126874857,"user_tz":-330,"elapsed":7,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}},"outputId":"b4441d68-7d96-4229-fd83-190c245d81d2"},"execution_count":86,"outputs":[{"output_type":"stream","name":"stdout","text":["580\n","10\n","58.0\n"]}]},{"cell_type":"code","execution_count":78,"metadata":{"id":"J4l3PqwXYCwF","executionInfo":{"status":"ok","timestamp":1667126555498,"user_tz":-330,"elapsed":12,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["labels=np.array(labels)"]},{"cell_type":"code","source":["print(labels)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FlDvumisqWTr","outputId":"241e4af5-c7e8-4d13-db29-459d5ef55278","executionInfo":{"status":"ok","timestamp":1667126557831,"user_tz":-330,"elapsed":4,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"execution_count":79,"outputs":[{"output_type":"stream","name":"stdout","text":["[1 1 1 1 1 0 0 0 0 0]\n"]}]},{"cell_type":"markdown","metadata":{"id":"fd-zoIcSYCwG"},"source":["The Embedding has a vocabulary of 15 and an input length of 4. We will choose a small embedding space of 8 dimensions.\n","\n","The model is a simple binary classification model. \n","\n","Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to a one 32-element vector to pass on to the Dense output layer."]},{"cell_type":"code","execution_count":81,"metadata":{"id":"_GfKGgMjYCwG","executionInfo":{"status":"ok","timestamp":1667126597038,"user_tz":-330,"elapsed":5,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["# define the model\n","model = Sequential()\n","model.add(Embedding(vocab_size, 8, input_length= int(max_input_length) ))\n","model.add(Flatten())\n","model.add(Dense(1, activation='sigmoid'))"]},{"cell_type":"code","execution_count":82,"metadata":{"id":"IMFCvRbLYCwG","executionInfo":{"status":"ok","timestamp":1667126601230,"user_tz":-330,"elapsed":7,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[],"source":["# compile the model\n","model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])"]},{"cell_type":"code","execution_count":83,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ZKdCWQSBYCwH","outputId":"11796b2f-bd18-4896-c1f9-a14a392a5d90","executionInfo":{"status":"ok","timestamp":1667126604406,"user_tz":-330,"elapsed":794,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Model: \"sequential_10\"\n","_________________________________________________________________\n"," Layer (type) Output Shape Param # \n","=================================================================\n"," embedding_10 (Embedding) (None, 58, 8) 352 \n"," \n"," flatten_8 (Flatten) (None, 464) 0 \n"," \n"," dense_8 (Dense) (None, 1) 465 \n"," \n","=================================================================\n","Total params: 817\n","Trainable params: 817\n","Non-trainable params: 0\n","_________________________________________________________________\n","None\n"]}],"source":["# summarize the model\n","print(model.summary())"]},{"cell_type":"markdown","source":["for embedding param is 352 i.e 44*8(total number of unique word * output dimension)"],"metadata":{"id":"2LnrGlpIfziC"}},{"cell_type":"code","source":["embeddings = model.layers[0].get_weights()[0]"],"metadata":{"id":"XyvrjuT4rvTr","executionInfo":{"status":"ok","timestamp":1667126963814,"user_tz":-330,"elapsed":760,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"execution_count":87,"outputs":[]},{"cell_type":"code","source":["embeddings"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HKaxw1d7sevQ","outputId":"d61b5c1d-9bd1-49d4-f373-141a9b9066c1","executionInfo":{"status":"ok","timestamp":1667126965849,"user_tz":-330,"elapsed":6,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"execution_count":88,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[-3.73846889e-02, 7.27512687e-03, -2.00686697e-02,\n"," 4.61835787e-03, 2.20515765e-02, 2.88953297e-02,\n"," -1.92318801e-02, -1.82889774e-03],\n"," [ 1.53775252e-02, -3.79007459e-02, -1.37082562e-02,\n"," -2.55100019e-02, -2.73043867e-02, 2.01003626e-03,\n"," 3.68466265e-02, -2.13810559e-02],\n"," [-8.15294683e-04, 1.74988396e-02, -5.98294660e-03,\n"," -1.07008219e-03, 3.05635221e-02, 3.97428386e-02,\n"," 1.36615969e-02, 1.20054930e-04],\n"," [-1.61092356e-03, 2.52256282e-02, 3.06485929e-02,\n"," -9.68657434e-04, -8.23821872e-03, 4.19750698e-02,\n"," -1.95727702e-02, 4.12120558e-02],\n"," [ 1.48617961e-02, -1.51228905e-02, 1.58478655e-02,\n"," 3.37831415e-02, 3.04305442e-02, -1.41870491e-02,\n"," -1.00730881e-02, -1.44804642e-03],\n"," [-4.17529456e-02, -7.88740069e-03, -4.96819504e-02,\n"," 4.88086231e-02, 6.93566725e-03, -3.47149149e-02,\n"," 8.68083164e-03, 4.95600142e-02],\n"," [-4.99248281e-02, 5.38706779e-03, -2.82434821e-02,\n"," -4.14521694e-02, 2.05350034e-02, 8.14529508e-03,\n"," -4.52499390e-02, 6.67800754e-03],\n"," [-3.73555347e-03, 2.55221166e-02, -7.97390938e-04,\n"," -1.59323812e-02, 3.76414768e-02, -3.68352905e-02,\n"," 1.70810856e-02, 2.46713050e-02],\n"," [ 1.59634240e-02, 3.53474542e-03, -1.38340816e-02,\n"," 4.54132669e-02, 3.74595411e-02, 1.13742836e-02,\n"," 9.48905945e-05, 4.01855633e-03],\n"," [-2.17728615e-02, -4.96674553e-02, -3.34239602e-02,\n"," 7.54372030e-03, -1.67850964e-02, -3.78339365e-03,\n"," 3.43035199e-02, 2.40279324e-02],\n"," [ 2.84449197e-02, 4.12758924e-02, -3.59766968e-02,\n"," 3.29828262e-03, 2.11201347e-02, 2.09593438e-02,\n"," 3.04187052e-02, 1.36537291e-02],\n"," [ 1.97747387e-02, -4.58305739e-02, 2.86179446e-02,\n"," 9.67669487e-03, -4.10770886e-02, -2.17715651e-03,\n"," 3.10718156e-02, -9.64919478e-03],\n"," [ 3.94263379e-02, -4.09096852e-02, 1.52977593e-02,\n"," -9.12551954e-03, -1.62853710e-02, 3.41033377e-02,\n"," 2.09313072e-02, 3.22674029e-02],\n"," [ 6.24610111e-03, 2.39354856e-02, -4.62426208e-02,\n"," -3.53406183e-02, 3.88574488e-02, -4.05268744e-03,\n"," -4.61149216e-02, 9.33254883e-03],\n"," [-3.63632068e-02, 2.90089734e-02, 4.35407199e-02,\n"," -3.51608284e-02, 3.04072611e-02, -2.25334167e-02,\n"," -4.56764810e-02, 4.08405103e-02],\n"," [-2.67855171e-02, -3.11880708e-02, -2.53331549e-02,\n"," 2.16104649e-02, 4.53344621e-02, 1.07851140e-02,\n"," -2.84960028e-02, -4.10287380e-02],\n"," [ 5.12887165e-03, 6.28210232e-03, -1.06710196e-03,\n"," -2.53872946e-03, -2.97323465e-02, -4.41568270e-02,\n"," -3.59994285e-02, -1.09092817e-02],\n"," [-3.34452987e-02, -4.86493818e-02, -4.11182642e-02,\n"," -2.34498736e-02, 1.66600607e-02, 2.87305452e-02,\n"," -2.51011848e-02, -1.65385008e-03],\n"," [ 1.76581629e-02, -1.24146566e-02, -3.19283754e-02,\n"," -9.19172913e-03, -3.63197923e-02, 3.90795581e-02,\n"," -3.49396355e-02, 3.72645371e-02],\n"," [ 3.86117771e-03, 2.66224779e-02, 9.79762152e-03,\n"," 3.50576378e-02, -1.72933228e-02, 3.04682143e-02,\n"," 1.88184641e-02, -3.35905552e-02],\n"," [-1.49324648e-02, 2.87554413e-03, 2.43748538e-02,\n"," -1.92109346e-02, 1.29871406e-02, 9.85000283e-03,\n"," 2.26676203e-02, 4.53256480e-02],\n"," [-3.71024497e-02, 1.45847462e-02, -4.37489152e-02,\n"," -4.47130203e-03, -4.69660871e-02, 4.63272072e-02,\n"," -4.84150425e-02, 1.08601339e-02],\n"," [-6.20715693e-03, 3.88504304e-02, -1.44787543e-02,\n"," 2.46851519e-03, -2.60219108e-02, 2.93329842e-02,\n"," -4.24114466e-02, -3.40846069e-02],\n"," [ 2.38288902e-02, 2.75624134e-02, 2.22815610e-02,\n"," 4.17799838e-02, 3.52298506e-02, 1.42545737e-02,\n"," -3.31473351e-02, 2.27704532e-02],\n"," [-1.29266605e-02, 2.56579034e-02, 3.85562219e-02,\n"," -1.46364197e-02, -2.00973395e-02, -4.31649685e-02,\n"," -3.82302403e-02, 3.38974334e-02],\n"," [ 7.70139694e-03, 3.03115584e-02, -1.38511434e-02,\n"," 1.44773833e-02, 4.27861847e-02, -3.97859700e-02,\n"," -3.09140608e-03, -4.79584225e-02],\n"," [-3.54036205e-02, -1.01224296e-02, 1.97085254e-02,\n"," 2.78046615e-02, 6.19192049e-03, -1.99920665e-02,\n"," 3.59411575e-02, -4.16038521e-02],\n"," [ 4.20440547e-02, 4.44380082e-02, 1.89738013e-02,\n"," -4.41905111e-03, 4.57917713e-02, -4.02773395e-02,\n"," 6.38096407e-03, -1.05386972e-02],\n"," [ 3.39266770e-02, 3.13455574e-02, 4.58661579e-02,\n"," 3.40985097e-02, 4.34301011e-02, -2.18667835e-03,\n"," -4.58869115e-02, 1.72469765e-03],\n"," [ 2.16239206e-02, 9.58836079e-03, -1.71268955e-02,\n"," -1.76198110e-02, 1.19636208e-03, 3.17904465e-02,\n"," -4.41205502e-03, 1.07775331e-02],\n"," [ 2.10052170e-02, -1.05173476e-02, -1.31513104e-02,\n"," -1.99245289e-03, 8.50247219e-03, -1.75064914e-02,\n"," 3.30781825e-02, 6.77232817e-03],\n"," [-3.21600437e-02, -2.35277303e-02, 1.68073177e-03,\n"," -2.74067167e-02, 3.38590629e-02, 4.94538806e-02,\n"," -1.32053979e-02, -1.17713101e-02],\n"," [-9.37622786e-03, -3.43824737e-02, 1.76773220e-03,\n"," 2.72498243e-02, 2.06131972e-02, -4.88992445e-02,\n"," -2.10747607e-02, 4.71951626e-02],\n"," [-1.19681284e-03, -1.20656863e-02, 3.47117893e-02,\n"," 3.43601741e-02, -3.83443609e-02, -2.91424636e-02,\n"," 3.58615257e-02, -2.60264874e-02],\n"," [ 3.42187397e-02, -9.38353688e-03, -2.98652183e-02,\n"," 3.02135088e-02, 6.96555525e-03, 1.10629313e-02,\n"," 3.14553417e-02, -2.33733188e-02],\n"," [ 2.77623050e-02, 4.60577644e-02, 2.50572450e-02,\n"," 3.70040797e-02, 1.15445256e-02, 2.51598842e-02,\n"," -1.41607746e-02, 4.74539734e-02],\n"," [-6.14858791e-03, 4.77262847e-02, 4.83403318e-02,\n"," 3.52835171e-02, 3.19558270e-02, 6.82250410e-03,\n"," 1.98443048e-02, -8.56604427e-03],\n"," [-3.02298795e-02, -1.83042176e-02, -4.93257307e-02,\n"," 4.09810580e-02, 3.15879025e-02, 3.02027203e-02,\n"," -3.33208218e-02, 5.14528900e-03],\n"," [ 8.53549689e-04, -1.46236792e-02, 1.13251917e-02,\n"," -7.70412758e-03, -2.49325037e-02, 4.12852690e-03,\n"," -3.26333866e-02, -1.03624910e-03],\n"," [ 4.62754816e-03, 1.25866793e-02, -8.11139494e-03,\n"," 1.36259235e-02, -4.97647412e-02, 4.89428975e-02,\n"," 2.87425630e-02, -4.77960370e-02],\n"," [-6.31583855e-03, 4.63945158e-02, -3.72473225e-02,\n"," 4.69228514e-02, -4.29987311e-02, 8.23800638e-03,\n"," 2.79261917e-03, -4.03857604e-02],\n"," [-2.27055084e-02, -3.34338173e-02, -4.63545434e-02,\n"," 3.42030264e-02, -4.11571972e-02, -2.77714618e-02,\n"," -4.93923202e-02, -3.16617638e-02],\n"," [ 2.49072202e-02, 3.23448218e-02, 3.21061648e-02,\n"," 3.26068513e-02, -1.61221623e-02, 9.67570394e-03,\n"," -2.64284741e-02, -2.81318780e-02],\n"," [ 1.97460763e-02, 4.28152420e-02, 4.88588959e-03,\n"," 4.00176384e-02, 2.55079605e-02, -6.54534250e-03,\n"," 4.25359868e-02, -8.58135149e-03]], dtype=float32)"]},"metadata":{},"execution_count":88}]},{"cell_type":"code","source":["embeddings.shape"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"q0_FzSARrUjk","outputId":"1e33ad6e-3ea1-47e3-ccbc-213fed3df365","executionInfo":{"status":"ok","timestamp":1667126970505,"user_tz":-330,"elapsed":721,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"execution_count":89,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(44, 8)"]},"metadata":{},"execution_count":89}]},{"cell_type":"code","execution_count":90,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"b5cPBwAvYCwH","outputId":"6be27f90-b4e1-480c-844c-dcc2fd45e0b4","executionInfo":{"status":"ok","timestamp":1667126973452,"user_tz":-330,"elapsed":736,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":["<keras.callbacks.History at 0x7fe3d3efca90>"]},"metadata":{},"execution_count":90}],"source":["# fit the model\n","model.fit(padded_docs, labels, epochs=50,verbose=0)"]},{"cell_type":"code","execution_count":91,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"bT9_DvzEYCwH","outputId":"3a1006fb-9b9e-484f-8492-8297b297e121","executionInfo":{"status":"ok","timestamp":1667126976926,"user_tz":-330,"elapsed":628,"user":{"displayName":"Putturu kusumakar Reddy","userId":"07552141107752951949"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Accuracy: 60.000002\n"]}],"source":["# evaluate the model\n","loss, accuracy = model.evaluate(padded_docs, labels,verbose=0)\n","print('Accuracy: %f' % (accuracy*100))"]},{"cell_type":"markdown","metadata":{"id":"oGrL93otYCwI"},"source":["You could save the learned weights from the Embedding layer to file for later use in other models.\n","\n","You could also use this model generally to classify other documents that have the same kind vocabulary seen in the test dataset.\n","\n","* https://stackoverflow.com/questions/51235118/how-to-get-word-vectors-from-keras-embedding-layer"]},{"cell_type":"markdown","metadata":{"id":"BRROrcnZYCwI"},"source":["## Using Pre-Trained GloVe Embedding\n","\n","The Keras Embedding layer can also use a word embedding learned elsewhere.\n","\n","It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.\n","\n","For example, the researchers behind GloVe method provide a suite of pre-trained word embeddings on their website released under a public domain license.\n","\n","The smallest package of embeddings is 822Mb, called “glove.6B.zip“. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.\n","\n","You can download this collection of embeddings from https://nlp.stanford.edu/projects/glove/ and we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.\n","\n","After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.\n","\n","\n","If you peek inside the file, you will see a token (word) followed by the weights (100 numbers) on each line. "]},{"cell_type":"markdown","metadata":{"id":"iHTjgfGiYCwI"},"source":["###### load the GloVe word embedding file into memory as a dictionary of word to embedding array.\n","\n","__Note__: Filter the embedding for the unique words in the training data.\n"]},{"cell_type":"code","source":["#### mount google drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"A2sTLGBN2ZHV","outputId":"754c7df5-6e95-4f70-ebf0-26b0e3e7e507"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DDxw6zOwYCwJ","outputId":"6c19aca0-3058-4bc8-c5ce-96ce350ccabb"},"outputs":[{"output_type":"stream","name":"stdout","text":["Loaded 400000 word vectors.\n"]}],"source":["# load the whole embedding into memory\n","embeddings_index = dict()\n","\n","f = open('/content/drive/MyDrive/NLP/Deep_Learning/PGP/WordEmbeddings/glove.6B.50d.txt')\n","\n","for line in f:\n"," values = line.split()\n"," word = values[0]\n"," coefs = asarray(values[1:], dtype='float32')\n"," embeddings_index[word] = coefs\n","f.close()\n","print('Loaded %s word vectors.' % len(embeddings_index))"]},{"cell_type":"markdown","metadata":{"id":"0bNG6kKHYCwJ"},"source":["Next, create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.\n","\n","The result is a matrix of weights only for words we will see during training."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"k5vQc8REYCwJ"},"outputs":[],"source":["# Example to create a zero matrix\n","embedding_matrix_1 = zeros((vocab_size, 5))"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8EGhNhrIYCwK","outputId":"2335b6fe-e6a5-4856-b8ea-559cf2d6a6d4"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.],\n"," [0., 0., 0., 0., 0.]])"]},"metadata":{},"execution_count":28}],"source":["embedding_matrix_1"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Y8-wBNfFYCwK","outputId":"504cc697-f786-4f3a-f609-41414d1a9cf2"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["dict_items([('work', 1), ('done', 2), ('good', 3), ('effort', 4), ('poor', 5), ('well', 6), ('great', 7), ('nice', 8), ('excellent', 9), ('weak', 10), ('not', 11), ('could', 12), ('have', 13), ('better', 14)])"]},"metadata":{},"execution_count":29}],"source":["t.word_index.items()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"nSgUxD3EYCwK","outputId":"b97bfa47-7ebd-4d78-c7dc-eb68226b6f6f","colab":{"base_uri":"https://localhost:8080/"}},"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([-0.26241 , -1.1103 , 0.50271 , -0.43052 , 0.37468 , -0.3055 ,\n"," 0.36708 , 0.25938 , -0.16993 , 0.54245 , 0.63919 , 0.11347 ,\n"," -0.3919 , 0.31521 , -0.42901 , 0.49977 , -0.2376 , -0.79307 ,\n"," 0.34494 , -0.47877 , -0.51945 , -0.50665 , 0.057701, -0.31797 ,\n"," -0.080134, -1.0289 , -0.1507 , 0.50944 , 0.60715 , 1.3049 ,\n"," 3.2575 , 0.11849 , 1.5057 , -0.36649 , -0.17726 , -0.20931 ,\n"," -0.59527 , -0.025889, -0.2965 , -1.1387 , -0.52999 , 0.067286,\n"," 0.094954, 0.049722, 0.51323 , -0.11194 , -0.007111, 0.23775 ,\n"," 0.68874 , 0.13873 ], dtype=float32)"]},"metadata":{},"execution_count":30}],"source":["embeddings_index.get('weak')"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HjHnVLJcYCwK","outputId":"760fea9f-cf0f-499f-c88a-6a00a7becf54"},"outputs":[{"output_type":"stream","name":"stdout","text":["work\n","1\n","done\n","2\n","good\n","3\n","effort\n","4\n","poor\n","5\n","well\n","6\n","great\n","7\n","nice\n","8\n","excellent\n","9\n","weak\n","10\n","not\n","11\n","could\n","12\n","have\n","13\n","better\n","14\n"]}],"source":["# create a weight matrix for words in training docs\n","embedding_matrix = zeros((vocab_size, 50))\n","\n","for word, i in t.word_index.items():\n"," print(word)\n"," print(i)\n"," embedding_vector = embeddings_index.get(word)\n"," if embedding_vector is not None:\n"," embedding_matrix[i] = embedding_vector"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"By6hPBDMYCwL","outputId":"07bced95-e3a4-40be-bf4b-816312e3d2e9"},"outputs":[{"output_type":"stream","name":"stdout","text":["[ 5.13589978e-01 1.96950004e-01 -5.19439995e-01 -8.62179995e-01\n"," 1.54940002e-02 1.09729998e-01 -8.02929997e-01 -3.33609998e-01\n"," -1.61189993e-04 1.01889996e-02 4.67340015e-02 4.67510015e-01\n"," -4.74750012e-01 1.10380001e-01 3.93269986e-01 -4.36520010e-01\n"," 3.99839997e-01 2.71090001e-01 4.26499993e-01 -6.06400013e-01\n"," 8.11450005e-01 4.56299990e-01 -1.27260000e-01 -2.24739999e-01\n"," 6.40709996e-01 -1.27670002e+00 -7.22310007e-01 -6.95900023e-01\n"," 2.80450005e-02 -2.30719998e-01 3.79959989e+00 -1.26249999e-01\n"," -4.79669988e-01 -9.99719977e-01 -2.19760001e-01 5.05649984e-01\n"," 2.59530004e-02 8.05140018e-01 1.99290007e-01 2.87959993e-01\n"," -1.59150004e-01 -3.04380000e-01 1.60249993e-01 -1.82899997e-01\n"," -3.85629982e-02 -1.76190004e-01 2.70409994e-02 4.68420014e-02\n"," -6.28970027e-01 3.57259989e-01]\n"]}],"source":["print(embedding_matrix[1])"]},{"cell_type":"code","source":["embeddings_index.get('work')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FNvOKLJqy_v_","outputId":"03ac72f4-320d-4c62-da1e-4d505ccd0484"},"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([ 5.1359e-01, 1.9695e-01, -5.1944e-01, -8.6218e-01, 1.5494e-02,\n"," 1.0973e-01, -8.0293e-01, -3.3361e-01, -1.6119e-04, 1.0189e-02,\n"," 4.6734e-02, 4.6751e-01, -4.7475e-01, 1.1038e-01, 3.9327e-01,\n"," -4.3652e-01, 3.9984e-01, 2.7109e-01, 4.2650e-01, -6.0640e-01,\n"," 8.1145e-01, 4.5630e-01, -1.2726e-01, -2.2474e-01, 6.4071e-01,\n"," -1.2767e+00, -7.2231e-01, -6.9590e-01, 2.8045e-02, -2.3072e-01,\n"," 3.7996e+00, -1.2625e-01, -4.7967e-01, -9.9972e-01, -2.1976e-01,\n"," 5.0565e-01, 2.5953e-02, 8.0514e-01, 1.9929e-01, 2.8796e-01,\n"," -1.5915e-01, -3.0438e-01, 1.6025e-01, -1.8290e-01, -3.8563e-02,\n"," -1.7619e-01, 2.7041e-02, 4.6842e-02, -6.2897e-01, 3.5726e-01],\n"," dtype=float32)"]},"metadata":{},"execution_count":32}]},{"cell_type":"markdown","metadata":{"id":"5qFg-HJpYCwL"},"source":["Define our model, fit, and evaluate it as before.\n","\n","The key difference is that the embedding layer can be seeded with the GloVe word embedding weights. \n","\n"," We chose the 50-dimensional version, therefore the Embedding layer must be defined with output_dim set to 50. \n"," We do not want to update the learned word weights in this model, therefore we will set the trainable attribute for the model to be False."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tj3QmKsjYCwL"},"outputs":[],"source":["# define model\n","model = Sequential()\n","model.add(Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=4, trainable=False))\n","model.add(Flatten())\n","model.add(Dense(1, activation='sigmoid'))"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"of68--HRYCwL"},"outputs":[],"source":["# compile the model\n","model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"CELGTfchYCwL","outputId":"5ad7c9cb-aea7-4785-caab-58243fda727d"},"outputs":[{"output_type":"stream","name":"stdout","text":["Model: \"sequential_1\"\n","_________________________________________________________________\n"," Layer (type) Output Shape Param # \n","=================================================================\n"," embedding_1 (Embedding) (None, 4, 50) 750 \n"," \n"," flatten_1 (Flatten) (None, 200) 0 \n"," \n"," dense_1 (Dense) (None, 1) 201 \n"," \n","=================================================================\n","Total params: 951\n","Trainable params: 201\n","Non-trainable params: 750\n","_________________________________________________________________\n","None\n"]}],"source":["# summarize the model\n","print(model.summary())"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9Yw51AgSYCwM","outputId":"e11d8d10-d0d2-4738-98cd-9764b25f0322"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["<keras.callbacks.History at 0x7fbf39a3b5d0>"]},"metadata":{},"execution_count":37}],"source":["# fit the model\n","model.fit(padded_docs, labels, epochs=500, verbose=0)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BZP-lOVqYCwM","outputId":"876d9045-bca5-47e6-a2f4-992235d359ef"},"outputs":[{"output_type":"stream","name":"stdout","text":["Accuracy: 100.000000\n"]}],"source":["# evaluate the model\n","loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)\n","\n","print('Accuracy: %f' % (accuracy*100))"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"6priQt5_YCwM","colab":{"base_uri":"https://localhost:8080/"},"outputId":"6718f6a9-4d24-4a8d-f912-916b32928207"},"outputs":[{"output_type":"stream","name":"stdout","text":["1/1 [==============================] - 0s 81ms/step\n"]}],"source":["predict_label=model.predict(padded_docs)"]},{"cell_type":"code","source":["predict_label = np.round(predict_label).astype(int)"],"metadata":{"id":"6MLcLmfB6GIW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["predict_label\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BjDy_QE9675H","outputId":"5459e983-f637-4508-f4ca-8e73fc328078"},"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[1],\n"," [1],\n"," [1],\n"," [1],\n"," [1],\n"," [0],\n"," [0],\n"," [0],\n"," [0],\n"," [0]])"]},"metadata":{},"execution_count":41}]},{"cell_type":"markdown","metadata":{"id":"5etQMXOPYCwN"},"source":["__References:__\n","\n"," https://keras.io\n"," https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer\n","\n"," https://machinelearningmastery.com\n"," \n"," https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html "]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"colab":{"provenance":[],"collapsed_sections":[]},"gpuClass":"standard"},"nbformat":4,"nbformat_minor":0}