Igor Santana committed
Commit: 9c58361
1 Parent(s): c9be590

RNN model sent from GitHub to Hugging Face

.editorconfig ADDED
@@ -0,0 +1,12 @@
+ # top-most EditorConfig file
+ root = true
+
+ # Unix-style newlines with a newline ending every file
+ [*]
+ end_of_line = lf
+ insert_final_newline = true
+
+ # 4-wide tab indentation for Python files
+ [*.py]
+ indent_style = tab
+ indent_size = 4
.gitignore ADDED
@@ -0,0 +1,12 @@
+ dataset/*
+ tmp/*
+ **/*.pyc
+ **/*.cpython-37.pyc
+ .ipynb_checkpoints/*
+ .history
+ .vscode
+ tmp
+ project/data/__pycache__/*.pyc
+ project/evaluation/__pycache__/*.pyc
+ project/recsys/__pycache__/*.pyc
+ project/__pycache__/*.pyc
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2020 Igor André P. Santana
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,83 @@
  ---
- license: unlicense
- ---
+ # RNN Embeddings
+
+ ## Jointly learning music embeddings with Recurrent Neural Networks
+
+ This repository contains all the code I wrote during my master's at the State University of Maringá. I do not intend to add new features to this project, as I will not be continuing it in a PhD. To better understand the goal of this project, here is a quote from my thesis that summarizes what I did:
+
+ > This work's goal is to use Recurrent Neural Networks to acquire contextual information for each song, given the sequence of songs that each user has listened to using embeddings.
+
+
+ If you have any doubts about the code, or want to use it in your project, let me know! I will be glad to help you with anything you need.
+
+ ### Installation and Setup
+
+ As this code was written in Python, I highly recommend using [conda](https://docs.conda.io/en/latest/) to install all the dependencies you'll need to run it. I have provided the [environment file](environment.yml) that I ended up with; to create the environment from this file, run the following command (assuming you already have conda):
+
+ ```
+ conda env create -f environment.yml
+ ```
+
+ It is important to know that I used TensorFlow 1.14.0, CUDA 9.2, and Python 3.6.9 to run the experiments. If you cannot run the project with the environment file I have provided, it is probably because of one of those versions.
+
+ ### Directory Structure and General Instructions
+
+ ```
+ .
+ |-- analysis
+ |-- configs
+ |-- dataset
+ |   |-- dataset #1
+ |   |-- dataset #2
+ |   `-- ...
+ |-- outputs
+ |-- project
+ |   |-- data
+ |   |-- evaluation
+ |   |-- models
+ |   `-- recsys
+ |-- tmp
+ ```
+
+ This project relies on this directory structure in order to work. The main Python files are in the **project** folder, and any changes you want to make to the code should be made to the files in this folder. The **outputs** folder will contain the output files for the models you build.
+
+ The **dataset** folder contains all the datasets you'll use in the project; for each dataset, you should create a separate folder inside the **dataset** folder. The project will then look for a `listening_history.csv` file inside that folder. This file **must be** comma-separated (a minimal sketch of its expected columns follows this README diff).
+
+ A temporary folder, **tmp**, will be created while the project runs. For each dataset you run this project with, a folder will be created inside the **tmp** folder. There you can find the cross-validation folds, the models you built, and the individual recommendations for each user, as well as some auxiliary matrices used in the UserKNN algorithm.
+
+ I have also included an **analysis** folder that I used to create some graphs with the results. You just have to point the `main.py` file in the analysis folder to where the results are, and it will show a graphical comparison between the models across all the metrics.
+
+ The project will only work if you provide a configuration file. In my case, I stored my configuration files in the **configs** folder, but feel free to delete the folder if you don't want it. The configuration file contains the parameters for the models, and I don't recommend deleting any parameter even if you are not going to use it. I've included a [sample configuration](configs/config_sample.yml) file that you can use as a guideline for your project.
+
+
+ To run the project, pass the configuration file to `main.py` as a parameter.
+
+ ```
+ $ python main.py --config=configs/config_sample.yml
+ ```
+
+
+ ###### DISCLAIMER:
+
+ The `model` and `bi` parameters in the `models/rnn` configuration object are not working, as I hardcoded them in my project. If you want to change the layer (to a GRU or a Simple RNN), you should do it [directly in the code](project/models/rnn.py#L147).
+
+
+ ### What is included in this project?
+
+ To better understand the project, I highly recommend checking the work that I used as a baseline for my model:
+
+ - [link](https://doi.org/10.1007/s10791-017-9317-7) - Wang, D., Deng, S. & Xu, G. Sequence-based context-aware music recommendation. Information Retrieval Journal (2018)
+
+ Their work, *music2vec*, is one of the baselines for my RNN model. The following embeddings are implemented in this project:
+
+ - music2vec
+ - doc2vec - [link](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
+ - GloVe - [link](https://nlp.stanford.edu/projects/glove/)
+
+ To evaluate these embedding models, the implemented context-aware recommender systems (CARS) are the ones proposed by Wang et al. (M-TN, SM-TN, CSM-TN, CSM-UK). Besides the metrics used in the paper, I have included MAP, NDCG@5, and Precision@5 as well. The cutoff of these metrics is not configurable, sorry.
+
+
+
+
  ---
+
+ If you have any doubts about this project, feel free to contact me!
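
To make the expected input concrete: based on the columns that `project/data/preprocess.py` reads from `listening_history.csv` (`user`, `song`, `timestamp`, with rows ordered by user and time, since the sessionizer uses consecutive time differences), a minimal sketch follows. All identifiers and timestamps below are made up.

```python
# Sketch of a minimal dataset/sample/listening_history.csv, assuming the
# three columns that sessionize_user() in project/data/preprocess.py reads:
# user, song, timestamp. The identifiers below are hypothetical.
import pandas as pd

rows = [
    ('u1', 'song_a', '2020-01-01 10:00:00'),
    ('u1', 'song_b', '2020-01-01 10:03:00'),
    ('u1', 'song_c', '2020-01-01 11:30:00'),  # >30 min gap: a new session
    ('u2', 'song_a', '2020-01-02 09:00:00'),
]
df = pd.DataFrame(rows, columns=['user', 'song', 'timestamp'])
df.to_csv('dataset/sample/listening_history.csv', index=False)  # comma-separated
```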
analysis/main.py ADDED
@@ -0,0 +1,70 @@
+ import pandas as pd
+ import numpy as np
+ import seaborn as sns
+ import matplotlib.pyplot as plt
+
+ sns.set(font_scale=1, style='whitegrid', context='paper')
+ colors = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71", '#f1c40f']
+ palette = sns.color_palette(colors)
+
+ df = pd.read_csv('data/xiami.csv', sep='\t')
+ df['id'] = df.params
+
+ # Average precision, recall and F1 per parameter setup, for each algorithm
+ mtn = df[df.algo == 'm2vTN'][['id', 'prec', 'rec', 'f1']]
+ mtn = pd.DataFrame(mtn.groupby(by='id').mean())
+ mtn['id'] = mtn.index
+
+ smtn = df[df.algo == 'sm2vTN'][['id', 'prec', 'rec', 'f1']]
+ smtn = pd.DataFrame(smtn.groupby(by='id').mean())
+ smtn['id'] = smtn.index
+
+ csmtn = df[df.algo == 'csm2vTN'][['id', 'prec', 'rec', 'f1']]
+ csmtn = pd.DataFrame(csmtn.groupby(by='id').mean())
+ csmtn['id'] = csmtn.index
+
+ csmuk = df[df.algo == 'csm2vUK'][['id', 'prec', 'rec', 'f1']]
+ csmuk = pd.DataFrame(csmuk.groupby(by='id').mean())
+ csmuk['id'] = csmuk.index
+
+ mtn.sort_index(ascending=False, inplace=True)
+ smtn.sort_index(ascending=False, inplace=True)
+ csmtn.sort_index(ascending=False, inplace=True)
+ csmuk.sort_index(ascending=False, inplace=True)
+
+ melt_mtn = pd.melt(mtn, id_vars='id')
+ melt_smtn = pd.melt(smtn, id_vars='id')
+ melt_csmtn = pd.melt(csmtn, id_vars='id')
+ melt_csmuk = pd.melt(csmuk, id_vars='id')
+
+ fig, axes = plt.subplots(2, 2, figsize=(25, 25))
+
+ a1 = sns.catplot(x='variable', y='value', hue='id', data=melt_mtn, kind='bar', palette=palette, ax=axes[0][0])
+ a2 = sns.catplot(x='variable', y='value', hue='id', data=melt_smtn, kind='bar', palette=palette, ax=axes[0][1])
+ a3 = sns.catplot(x='variable', y='value', hue='id', data=melt_csmtn, kind='bar', palette=palette, ax=axes[1][0])
+ a4 = sns.catplot(x='variable', y='value', hue='id', data=melt_csmuk, kind='bar', palette=palette, ax=axes[1][1])
+
+ # catplot is figure-level and opens its own figures; close the extra ones
+ plt.close(2)
+ plt.close(3)
+ plt.close(4)
+ plt.close(5)
+
+ titles = ['M-TN', 'SM-TN', 'CSM-TN', 'CSM-UK']
+
+ last = axes.flatten()[-1]
+ handles, labels = last.get_legend_handles_labels()
+ fig.legend(handles, labels, loc='upper left')
+
+ i = 0
+ for ax in axes.flatten():
+     ax.get_legend().remove()
+     ax.set(yticks=np.arange(0, 0.21, 0.025))
+     ax.set(xlabel='Metrics Used', ylabel='Value')
+     ax.set(title=titles[i])
+     i += 1
+
+
+ plt.subplots_adjust(hspace=0.4)
+ fig.suptitle('Metrics', fontsize=18, y=.98)
+ plt.show()
configs/config_sample.yml ADDED
@@ -0,0 +1,46 @@
+ models:
+   rnn:
+     embedding_dim: [256]
+     batch: 64
+     epochs: [50]
+     model: ['LSTM']
+     window: [3]
+     bi: [False]
+     num_units: [512]
+   music2vec:
+     window: [5]
+     epochs: [5]
+     down_sample: [1e-3]
+     learning_rate: [0.025]
+     embedding_dim: [300]
+     negative_sample: [20]
+   doc2vec:
+     window: [10]
+     epochs: [10]
+     down_sample: [1e-4]
+     learning_rate: [0.025]
+     embedding_dim: [50]
+     negative_sample: [10]
+   glove:
+     window: [10]
+     embedding_dim: [100]
+     epochs: [15]
+     learning_rate: [0.025]
+ session:
+   interval: 30
+ evaluation:
+   dataset: 'sample'
+   cross-validation: 5
+   k: 5
+   topN: 5
+ results:
+   full: 'outputs/sample.csv'
+ embeddings:
+   music2vec:
+     usage: True
+   doc2vec:
+     usage: False
+   glove:
+     usage: False
+   rnn:
+     usage: False
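
For reference, here is how the pipeline consumes this file — a minimal sketch mirroring `main.py` and `project/data/preprocess.py`, using only keys that appear in the sample above:

```python
# Minimal sketch of reading the configuration the way main.py does.
# Only keys present in config_sample.yml are accessed here.
import yaml

with open('configs/config_sample.yml') as f:
    conf = yaml.safe_load(f)

ds = conf['evaluation']['dataset']      # e.g. 'sample'
interval = conf['session']['interval']  # session gap in minutes, e.g. 30
rnn_params = conf['models']['rnn']      # lists of values to grid over
print(ds, interval, rnn_params['embedding_dim'])
```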
environment.yml ADDED
@@ -0,0 +1,129 @@
+ name: rnn-embeddings
+ channels:
+   - anaconda
+   - defaults
+ dependencies:
+   - _libgcc_mutex=0.1=main
+   - _tflow_select=2.1.0=gpu
+   - absl-py=0.8.1=py36_0
+   - astor=0.8.0=py36_0
+   - astroid=2.3.3=py36_0
+   - blas=1.0=mkl
+   - c-ares=1.15.0=h7b6447c_1001
+   - ca-certificates=2019.11.27=0
+   - cairo=1.14.12=h8948797_3
+   - certifi=2019.11.28=py36_0
+   - cudatoolkit=9.2=0
+   - cudnn=7.6.4=cuda9.2_0
+   - cupti=9.2.148=0
+   - cycler=0.10.0=py36_0
+   - dbus=1.13.12=h746ee38_0
+   - expat=2.2.6=he6710b0_0
+   - fontconfig=2.13.0=h9420a91_0
+   - freetype=2.9.1=h8a8886c_1
+   - fribidi=1.0.5=h7b6447c_0
+   - gast=0.3.2=py_0
+   - glib=2.63.1=h5a9c865_0
+   - google-pasta=0.1.8=py_0
+   - graphite2=1.3.13=h23475e2_0
+   - graphviz=2.40.1=h21bd128_2
+   - grpcio=1.16.1=py36hf8bcb03_1
+   - gst-plugins-base=1.14.0=hbbd80ab_1
+   - gstreamer=1.14.0=hb453b48_1
+   - h5py=2.9.0=py36h7918eee_0
+   - harfbuzz=1.8.8=hffaf4a1_0
+   - hdf5=1.10.4=hb1b8bf9_0
+   - icu=58.2=h9c2bf20_1
+   - intel-openmp=2019.4=243
+   - isort=4.3.21=py36_0
+   - joblib=0.14.0=py_0
+   - jpeg=9b=h024ee3a_2
+   - keras=2.2.4=0
+   - keras-applications=1.0.8=py_0
+   - keras-base=2.2.4=py36_0
+   - keras-preprocessing=1.1.0=py_1
+   - kiwisolver=1.1.0=py36he6710b0_0
+   - lazy-object-proxy=1.4.3=py36h7b6447c_0
+   - libedit=3.1.20181209=hc058e9b_0
+   - libffi=3.2.1=hd88cf55_4
+   - libgcc-ng=9.1.0=hdf63c60_0
+   - libgfortran-ng=7.3.0=hdf63c60_0
+   - libpng=1.6.37=hbc83047_0
+   - libprotobuf=3.10.1=hd408876_0
+   - libstdcxx-ng=9.1.0=hdf63c60_0
+   - libtiff=4.1.0=h2733197_0
+   - libuuid=1.0.3=h1bed415_2
+   - libxcb=1.13=h1bed415_1
+   - libxml2=2.9.9=hea5a465_1
+   - markdown=3.1.1=py36_0
+   - matplotlib=3.1.1=py36h5429711_0
+   - mccabe=0.6.1=py36_1
+   - mkl=2019.4=243
+   - mkl-service=2.3.0=py36he904b0f_0
+   - mkl_fft=1.0.15=py36ha843d7b_0
+   - mkl_random=1.1.0=py36hd6b4f25_0
+   - mock=3.0.5=py36_0
+   - ncurses=6.1=he6710b0_1
+   - openssl=1.1.1=h7b6447c_0
+   - pandas=0.25.3=py36he6710b0_0
+   - pango=1.42.4=h049681c_0
+   - patsy=0.5.1=py36_0
+   - pcre=8.43=he6710b0_0
+   - pip=19.3.1=py36_0
+   - pixman=0.38.0=h7b6447c_0
+   - protobuf=3.10.1=py36he6710b0_0
+   - pylint=2.4.4=py36_0
+   - pyparsing=2.4.5=py_0
+   - pyqt=5.9.2=py36h05f1152_2
+   - python=3.6.9=h265db76_0
+   - pytz=2019.3=py_0
+   - qt=5.9.7=h5867ecd_1
+   - readline=7.0=h7b6447c_5
+   - scikit-learn=0.21.3=py36hd81dba3_0
+   - scipy=1.3.1=py36h7c811a0_0
+   - seaborn=0.9.0=pyh91ea838_1
+   - setuptools=42.0.2=py36_0
+   - sip=4.19.8=py36hf484d3e_0
+   - six=1.13.0=py36_0
+   - sqlite=3.30.1=h7b6447c_0
+   - statsmodels=0.10.1=py36hdd07704_0
+   - tensorboard=1.14.0=py36hf484d3e_0
+   - tensorflow=1.14.0=gpu_py36hfc5689a_0
+   - tensorflow-base=1.14.0=gpu_py36h611c6d2_0
+   - tensorflow-estimator=1.14.0=py_0
+   - tensorflow-gpu=1.14.0=h0d30ee6_0
+   - termcolor=1.1.0=py36_1
+   - tk=8.6.8=hbc83047_0
+   - tornado=6.0.3=py36h7b6447c_0
+   - typed-ast=1.4.0=py36h7b6447c_0
+   - werkzeug=0.16.0=py_0
+   - wheel=0.33.6=py36_0
+   - wrapt=1.11.2=py36h7b6447c_0
+   - xz=5.2.4=h14c3975_4
+   - yaml=0.1.7=had09818_2
+   - zlib=1.2.11=h7b6447c_3
+   - zstd=1.3.7=h0b5b093_0
+   - pip:
+     - bilm==0.1.post5
+     - blessings==1.7
+     - boto==2.49.0
+     - boto3==1.10.33
+     - botocore==1.13.33
+     - chardet==3.0.4
+     - docutils==0.15.2
+     - gensim==3.8.1
+     - glove-python==0.1.0
+     - gpustat==0.6.0
+     - idna==2.8
+     - jmespath==0.9.4
+     - ml-metrics==0.1.4
+     - numpy==1.16.4
+     - nvidia-ml-py3==7.352.0
+     - psutil==5.6.7
+     - pydot==1.4.1
+     - python-dateutil==2.8.0
+     - pyyaml>=3.11, <6.0
+     - requests==2.22.0
+     - s3transfer==0.2.1
+     - smart-open==1.9.0
+     - urllib3==1.25.7
main.py ADDED
@@ -0,0 +1,39 @@
+ import re
+ import os
+ import yaml
+ import pickle
+ import argparse
+ import pandas as pd
+ import numpy as np
+ import multiprocessing as mp
+ import project.evaluation.run as r
+ from os.path import exists
+ from datetime import datetime
+ from project.data.preprocess import preprocess, remove_sessions
+ from project.models.embeddings import embeddings
+ from project.evaluation.run import cross_validation
+
+
+ if __name__ == '__main__':
+
+     os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
+
+     parser = argparse.ArgumentParser(description='RNN Embeddings')
+     parser.add_argument('--config', help='Configuration file', type=str)
+     args = parser.parse_args()
+     conf = yaml.safe_load(open(args.config))
+
+     print('The configuration file "%s" was read.' % args.config)
+     print('Pre-process started for dataset "%s"' % conf['evaluation']['dataset'])
+
+     preprocess(conf)
+
+     ds = conf['evaluation']['dataset']
+     df = pd.read_csv('dataset/{}/session_listening_history.csv'.format(ds), sep=',')
+
+     emb_path = 'tmp/{}/models/ids.npy'.format(ds)
+
+     # Train the embedding models only once; reuse the saved setup ids afterwards
+     if not exists(emb_path):
+         embeddings(df, conf)
+     ids = np.load(emb_path)
+     cross_validation(df, conf, ids)
project/__init__.py ADDED
File without changes
project/data/preparation.py ADDED
@@ -0,0 +1,87 @@
+
+ import pandas as pd
+ import random
+ import numpy as np
+ import pickle
+ from os import makedirs
+ from os.path import exists
+ from gensim.models import Word2Vec, Doc2Vec
+ from glove import Glove
+ from sklearn.model_selection import KFold
+
+ def _rnn_load(path, songs):
+     data = pickle.load(open(path, 'rb'))
+     emb_dict = {}
+     for song in songs:
+         emb_dict[song] = data[song]
+     return emb_dict
+
+ def __w2v_load(path, songs):
+     wv = Word2Vec.load(path).wv
+     emb_dict = {}
+     for song in songs:
+         emb_dict[song] = wv[song]
+     return emb_dict
+
+ def __g_load(path, songs):
+     glove = Glove.load(path)
+     emb_dict = {}
+     for song in songs:
+         emb_dict[song] = glove.word_vectors[glove.dictionary[song]]
+     return emb_dict
+
+ def __load_exp(path, songs):
+     data = pickle.load(open(path, 'rb'))
+     return data
+
+
+ def get_embeddings(path, songs):
+     # The session-level model sits next to the user-level one, with an 's' prefix
+     path_arr = path.split('/')
+     session_file = '/'.join(path_arr[:-1] + ['s' + path_arr[-1]])
+     user_file = path
+
+     # Dispatch on the model type encoded in the file path
+     if 'experiments' in path:
+         return __load_exp(user_file, songs), __load_exp(session_file, songs)
+     if 'glove' in path:
+         return __g_load(user_file, songs), __g_load(session_file, songs)
+     if 'music2vec' in path:
+         return __w2v_load(user_file, songs), __w2v_load(session_file, songs)
+     if 'doc2vec' in path:
+         return __w2v_load(user_file, songs), __w2v_load(session_file, songs)
+     if 'rnn' in path:
+         return _rnn_load(user_file, songs), _rnn_load(session_file, songs)
+     return {}, {}
+
+ def prepare_data(df, conf):
+     ds = conf['evaluation']['dataset']
+     path_kfold = 'tmp/{}/kfold/'.format(ds)
+     # Reuse previously generated folds if they exist
+     if exists(path_kfold):
+         kfold = []
+         for i in range(0, conf['evaluation']['k']):
+             j = i + 1
+             train = pd.read_pickle(path_kfold + 'train_{}.pkl'.format(j))
+             test = pd.read_pickle(path_kfold + 'test_{}.pkl'.format(j))
+             kfold.append((train, test))
+         return kfold
+     makedirs('tmp/{}/kfold/'.format(ds))
+     sessions = df.groupby('session')['song'].apply(lambda x: x.tolist())
+     users = df.groupby('user').agg(list)
+     users['history'] = users['session'].apply(lambda x: [sessions[session] for session in list(set(x))])
+     users = users.drop(['song', 'timestamp', 'session'], axis=1)
+     unique_users = df.user.unique()
+     # Split by user, so a user's whole history lands in a single fold
+     kf = KFold(n_splits=conf['evaluation']['k'], shuffle=True)
+     i = 1
+     kfold = []
+     for train, test in kf.split(unique_users):
+         train_df = users[users.index.isin(unique_users[train])]
+         test_df = users[users.index.isin(unique_users[test])]
+         train_df.to_pickle('tmp/{}/kfold/train_{}.pkl'.format(ds, i))
+         test_df.to_pickle('tmp/{}/kfold/test_{}.pkl'.format(ds, i))
+         kfold.append((train_df, test_df))
+         i += 1
+     return kfold
project/data/preprocess.py ADDED
@@ -0,0 +1,82 @@
+ from os import path
+ import csv
+ import math
+ import json
+ import yaml
+ import numpy as np
+ import pandas as pd
+ import multiprocessing as mp
+ from datetime import datetime, timedelta
+
+ def remove_sessions(df, leq=1):
+     # Drop sessions with `leq` songs or fewer
+     group = df.groupby(by='session').agg(list)
+     group = group['song'].apply(len)
+     to_stay = group[group > leq].index.values
+     return df[df.session.isin(to_stay)]
+
+
+ def sessionize_user(ds, session_time, s_path):
+     # A gap of `session_time` minutes (or a change of user) starts a new session
+     df = pd.read_csv('dataset/{}/listening_history.csv'.format(ds), sep=',')
+     df['timestamp'] = df['timestamp'].astype('datetime64')
+     df['dif'] = df['timestamp'].diff()
+     df['session'] = df.apply(lambda x: 'NEW_SESSION' if x.dif >= timedelta(minutes=session_time) else 'SAME_SESSION', axis=1)
+     s_no = 0
+     l_u = ''
+     f = open(s_path, 'w+')
+     print(','.join(['user', 'song', 'timestamp', 'session']), file=f)
+     print('Sessionized "%s" data file: %s' % (ds, s_path))
+     for row in df.values:
+         if s_no == 0:
+             l_u = row[0]
+         if (row[4] == 'NEW_SESSION' and l_u == row[0]) or (l_u != row[0]):
+             s_no += 1
+         row[3] = 's{}'.format(s_no)
+         l_u = row[0]
+         row[2] = str(row[2])
+         print(','.join(row[:-1]), file=f)
+
+ def gen_seq_files(df, pwd, window_size):
+     c_sessions = df.groupby('session')['song'].agg(list)
+     u_sessions = df.groupby('user')['song'].agg(list)
+     num_w = window_size // 2
+     fc = open(pwd + 'c_seqs.csv', 'w+')
+     fu = open(pwd + 'u_seqs.csv', 'w+')
+     dict_song = {}
+     for session in c_sessions:
+         for ix in range(len(session)):
+             # Context window around position ix, padded with '-' at the edges
+             b4 = list(range(ix - num_w, ix))
+             af = list(range(ix + 1, ix + num_w + 1))
+             b4 = [session[i] if i >= 0 else '-' for i in b4]
+             af = [session[i] if i < len(session) else '-' for i in af]
+             if session[ix] not in dict_song:
+                 dict_song[session[ix]] = []
+             dict_song[session[ix]].append(b4 + [session[ix]] + af)
+     for song, values in dict_song.items():
+         for seq in values:
+             print(song + '\t' + '{}'.format(seq), file=fc)
+
+     dict_song = {}
+     for session in u_sessions:
+         for ix in range(len(session)):
+             b4 = list(range(ix - num_w, ix))
+             af = list(range(ix + 1, ix + num_w + 1))
+             b4 = [session[i] if i >= 0 else '-' for i in b4]
+             af = [session[i] if i < len(session) else '-' for i in af]
+             if session[ix] not in dict_song:
+                 dict_song[session[ix]] = []
+             dict_song[session[ix]].append(b4 + [session[ix]] + af)
+     for song, values in dict_song.items():
+         for seq in values:
+             print(song + '\t' + '{}'.format(seq), file=fu)
+
+
+ def preprocess(conf):
+     ds = conf['evaluation']['dataset']
+     interval = conf['session']['interval']
+     if path.exists('dataset/{}/session_listening_history.csv'.format(ds)):
+         print('The "%s" dataset is already sessionized' % ds)
+         return
+     print('Started to sessionize dataset "%s"' % ds)
+     sessionize_user(ds, interval, 'dataset/{}/session_listening_history.csv'.format(ds))
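
To make the session rule concrete, here is a toy trace of the logic in `sessionize_user` (timestamps and song names are made up):

```python
# Toy illustration of the session rule used in sessionize_user(): a gap of
# `interval` minutes or more between consecutive plays starts a new session.
from datetime import datetime, timedelta

plays = [
    ('u1', 'song_a', datetime(2020, 1, 1, 10, 0)),
    ('u1', 'song_b', datetime(2020, 1, 1, 10, 3)),   # 3 min gap  -> same session
    ('u1', 'song_c', datetime(2020, 1, 1, 11, 30)),  # 87 min gap -> new session
]
interval = timedelta(minutes=30)
session = 1
prev = plays[0][2]
for user, song, ts in plays:
    if ts - prev >= interval:
        session += 1
    prev = ts
    print(user, song, 's{}'.format(session))
# -> u1 song_a s1 / u1 song_b s1 / u1 song_c s2
```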
project/evaluation/ResultReport.py ADDED
@@ -0,0 +1,27 @@
+ import pandas as pd
+ import numpy as np
+
+
+ class Results():
+     def __init__(self, setups, k):
+         self.metrics = {}
+         self.k = k
+         self.final_df = pd.DataFrame()
+
+     def fold_results(self, params, m2vTN, sm2vTN, csm2vTN, csm2vUK, fold):
+         # One row per algorithm, all sharing the same parameter setup and fold
+         metrics = np.vstack([m2vTN, sm2vTN, csm2vTN, csm2vUK])
+         print()
+         data = {
+             'params': [params] * 4,
+             'algo': ['m2vTN', 'sm2vTN', 'csm2vTN', 'csm2vUK'],
+             'folds': [fold] * 4,
+             'prec': metrics[:, 0],
+             'rec': metrics[:, 1],
+             'f1': metrics[:, 2],
+             'map': metrics[:, 3],
+             'ndcg@5': metrics[:, 4],
+             'p@5': metrics[:, 5]
+         }
+         df = pd.DataFrame(data)
+         return df
project/evaluation/metrics.py ADDED
@@ -0,0 +1,30 @@
+ from project.evaluation.ranking_metrics import mean_average_precision, ndcg_at, precision_at
+
+ def __Prec(topn, test):
+     num_intersect = len(set.intersection(set(topn), set(test)))
+     num_rec = len(topn)
+     return num_intersect / num_rec
+
+ def __Rec(topn, test):
+     num_intersect = len(set.intersection(set(topn), set(test)))
+     num_test = len(list(set(test)))
+     return num_intersect / num_test
+
+ def Hitrate(topn, test):
+     num_intersect = len([value for value in list(set(test)) if value in topn])
+     num_rec = len(topn)
+     return num_intersect / num_rec
+
+ def __F1(prec, rec):
+     return (2 * ((prec * rec) / (prec + rec))) if (prec + rec) > 0 else 0
+
+
+ def get_metrics(topn, test):
+     prec = __Prec(topn, test)
+     rec = __Rec(topn, test)
+     f = __F1(prec, rec)
+     # ranking_metrics expects predictions first, then labels
+     MAP = mean_average_precision([topn], [test], assume_unique=False)
+     ndcg_5 = ndcg_at([topn], [test], k=5, assume_unique=False)
+     p_5 = precision_at([topn], [test], k=5, assume_unique=False)
+
+     return [prec, rec, f, MAP, ndcg_5, p_5]
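
As a sanity check, here is a small worked example of the set-based metrics above (the song ids are made up):

```python
# Worked example of the set-based metrics in get_metrics().
# topn is the recommended list, test the ground truth; ids are hypothetical.
topn = ['a', 'b', 'c', 'd', 'e']
test = ['b', 'e', 'f']

inter = set(topn) & set(test)        # {'b', 'e'}
prec = len(inter) / len(topn)        # 2 / 5 = 0.4
rec = len(inter) / len(set(test))    # 2 / 3 ~= 0.667
f1 = 2 * prec * rec / (prec + rec)   # = 0.5
print(prec, rec, f1)
```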
project/evaluation/ranking_metrics.py ADDED
@@ -0,0 +1,240 @@
+ # -*- coding: utf-8 -*-
+ #
+ # Author: Taylor G Smith
+ #
+ # Recommender system ranking metrics derived from Spark source for use with
+ # Python-based recommender libraries (i.e., implicit,
+ # http://github.com/benfred/implicit/). These metrics are derived from the
+ # original Spark Scala source code for recommender metrics.
+ # https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
+
+ import numpy as np
+
+ import warnings
+
+ __all__ = [
+     'mean_average_precision',
+     'ndcg_at',
+     'precision_at',
+ ]
+
+ def _require_positive_k(k):
+     """Helper function to avoid copy/pasted code for validating K"""
+     if k <= 0:
+         raise ValueError("ranking position k should be positive")
+
+
+ def _mean_ranking_metric(predictions, labels, metric):
+     """Helper function for precision_at_k and mean_average_precision"""
+     # do not zip, as this will require an extra pass of O(N). Just assert
+     # equal length and index (compute in ONE pass of O(N)).
+     # if len(predictions) != len(labels):
+     #     raise ValueError("dim mismatch in predictions and labels!")
+     # return np.mean([
+     #     metric(np.asarray(predictions[i]), np.asarray(labels[i]))
+     #     for i in xrange(len(predictions))
+     # ])
+
+     # Actually probably want lazy evaluation in case preds is a
+     # generator, since preds can be very dense and could blow up
+     # memory... but how to assert lengths equal? FIXME
+     return np.mean([
+         metric(np.asarray(prd), np.asarray(labels[i]))
+         for i, prd in enumerate(predictions)  # lazy eval if generator
+     ])
+
+
+ def _warn_for_empty_labels():
+     """Helper for missing ground truth sets"""
+     warnings.warn("Empty ground truth set! Check input data")
+     return 0.
+
+
+ def precision_at(predictions, labels, k=10, assume_unique=True):
+     """Compute the precision at K.
+     Compute the average precision of all the queries, truncated at
+     ranking position k. If for a query, the ranking algorithm returns
+     n (n is less than k) results, the precision value will be computed
+     as #(relevant items retrieved) / k. This formula also applies when
+     the size of the ground truth set is less than k.
+     If a query has an empty ground truth set, zero will be used as
+     precision together with a warning.
+     Parameters
+     ----------
+     predictions : array-like, shape=(n_predictions,)
+         The prediction array. The items that were predicted, in descending
+         order of relevance.
+     labels : array-like, shape=(n_ratings,)
+         The labels (positively-rated items).
+     k : int, optional (default=10)
+         The rank at which to measure the precision.
+     assume_unique : bool, optional (default=True)
+         Whether to assume the items in the labels and predictions are each
+         unique. That is, the same item is not predicted multiple times or
+         rated multiple times.
+     Examples
+     --------
+     >>> # predictions for 3 users
+     >>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
+     ...          [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
+     ...          [1, 2, 3, 4, 5]]
+     >>> # labels for the 3 users
+     >>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
+     >>> precision_at(preds, labels, 1)
+     0.33333333333333331
+     >>> precision_at(preds, labels, 5)
+     0.26666666666666666
+     >>> precision_at(preds, labels, 15)
+     0.17777777777777778
+     """
+     # validate K
+     _require_positive_k(k)
+
+     def _inner_pk(pred, lab):
+         # need to compute the count of the number of values in the predictions
+         # that are present in the labels. We'll use numpy in1d for this (set
+         # intersection in O(1))
+         if lab.shape[0] > 0:
+             n = min(pred.shape[0], k)
+             cnt = np.in1d(pred[:n], lab, assume_unique=assume_unique).sum()
+             return float(cnt) / k
+         else:
+             return _warn_for_empty_labels()
+
+     return _mean_ranking_metric(predictions, labels, _inner_pk)
+
+
+ def mean_average_precision(predictions, labels, assume_unique=True):
+     """Compute the mean average precision on predictions and labels.
+     Returns the mean average precision (MAP) of all the queries. If a query
+     has an empty ground truth set, the average precision will be zero and a
+     warning is generated.
+     Parameters
+     ----------
+     predictions : array-like, shape=(n_predictions,)
+         The prediction array. The items that were predicted, in descending
+         order of relevance.
+     labels : array-like, shape=(n_ratings,)
+         The labels (positively-rated items).
+     assume_unique : bool, optional (default=True)
+         Whether to assume the items in the labels and predictions are each
+         unique. That is, the same item is not predicted multiple times or
+         rated multiple times.
+     Examples
+     --------
+     >>> # predictions for 3 users
+     >>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
+     ...          [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
+     ...          [1, 2, 3, 4, 5]]
+     >>> # labels for the 3 users
+     >>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
+     >>> mean_average_precision(preds, labels)
+     0.35502645502645497
+     """
+     def _inner_map(pred, lab):
+         if lab.shape[0]:
+             # compute the number of elements within the predictions that are
+             # present in the actual labels, and get the cumulative sum weighted
+             # by the index of the ranking
+             n = pred.shape[0]
+
+             # Scala code from Spark source:
+             # var i = 0
+             # var cnt = 0
+             # var precSum = 0.0
+             # val n = pred.length
+             # while (i < n) {
+             #     if (labSet.contains(pred(i))) {
+             #         cnt += 1
+             #         precSum += cnt.toDouble / (i + 1)
+             #     }
+             #     i += 1
+             # }
+             # precSum / labSet.size
+
+             arange = np.arange(n, dtype=np.float32) + 1.  # this is the denom
+             present = np.in1d(pred[:n], lab, assume_unique=assume_unique)
+             prec_sum = np.ones(present.sum()).cumsum()
+             denom = arange[present]
+             return (prec_sum / denom).sum() / lab.shape[0]
+
+         else:
+             return _warn_for_empty_labels()
+
+     return _mean_ranking_metric(predictions, labels, _inner_map)
+
+
+ def ndcg_at(predictions, labels, k=10, assume_unique=True):
+     """Compute the normalized discounted cumulative gain at K.
+     Compute the average NDCG value of all the queries, truncated at ranking
+     position k. The discounted cumulative gain at position k is computed as:
+         sum_{i=1}^{k} (2^{relevance of i-th item} - 1) / log(i + 1)
+     and the NDCG is obtained by dividing the DCG value on the ground truth set.
+     In the current implementation, the relevance value is binary.
+     If a query has an empty ground truth set, zero will be used as
+     NDCG together with a warning.
+     Parameters
+     ----------
+     predictions : array-like, shape=(n_predictions,)
+         The prediction array. The items that were predicted, in descending
+         order of relevance.
+     labels : array-like, shape=(n_ratings,)
+         The labels (positively-rated items).
+     k : int, optional (default=10)
+         The rank at which to measure the NDCG.
+     assume_unique : bool, optional (default=True)
+         Whether to assume the items in the labels and predictions are each
+         unique. That is, the same item is not predicted multiple times or
+         rated multiple times.
+     Examples
+     --------
+     >>> # predictions for 3 users
+     >>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
+     ...          [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
+     ...          [1, 2, 3, 4, 5]]
+     >>> # labels for the 3 users
+     >>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
+     >>> ndcg_at(preds, labels, 3)
+     0.3333333432674408
+     >>> ndcg_at(preds, labels, 10)
+     0.48791273434956867
+     References
+     ----------
+     .. [1] K. Jarvelin and J. Kekalainen, "IR evaluation methods for
+            retrieving highly relevant documents."
+     """
+     # validate K
+     _require_positive_k(k)
+
+     def _inner_ndcg(pred, lab):
+         if lab.shape[0]:
+             # if we do NOT assume uniqueness, the set is a bit different here
+             if not assume_unique:
+                 lab = np.unique(lab)
+
+             n_lab = lab.shape[0]
+             n_pred = pred.shape[0]
+             n = min(max(n_pred, n_lab), k)  # min(min(p, l), k)?
+
+             # similar to mean_avg_prcsn, we need an arange, but this time +2
+             # since python is zero-indexed, and the denom typically needs +1.
+             # Also need the log base2...
+             arange = np.arange(n, dtype=np.float32)  # length n
+
+             # since we are only interested in the arange up to n_pred, truncate
+             # if necessary
+             arange = arange[:n_pred]
+             denom = np.log2(arange + 2.)  # length n
+             gains = 1. / denom  # length n
+
+             # compute the gains where the prediction is present in the labels
+             dcg_mask = np.in1d(pred[:n], lab, assume_unique=assume_unique)
+             dcg = gains[dcg_mask].sum()
+
+             # the max DCG is sum of gains where the index < the label set size
+             max_dcg = gains[arange < n_lab].sum()
+             return dcg / max_dcg
+
+         else:
+             return _warn_for_empty_labels()
+
+     return _mean_ranking_metric(predictions, labels, _inner_ndcg)
project/evaluation/run.py ADDED
@@ -0,0 +1,69 @@
+ import sys
+ import csv
+ import os
+ import yaml
+ import pickle
+ import numpy as np
+ import pandas as pd
+ import project.evaluation.metrics as m
+ from os.path import exists
+ from project.data.preparation import prepare_data, get_embeddings
+ from project.recsys.helper import Helper
+ from datetime import datetime
+ from project.recsys.algorithms import execute_algo
+ from project.evaluation.ResultReport import Results
+ from keras.models import model_from_yaml
+
+ def get_rnn():
+     model = model_from_yaml(open('training_model.yaml', 'r'))
+     model.load_weights('training_weights.h5')
+     return model
+
+ def skip_all(executed, params, k):
+     folds = executed[executed['params'] == params]['folds']
+     return folds.max() == k
+
+ def skip_fold(executed, params, fold):
+     folds = executed[executed['params'] == params]['folds']
+     return folds.max() >= fold
+
+ def cross_validation(df, conf, setups):
+     params = conf['evaluation']
+     r_paths = conf['results']
+
+     kfold = prepare_data(df, conf)
+     dataset = params['dataset']
+     topN = int(params['topN'])
+     k = int(params['k'])
+     results = Results(setups, k)
+     exec_path = r_paths['full']
+     pwd_rec = 'tmp/{}/rec/'.format(dataset)
+
+     if not exists(pwd_rec):
+         os.mkdir(pwd_rec)
+     if not exists(exec_path):
+         pd.DataFrame({}, columns=['params', 'algo', 'folds', 'prec', 'rec', 'f1', 'map', 'ndcg@5', 'p@5']).to_csv(exec_path, index=None, sep='\t')
+
+     executed = pd.read_csv(exec_path, sep='\t')
+
+     for setup in setups:
+         _, params, path = setup
+         if not exists(pwd_rec + params):
+             os.mkdir(pwd_rec + params)
+         # Skip setups and folds whose results are already in the output file
+         if skip_all(executed, params, k):
+             continue
+         songs = df['song'].unique().tolist()
+         m2v, sm2v = get_embeddings(path, songs)
+         songs = pd.DataFrame({'m2v': [m2v[x] for x in songs], 'sm2v': [sm2v[x] for x in songs]}, index=songs, columns=['m2v', 'sm2v'])
+         fold = 1
+         for train, test in kfold:
+             if skip_fold(executed, params, fold):
+                 fold += 1
+                 continue
+             time = datetime.now().strftime('%d/%m/%Y %H:%M')
+             print('%s | fold-%d | Running recsys w/ k-fold with the following params: %s' % (time, fold, params))
+             helper = Helper(train, test, songs, dataset)
+             m2vTN, sm2vTN, csm2vTN, csm2vUK = execute_algo(train.index, test.index, songs, topN, k, helper, pwd_rec + params)
+             res = results.fold_results(params, m2vTN, sm2vTN, csm2vTN, csm2vUK, fold)
+             res.to_csv(exec_path, sep='\t', mode='a', index=None, header=None)
+             fold += 1
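
Once a run finishes, the cumulative results file (the `results: full` path in the configuration) can be inspected directly — a minimal sketch, assuming the sample config's `outputs/sample.csv` path:

```python
# Minimal sketch of inspecting the cumulative results file written by
# cross_validation(). The path comes from the sample configuration
# (results -> full); the file is tab-separated despite the .csv suffix.
import pandas as pd

res = pd.read_csv('outputs/sample.csv', sep='\t')
# Average each metric over folds, per parameter setup and algorithm
summary = res.groupby(['params', 'algo'])[['prec', 'rec', 'f1', 'map', 'ndcg@5', 'p@5']].mean()
print(summary)
```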
project/models/embeddings.py ADDED
@@ -0,0 +1,166 @@
+ import sys
+
+ import pickle
+ import pandas as pd
+ import numpy as np
+ from os import makedirs
+ from os.path import exists
+ from gensim.models import Word2Vec, Doc2Vec
+ from gensim.models.doc2vec import TaggedDocument
+ from datetime import datetime
+ from glove import Glove, Corpus
+ from project.models.rnn import rnn
+ from project.models.setups import Setups
+ from project.models.seq2seq import start as rnn_start
+
+ def data_prep(model, df):
+     # Turn the listening log into song sequences, grouped by user or by session
+     if model == 'user':
+         return df.groupby(by='user')['song'].apply(list).values.tolist()
+     if model == 'user_doc':
+         return df.groupby(by='user')['song'].apply(lambda x: TaggedDocument(words=x.tolist(), tags=[x.name])).values.tolist()
+     if model == 'session':
+         return df.groupby(by='session')['song'].apply(list).values.tolist()
+     if model == 'session_doc':
+         return df.groupby(by='session')['song'].apply(lambda x: TaggedDocument(words=x.tolist(), tags=[x.name])).values.tolist()
+
+ def music2vec(data, w2v_type, dim, lr, window, down, neg_sample, epochs):
+     sentences = data_prep(w2v_type, data)
+     return Word2Vec(sentences, size=dim, alpha=lr, window=window, sample=down,
+                     sg=1, hs=0, negative=neg_sample, iter=epochs, min_count=1, compute_loss=True)
+
+ def doc2vec(data, d2v_type, dim, lr, window, down, neg_sample, epochs):
+     sequence = data_prep(d2v_type, data)
+     return Doc2Vec(sequence, dm=1, vector_size=dim, alpha=lr, window=window, sample=down,
+                    negative=neg_sample, epochs=epochs, min_count=1, compute_loss=True)
+
+ def glove(data, glove_type, window, dim, lr, epochs):
+     sentences = data_prep(glove_type, data)
+     corpus = Corpus()
+     corpus.fit(sentences, window=window)
+     glove = Glove(no_components=dim, learning_rate=lr)
+     glove.fit(corpus.matrix, epochs=epochs, no_threads=4, verbose=True)
+     glove.add_dictionary(corpus.dictionary)
+     return glove
+
+ def embeddings(df, conf):
+     ds = conf['evaluation']['dataset']
+     cwd = 'tmp/{}/models'.format(ds)
+
+     if not exists(cwd):
+         makedirs(cwd)
+
+     setups = Setups(conf)
+     generators = setups.get_generators()
+
+     c_id = 0
+     setups_id = []
+     for method, generator in generators:
+         if method == 'rnn':
+             for s in generator:
+                 to_str = setups.setup_to_string(c_id, s, method)
+                 print(to_str)
+
+                 path = '{}/{}__{}.pickle'.format(cwd, method, c_id)
+                 path_s = '{}/s{}__{}.pickle'.format(cwd, method, c_id)
+
+                 if not exists(path):
+                     user, session = rnn(df, ds, s['model'], s['window'], s['epochs'],
+                                         s['batch'], s['dim'], s['num_units'], s['bidi'])
+                     fu = open(path, 'wb')
+                     fs = open(path_s, 'wb')
+
+                     pickle.dump(user, fu, protocol=pickle.HIGHEST_PROTOCOL)
+                     pickle.dump(session, fs, protocol=pickle.HIGHEST_PROTOCOL)
+
+                     fu.close()
+                     fs.close()
+
+                 setups_id.append([c_id, to_str, path])
+                 c_id += 1
+         if method == 'music2vec':
+             for s in generator:
+                 to_str = setups.setup_to_string(c_id, s, method)
+                 print(to_str)
+
+                 path = '{}/{}__{}.model'.format(cwd, method, c_id)
+                 path_s = '{}/s{}__{}.model'.format(cwd, method, c_id)
+
+                 if not exists(path):
+                     m2v = music2vec(df, 'user', s['dim'], s['lr'], s['window'], s['down'], s['neg_sample'], s['epochs'])
+                     sm2v = music2vec(df, 'session', s['dim'], s['lr'], s['window'], s['down'], s['neg_sample'], s['epochs'])
+
+                     m2v.save(path)
+                     sm2v.save(path_s)
+
+                 setups_id.append([c_id, to_str, path])
+
+                 c_id += 1
+         if method == 'doc2vec':
+             for s in generator:
+                 to_str = setups.setup_to_string(c_id, s, method)
+                 path = '{}/{}__{}.model'.format(cwd, method, c_id)
+                 path_s = '{}/s{}__{}.model'.format(cwd, method, c_id)
+                 print(to_str)
+
+                 if not exists(path):
+                     d2v = doc2vec(df, 'user_doc', s['dim'], s['lr'], s['window'], s['down'], s['neg_sample'], s['epochs'])
+                     sd2v = doc2vec(df, 'session_doc', s['dim'], s['lr'], s['window'], s['down'], s['neg_sample'], s['epochs'])
+
+                     d2v.save(path)
+                     sd2v.save(path_s)
+
+                 setups_id.append([c_id, to_str, path])
+
+                 c_id += 1
+         if method == 'glove':
+             for s in generator:
+                 to_str = setups.setup_to_string(c_id, s, method)
+                 path = '{}/{}__{}.model'.format(cwd, method, c_id)
+                 path_s = '{}/s{}__{}.model'.format(cwd, method, c_id)
+                 print(to_str)
+
+                 if not exists(path):
+                     glv = glove(df, 'user', s['window'], s['dim'], s['lr'], s['epochs'])
+                     sglv = glove(df, 'session', s['window'], s['dim'], s['lr'], s['epochs'])
+
+                     glv.save(path)
+                     sglv.save(path_s)
+
+                 # Mirror the other branches so glove setups are also evaluated
+                 setups_id.append([c_id, to_str, path])
+
+                 c_id += 1
+         if method == 'genres':
+             for s in generator:
+                 to_str = s
+                 print(to_str)
+                 path = 'tmp/{}/experiments/'.format(ds)
+                 path_s = 'tmp/{}/experiments/'.format(ds)
+
+                 if s == 'add-all':
+                     path += 'all_genres/add/all_add.pickle'
+                     path_s += 'all_genres/add/sall_add.pickle'
+                 if s == 'mul-all':
+                     path += 'all_genres/mul/all_mul.pickle'
+                     path_s += 'all_genres/mul/sall_mul.pickle'
+                 if s == 'avg-all':
+                     path += 'all_genres/avg/all_avg.pickle'
+                     path_s += 'all_genres/avg/sall_avg.pickle'
+                 if s == 'add-ran':
+                     path += 'random_genres/add/ran_add.pickle'
+                     path_s += 'random_genres/add/sran_add.pickle'
+                 if s == 'mul-ran':
+                     path += 'random_genres/mul/ran_mul.pickle'
+                     path_s += 'random_genres/mul/sran_mul.pickle'
+                 if s == 'avg-ran':
+                     path += 'random_genres/avg/ran_avg.pickle'
+                     path_s += 'random_genres/avg/sran_avg.pickle'
+
+                 setups_id.append([c_id, to_str, path])
+
+                 c_id += 1
+
+     setups_id = np.stack(setups_id, axis=0)
+
+     np.save('{}/ids'.format(cwd), setups_id)
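
To illustrate what `data_prep` feeds the embedding models, here is a toy run of the `'user'` grouping (the data values are made up):

```python
# Toy illustration of data_prep(model='user', df): the listening log becomes
# one song sequence per user, which is what Word2Vec/GloVe consume as
# "sentences". The values below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    'user':    ['u1', 'u1', 'u1', 'u2', 'u2'],
    'song':    ['a', 'b', 'c', 'b', 'd'],
    'session': ['s1', 's1', 's2', 's3', 's3'],
})
sentences = df.groupby(by='user')['song'].apply(list).values.tolist()
print(sentences)  # [['a', 'b', 'c'], ['b', 'd']]
```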
project/models/rnn.py ADDED
@@ -0,0 +1,180 @@
+ from os.path import exists
+ from keras.utils import to_categorical
+ from keras.models import Model
+ from keras.layers import Embedding, LSTM, Dense, CuDNNGRU, Input, Bidirectional, Dropout, Concatenate
+ from keras.models import Sequential, load_model
+ from keras.callbacks import EarlyStopping, ModelCheckpoint
+ from keras.preprocessing.sequence import TimeseriesGenerator
+ import concurrent.futures as fut
+ import os
+ import gc
+ import keras
+ import pickle
+ import time
+ import numpy as np
+ import pickle as pk
+ import pandas as pd
+ import tensorflow as tf
+ import matplotlib.pyplot as plt
+ from math import floor
+
+
+ tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
+
+ def get_window(playlist, ix, window):
+     el = playlist[ix]
+
+     # This is the perfect case: full context on both sides
+     if (ix - window >= 0) and (ix + window + 1) < len(playlist):
+         window = playlist[ix - window:ix] + playlist[ix + 1:(ix + 1) + window]
+         return window
+
+     # Not in the perfect case; fall back to the damage-reduction clause:
+     b4 = []
+     after = []
+     # If the problem is in the before clause, prepend '0' until it meets the window size.
+     if (ix - window < 0):
+         b4 = (abs(ix - window) * ['0']) + playlist[0:ix]
+     else:
+         b4 = playlist[ix - window:ix]
+     # If the problem is in the after clause, append '0' until it meets the window size.
+     if (ix + window + 1) > len(playlist):
+         num = (ix + window + 1) - len(playlist)
+         after = playlist[ix + 1:ix + window + 1] + (num * ['0'])
+     else:
+         after = playlist[ix + 1:(ix + 1) + window]
+
+     return b4 + after
+
+
+ def window_seqs(sequence, w_size):
+     ix = 0
+     max_ix = (len(sequence) - 1) - w_size
+     x = []
+     y = []
+     while ix < max_ix:
+         x.append(sequence[ix:ix+w_size])
+         y.append([sequence[ix+w_size]])
+         ix += 1
+     return x, y
+
+ def rnn(df, DS, MODEL, W_SIZE, EPOCHS, BATCH_SIZE, EMBEDDING_DIM, NUM_UNITS, BIDIRECTIONAL):
+     pwd = 'dataset/{}/'.format(DS)
+     WINDOW = W_SIZE * 2
+
+     vocab = sorted(set(df.song.unique().tolist()))
+     vocab_size = len(vocab) + 1
+     song2ix = {u: i for i, u in enumerate(vocab, 1)}
+     pickle.dump(song2ix, open('{}_song2ix.pickle'.format(DS), 'wb'), pickle.HIGHEST_PROTOCOL)
+
+
+     if not exists(pwd + 'song_context_{}.txt'.format(W_SIZE)):
+         df['song'] = df.song.apply(lambda song: song2ix[song])
+         u_playlists = df[['user', 'song']].groupby('user').agg(tuple)['song'].values
+         u_playlists = [list(p) for p in u_playlists]
+         s_playlists = df[['session', 'song']].groupby('session').agg(tuple)['song'].values
+         s_playlists = [list(p) for p in s_playlists]
+
+         nou_playlists = len(u_playlists)
+         nos_playlists = len(s_playlists)
+
+         user_windows = dict()
+         session_windows = dict()
+
+
+         for song in vocab:
+             user_windows[song2ix[song]] = []
+             session_windows[song2ix[song]] = []
+
+         k4 = 1
+         for pl in u_playlists:
+             print('[{}/{}] [USER] Playlist'.format(k4, nou_playlists), flush=False, end='\r')
+             k4 += 1
+             ixes = range(0, len(pl))
+             s_windows = [(pl[ix], get_window(pl, ix, W_SIZE)) for ix in ixes]
+             for song, window in s_windows:
+                 user_windows[song].append(window)
+         print()
+         k4 = 1
+         for pl in s_playlists:
+             print('[{}/{}] [SESSION] Playlist'.format(k4, nos_playlists), flush=False, end='\r')
+             k4 += 1
+             ixes = range(0, len(pl))
+             s_windows = [(pl[ix], get_window(pl, ix, W_SIZE)) for ix in ixes]
+             for song, window in s_windows:
+                 session_windows[song].append(window)
+         print()
+
+         f = open(pwd + 'song_context_{}.txt'.format(W_SIZE), 'w')
+         for song in vocab:
+             u_occurrences = user_windows[song2ix[song]]
+             s_occurrences = session_windows[song2ix[song]]
+             for u_o, s_o in zip(u_occurrences, s_occurrences):
+                 print('{}\t{}\t{}'.format(','.join([str(i) for i in u_o]), ','.join([str(i) for i in s_o]), str(song2ix[song])), file=f)
+         f.close()
+
+     f = open(pwd + 'song_context_{}.txt'.format(W_SIZE), mode='r')
+
+     data = []
+     for line in f:
+         line = line.replace('\n', '')
+         input_user, input_session, target = line.split('\t')
+         line = [np.array([int(x) for x in input_user.split(',')]), np.array([int(x) for x in input_session.split(',')]), int(target)]
+         data.append(line)
+
+     data = np.vstack(data)
+
+     np.random.shuffle(data)
+
+     def batch(data, bs):
+         while True:
+             for ix in range(0, len(data), bs):
+                 u_input = data[ix:ix+bs, 0]
+                 s_input = data[ix:ix+bs, 1]
+                 target = data[ix:ix+bs, 2]
+                 yield [np.vstack(u_input), np.vstack(s_input)], to_categorical(target, num_classes=vocab_size)
+
+
+     train, test = data[int(len(data) * .20):], data[:int(len(data) * .20)]
+
+     # Two inputs: the session-level context window and the user-level one
+     input_session = Input(batch_shape=(None, WINDOW))
+     embedding_session = Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, name='Session_Embeddings', mask_zero=True)(input_session)
+     drop_session = Dropout(0.2)(embedding_session)
+     rec_session = LSTM(NUM_UNITS, name='Session_LSTM')(drop_session)
+     drop_session = Dropout(0.2)(rec_session)
+
+     input_user = Input(batch_shape=(None, WINDOW))
+     embedding_user = Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, name='User_Embeddings', mask_zero=True)(input_user)
+     drop_user = Dropout(0.2)(embedding_user)
+     rec_user = LSTM(NUM_UNITS, name='User_LSTM')(drop_user)
+     drop_user = Dropout(0.2)(rec_user)
+     combination = Concatenate()([drop_session, drop_user])
+     dense = Dense(vocab_size, activation='softmax', name='Densa')(combination)
+     model = Model(inputs=[input_session, input_user], outputs=dense)
+     checkpoint = ModelCheckpoint('{}_model_checkpoint.h5'.format(DS), monitor='loss', verbose=0, save_best_only=False, period=1)
+     es = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=5)
+
+     model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
+     model.summary()
+
+     if exists('{}_model_checkpoint.h5'.format(DS)):
+         model = load_model('{}_model_checkpoint.h5'.format(DS))
+
+     model.fit_generator(generator=batch(train, BATCH_SIZE), steps_per_epoch=len(train) // BATCH_SIZE, epochs=EPOCHS,
+                         validation_data=batch(test, BATCH_SIZE), validation_steps=len(test) // BATCH_SIZE, callbacks=[es, checkpoint])
+
+     # The learned embedding matrices are the product of interest
+     session_embeddings = model.get_layer('Session_Embeddings').get_weights()[0]
+     user_embeddings = model.get_layer('User_Embeddings').get_weights()[0]
+
+     u_emb = {}
+     s_emb = {}
+
+     for song in vocab:
+         u_emb[song] = user_embeddings[song2ix[song]]
+         s_emb[song] = session_embeddings[song2ix[song]]
+
+     del model
+     gc.collect()
+     return u_emb, s_emb
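
To make the padding behavior of `get_window` concrete, here is a small traced example (playlist values are made-up song indices):

```python
# Worked example of get_window() above with W_SIZE = 2.
# Positions past either end of the playlist are padded with '0',
# which is also the index that the Keras Embedding masks out (mask_zero=True).
playlist = [3, 7, 9, 4, 5]

# middle of the playlist: two songs before and two after
# get_window(playlist, 2, 2) -> [3, 7, 4, 5]

# near the start: one position is missing before ix, so one '0' is prepended
# get_window(playlist, 1, 2) -> ['0', 3, 9, 4]

# near the end: one position is missing after ix, so one '0' is appended
# get_window(playlist, 3, 2) -> [7, 9, 5, '0']
```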
project/models/seq2seq.py ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import pandas as pd
3
+
4
+ from project.data.preprocess import gen_seq_files
5
+ from os.path import exists
6
+ from keras.models import Model
7
+ from keras.callbacks import EarlyStopping
8
+ from keras.layers import Dense, CuDNNLSTM, CuDNNGRU, Embedding, Input, SimpleRNN
9
+
10
+ def read_input_targets(path, win_size, t):
11
+ d = {}
12
+ if t == 'session':
13
+ f = open(path + 'c_seqs.csv')
14
+ s_i = []
15
+ for line in f:
16
+ l = line.rstrip('\n').split('\t')
17
+ x = ' '.join(l[1].replace('[', '').replace(']', '').split(','))
18
+ s_i.append(x)
19
+ if l[0] in d:
20
+ d[l[0]].append(x)
21
+ else:
22
+ d[l[0]] = [x]
23
+ f.close()
24
+ s_t = ['START_ ' + session + ' _END' for session in s_i]
25
+ return s_i, s_t, d
26
+ if t == 'listening':
27
+ f = open(path + 'u_seqs.csv')
28
+ s_i = []
29
+ for line in f:
30
+ l = line.rstrip('\n').split('\t')
31
+ x = ' '.join(l[1].replace('[', '').replace(']', '').split(','))
32
+ s_i.append(x)
33
+ if l[0] in d:
34
+ d[l[0]].append(x)
35
+ else:
36
+ d[l[0]] = [x]
37
+ f.close()
38
+ s_t = ['START_ ' + session + ' _END' for session in s_i]
39
+ return s_i, s_t, d
40
+
41
+ def get_unique_songs(s_i, s_t):
42
+ all_i = set()
43
+ all_t = set()
44
+ for songs in s_i:
45
+ for song in songs.split():
46
+ if song not in all_i:
47
+ all_i.add(song)
48
+ for songs in s_t:
49
+ for song in songs.split():
50
+ if song not in all_t:
51
+ all_t.add(song)
52
+ return sorted(list(all_i)), sorted(list(all_t))
53
+
54
+ def get_max_length(s_i, s_t):
55
+ max_i = np.max([len(session.split()) for session in s_i])
56
+ max_t = np.max([len(session.split()) for session in s_t])
57
+ return max_i, max_t
58
+
59
+ def get_dicts(i_songs, t_songs):
60
+     song_ix_i = dict((song, i + 1) for i, song in enumerate(i_songs))
+     song_ix_t = dict((song, i + 1) for i, song in enumerate(t_songs))
+     ix_song_i = dict((i, song) for song, i in song_ix_i.items())
+     ix_song_t = dict((i, song) for song, i in song_ix_t.items())
+     return song_ix_i, song_ix_t, ix_song_i, ix_song_t
+
+ def __run_s2s(sessions_i, sessions_t, num_songs, song_ix, max_l, NUM_DIM=128, BATCH_SIZE=128, EPOCHS=50, MODEL='RNN', WINDOW_SIZE=5):
+     X, y = sessions_i, sessions_t
+     num_encoder_songs, num_decoder_songs = num_songs
+     song_ix_i, song_ix_t = song_ix
+     max_length_i, max_length_t = max_l
+
+     def generate_batch(X, y, batch_size=128):
+         while True:
+             for j in range(0, len(X), batch_size):
+                 encoder_input_data = np.zeros((batch_size, max_length_i), dtype='float32')
+                 decoder_input_data = np.zeros((batch_size, max_length_t), dtype='float32')
+                 decoder_target_data = np.zeros((batch_size, max_length_t, num_decoder_songs), dtype='float32')
+                 for i, (input_sequence, target_sequence) in enumerate(zip(X[j:j + batch_size], y[j:j + batch_size])):
+                     for t, word in enumerate(input_sequence.split()):
+                         encoder_input_data[i, t] = song_ix_i[word] if word != '-' else 0
+                     # Teacher forcing: the decoder input is the target sequence without
+                     # its last token, and the one-hot target is the same sequence
+                     # shifted one step to the left.
+                     for t, word in enumerate(target_sequence.split()):
+                         if t < len(target_sequence.split()) - 1:
+                             decoder_input_data[i, t] = song_ix_t[word] if word != '-' else 0
+                         if t > 0:
+                             decoder_target_data[i, t - 1, song_ix_t[word] if word != '-' else 0] = 1
+                 yield ([encoder_input_data, decoder_input_data], decoder_target_data)
+
+     # Shuffle inputs and targets with a single shared permutation so that each
+     # input sequence stays paired with its target; shuffling X and y
+     # independently would break the alignment between them.
+     perm = np.random.permutation(len(X))
+     X = [X[i] for i in perm]
+     y = [y[i] for i in perm]
+
+     X_train, X_test = X[int(len(X) * .1):], X[:int(len(X) * .1)]
+     y_train, y_test = y[int(len(y) * .1):], y[:int(len(y) * .1)]
+
+     TRAIN_SAMPLES = len(X_train)
+     VAL_SAMPLES = len(X_test)
+
+     ENCODER_INPUT = Input(shape=(None,))
+     ENCODER_EMBEDDING = Embedding(num_encoder_songs, NUM_DIM)(ENCODER_INPUT)
+     if MODEL == 'LSTM':
+         ENCODER_NN = CuDNNLSTM(NUM_DIM, return_state=True)
+         _, state_h, state_c = ENCODER_NN(ENCODER_EMBEDDING)
+         ENCODER_STATE = [state_h, state_c]
+     elif MODEL == 'GRU':
+         ENCODER_NN = CuDNNGRU(NUM_DIM, return_state=True)
+         _, ENCODER_STATE = ENCODER_NN(ENCODER_EMBEDDING)
+     elif MODEL == 'RNN':
+         ENCODER_NN = SimpleRNN(NUM_DIM, return_state=True)
+         _, ENCODER_STATE = ENCODER_NN(ENCODER_EMBEDDING)
+
+     DECODER_INPUT = Input(shape=(None,))
+     DECODER_EMBEDDING = Embedding(num_decoder_songs, NUM_DIM)(DECODER_INPUT)
+     if MODEL == 'LSTM':
+         DECODER_NN = CuDNNLSTM(NUM_DIM, return_sequences=True, return_state=True)
+         DECODER_OUTPUT, _, _ = DECODER_NN(DECODER_EMBEDDING, initial_state=ENCODER_STATE)
+     elif MODEL == 'GRU':
+         DECODER_NN = CuDNNGRU(NUM_DIM, return_sequences=True, return_state=True)
+         DECODER_OUTPUT, _ = DECODER_NN(DECODER_EMBEDDING, initial_state=ENCODER_STATE)
+     elif MODEL == 'RNN':
+         DECODER_NN = SimpleRNN(NUM_DIM, return_sequences=True, return_state=True)
+         DECODER_OUTPUT, _ = DECODER_NN(DECODER_EMBEDDING, initial_state=ENCODER_STATE)
+     DENSE_DECODER = Dense(num_decoder_songs, activation='softmax')
+     DECODER_OUTPUT = DENSE_DECODER(DECODER_OUTPUT)
+
+     es = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=5)
+
+     model = Model([ENCODER_INPUT, DECODER_INPUT], DECODER_OUTPUT)
+     model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
+     model.summary()
+     model.fit_generator(generator=generate_batch(X_train, y_train, batch_size=BATCH_SIZE),
+                         steps_per_epoch=TRAIN_SAMPLES // BATCH_SIZE,
+                         epochs=EPOCHS,
+                         validation_data=generate_batch(X_test, y_test, batch_size=BATCH_SIZE),
+                         validation_steps=VAL_SAMPLES // BATCH_SIZE, callbacks=[es])
+
+     # The trained encoder (inputs -> hidden state) is reused to extract song
+     # embeddings; generate_batch is returned so the caller can feed sequences
+     # through it one at a time.
+     return Model(ENCODER_INPUT, ENCODER_STATE), generate_batch
+
+ def start(df, conf, id, ds):
+     s2s = conf
+     if not exists('dataset/{}/u_seqs.csv'.format(ds)):
+         print('Files %s and %s will be written to "%s"' % ('u_seqs.csv', 'c_seqs.csv', 'dataset/{}/'.format(ds)))
+         gen_seq_files(df, 'dataset/{}/'.format(ds), conf['window_size'])
+     songs = df.song.unique()
+     del df
+
+     sessions_i, sessions_t, song_seqs_ses = read_input_targets('dataset/{}/'.format(ds), s2s['window_size'], 'session')
+     listening_i, listening_t, song_seqs_list = read_input_targets('dataset/{}/'.format(ds), s2s['window_size'], 'listening')
+     input_songs, target_songs = get_unique_songs(listening_i, listening_t)
+     max_length_i, max_length_t = get_max_length(listening_i, listening_t)
+     num_encoder_songs, num_decoder_songs = len(input_songs) + 1, len(target_songs) + 1
+     song_ix_i, song_ix_t, _, _ = get_dicts(input_songs, target_songs)
+
+     model, gen = __run_s2s(listening_i, listening_t, (num_encoder_songs, num_decoder_songs), (song_ix_i, song_ix_t),
+                            (max_length_i, max_length_t), NUM_DIM=s2s['vector_dim'], BATCH_SIZE=s2s['batch_size'], EPOCHS=s2s['epochs'],
+                            MODEL=s2s['model'], WINDOW_SIZE=s2s['window_size'])
+
+     # A song's embedding is the mean of the encoder states of every sequence
+     # in which the song appears.
+     embeddings = []
+     for song in songs:
+         seqs = song_seqs_list[song]
+         get_seq = gen(seqs, ['START_ ' + seq + ' _END' for seq in seqs], batch_size=1)
+         seq_embeddings = []
+         i = 0
+         for (input_seq, _), _ in get_seq:
+             if i == len(seqs):
+                 break
+             if s2s['model'] == 'LSTM':
+                 # The LSTM encoder returns two state tensors (h, c); only h is kept.
+                 state, _ = model.predict(input_seq)
+             else:
+                 state = model.predict(input_seq)
+             seq_embeddings.append(state[0])
+             i += 1
+         emb_final = np.mean(np.array(seq_embeddings), 0)
+         embeddings.append(emb_final)
+     emb_values = np.array([songs, embeddings])
+     np.save('tmp/{}/models/{}'.format(ds, id), emb_values)
+
+     ######################################################################################################################
+
+     # Repeat the procedure with the session sequences; the result is saved
+     # under an 's' prefix.
+     model, gen = __run_s2s(sessions_i, sessions_t, (num_encoder_songs, num_decoder_songs), (song_ix_i, song_ix_t),
+                            (max_length_i, max_length_t), NUM_DIM=s2s['vector_dim'], BATCH_SIZE=s2s['batch_size'], EPOCHS=s2s['epochs'],
+                            MODEL=s2s['model'], WINDOW_SIZE=s2s['window_size'])
+
+     embeddings = []
+     for song in songs:
+         seqs = song_seqs_ses[song]
+         get_seq = gen(seqs, ['START_ ' + seq + ' _END' for seq in seqs], batch_size=1)
+         seq_embeddings = []
+         i = 0
+         for (input_seq, _), _ in get_seq:
+             if i == len(seqs):
+                 break
+             if s2s['model'] == 'LSTM':
+                 state, _ = model.predict(input_seq)
+             else:
+                 state = model.predict(input_seq)
+             seq_embeddings.append(state[0])
+             i += 1
+         emb_final = np.mean(np.array(seq_embeddings), 0)
+         embeddings.append(emb_final)
+     emb_values = np.array([songs, embeddings])
+     np.save('tmp/{}/models/s{}'.format(ds, id), emb_values)
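For reference, here is a minimal sketch of the teacher-forcing layout that `generate_batch` builds for a single target sequence. The vocabulary and sequence below are made up for illustration:

```
import numpy as np

# Toy target vocabulary; indices start at 1 because 0 is used for padding,
# mirroring the +1 offset in get_dicts.
song_ix_t = {'START_': 1, 'a': 2, 'b': 3, '_END': 4}
target = 'START_ a b _END'

max_length_t, num_decoder_songs = 4, 5
decoder_input = np.zeros((1, max_length_t), dtype='float32')
decoder_target = np.zeros((1, max_length_t, num_decoder_songs), dtype='float32')

for t, word in enumerate(target.split()):
    if t < len(target.split()) - 1:       # decoder input: every token but the last
        decoder_input[0, t] = song_ix_t[word]
    if t > 0:                             # target: the same tokens shifted left by one
        decoder_target[0, t - 1, song_ix_t[word]] = 1

print(decoder_input[0])              # [1. 2. 3. 0.]  ->  START_, a, b
print(decoder_target[0].argmax(-1))  # [2 3 4 0]      ->  a, b, _END (0 = padding row)
```

So at every timestep the decoder sees the current token and is trained to predict the next one.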
project/models/setups.py ADDED
@@ -0,0 +1,65 @@
+ class Setups():
+     def __init__(self, config):
+         self.__config = config
+         self.models_config = config['models']
+
+     def get_config(self):
+         return self.__config
+
+     def rnn_setups(self):
+         c = self.models_config['rnn']
+         for m in c['model']:
+             for w in c['window']:
+                 for n in c['num_units']:
+                     for e in c['embedding_dim']:
+                         for ep in c['epochs']:
+                             for bi in c['bi']:
+                                 yield {'window': int(w), 'model': m, 'dim': int(e), 'batch': int(c['batch']),
+                                        'epochs': int(ep), 'num_units': int(n), 'bidi': bi}
+
+     def d2v_m2v_setups(self, model):
+         c = self.models_config[model]
+         for w in c['window']:
+             for sample in c['negative_sample']:
+                 for down in c['down_sample']:
+                     for lr in c['learning_rate']:
+                         for ep in c['epochs']:
+                             for dim in c['embedding_dim']:
+                                 yield {'window': w, 'dim': int(dim), 'lr': float(lr), 'down': float(down), 'epochs': int(ep), 'neg_sample': float(sample)}
+
+     def glove_setups(self):
+         c = self.models_config['glove']
+         for w in c['window']:
+             for dim in c['embedding_dim']:
+                 for lr in c['learning_rate']:
+                     for ep in c['epochs']:
+                         yield {'window': int(w), 'dim': int(dim), 'lr': float(lr), 'epochs': int(ep)}
+
+     def genre_setups(self):
+         c = self.models_config['genres']
+         for a in c['all']:
+             yield '{}-{}'.format(a, 'all')
+         for r in c['ran']:
+             yield '{}-{}'.format(r, 'ran')
+
+     def __return_gen(self, model):
+         if model == 'rnn':
+             return self.rnn_setups()
+         if model in ('music2vec', 'doc2vec'):
+             return self.d2v_m2v_setups(model)
+         if model == 'glove':
+             return self.glove_setups()
+         if model == 'genres':
+             return self.genre_setups()
+
+     def get_generators(self):
+         generators = []
+         for k, v in self.__config['embeddings'].items():
+             if v['usage']:
+                 generators.append((k, self.__return_gen(k)))
+         return generators
+
+     def setup_to_string(self, id, setup_obj, model_type):
+         setup_str = '--'.join([x + ':' + str(y) for x, y in list(setup_obj.items())])
+         return '{}--{}--{}'.format(model_type, id, setup_str)
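As an illustration, this is the shape of config that `Setups` appears to expect. In the repository the config presumably comes from a YAML file; every key below is an assumption inferred from how the generators read it:

```
from project.models.setups import Setups

# Hypothetical config: keys inferred from rnn_setups and get_generators.
config = {
    'models': {
        'rnn': {
            'model': ['LSTM', 'GRU'],
            'window': [5],
            'num_units': [128],
            'embedding_dim': [128],
            'epochs': [50],
            'bi': [False],
            'batch': 128,
        },
    },
    'embeddings': {
        # Only methods whose 'usage' flag is true get a generator.
        'rnn': {'usage': True},
    },
}

s = Setups(config)
for name, gen in s.get_generators():
    for i, setup in enumerate(gen):
        print(s.setup_to_string(i, setup, name))
        # e.g. rnn--0--window:5--model:LSTM--dim:128--batch:128--epochs:50--num_units:128--bidi:False
```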
project/recsys/algorithms.py ADDED
@@ -0,0 +1,92 @@
+ import os
+ import sys
+ import time
+ import yaml
+ import pickle
+ import multiprocessing as mp
+ import numpy as np
+ from project.evaluation.metrics import get_metrics
+ from datetime import datetime
+ from sklearn.metrics.pairwise import cosine_similarity
+
+
+ def write_rec(pwd, sessions):
+     f = open(pwd, 'wb')
+     pickle.dump(sessions, f, protocol=pickle.HIGHEST_PROTOCOL)
+     f.close()
+
+ def recs(session, original, mtn_rec, smtn_rec, csmtn_rec, csmuk_rec):
+     return {'session': session, 'original': original, 'mtn_rec': mtn_rec.tolist(), 'smtn_rec': smtn_rec.tolist(),
+             'csmtn_rec': csmtn_rec.tolist(), 'csmuk_rec': csmuk_rec.tolist()}
+
+ def execute_algo(train, test, songs, topN, k_sim, data, pwd):
+
+     m2vTN = []
+     sm2vTN = []
+     csm2vTN = []
+     csm2vUK = []
+
+     u_songs = data.us_matrix()
+     users = data.uu_matrix()
+
+     def report_users(num_users):
+         def f_aux(ix_user, user_id, algo):
+             return '[{}/{}] Running algorithm {} for user {}!'.format(ix_user, num_users, algo, user_id)
+         return f_aux
+
+     num_users = len(test)
+     rep = report_users(num_users)
+     u = 1
+
+     def pref(u, k_similar, song):
+         # Similarity-weighted preference of user u for a song, based on which
+         # of the k most similar users have listened to it.
+         listened_to = [(k, u_songs[k, data.song_ix(song)] == 1) for k in k_similar]
+         num_listeners = [v[1] for v in listened_to].count(True)
+         sum_sims = 0
+         for u_k, listen in listened_to:
+             if listen:
+                 sum_sims += users[u][u_k] / num_listeners
+         return sum_sims
+
+
+     for user in test:
+         # Create a placeholder file for this user; it is overwritten by
+         # write_rec once all of the user's sessions have been processed.
+         f = open(pwd + '/' + user.replace('/', '_'), 'wb')
+         pickle.dump({}, f, protocol=pickle.HIGHEST_PROTOCOL)
+         f.close()
+
+         print(rep(u, user, 'M-TN'), flush=True, end='\r')  # flush so the '\r' progress line updates in place
+         user_cos = cosine_similarity(data.u_pref(user).reshape(1, -1), data.m2v_songs)[0]
+         user_tn = data.get_n_largest(user_cos, topN)
+
+         sim_ix = np.argpartition(users[data.ix_user(user)], -k_sim)[-k_sim:]
+         song_sim = np.array([pref(data.ix_user(user), sim_ix, s) for s in songs.index.values])
+         to_write = []
+         s = 1
+
+         sessions = data.user_sessions(user)
+         for (train_songs, test_songs) in sessions:
+             if len(train_songs) > 0:
+                 m2vTN.append(get_metrics(user_tn, test_songs))
+                 c_pref = data.c_pref(train_songs)
+
+                 print(rep(u, user, 'SM-TN'), flush=True, end='\r')
+                 con_cos = cosine_similarity(c_pref.reshape(1, -1), data.sm2v_songs)[0]
+                 cos_tn = data.get_n_largest(con_cos, topN)
+                 sm2vTN.append(get_metrics(cos_tn, test_songs))
+
+                 print(rep(u, user, 'CSM-TN'), flush=True, end='\r')
+                 f_cos = np.sum([user_cos, con_cos], axis=0)
+                 both_tn = data.get_n_largest(f_cos, topN)
+                 csm2vTN.append(get_metrics(both_tn, test_songs))
+
+                 print(rep(u, user, 'CSM-UK'), flush=True, end='\r')
+                 UK_cos = np.sum([song_sim, con_cos], axis=0)
+                 uk_tn = data.get_n_largest(UK_cos, topN)
+                 csm2vUK.append(get_metrics(uk_tn, test_songs))
+                 to_write.append(recs(s, test_songs, user_tn, cos_tn, both_tn, uk_tn))
+             s += 1
+         write_rec(pwd + '/' + user.replace('/', '_'), to_write)
+         u += 1
+
+     m_m2vTN = np.mean(m2vTN, axis=0).tolist()
+     m_sm2vTN = np.mean(sm2vTN, axis=0).tolist()
+     m_csm2vTN = np.mean(csm2vTN, axis=0).tolist()
+     m_csm2vUK = np.mean(csm2vUK, axis=0).tolist()
+     return (m_m2vTN, m_sm2vTN, m_csm2vTN, m_csm2vUK)
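All four recommenders above share one primitive: score every song against a preference vector with cosine similarity and keep the N best. A self-contained sketch of that step, with random stand-in embeddings:

```
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)
song_vecs = rng.rand(1000, 128)    # one embedding per song (stand-in data)
user_pref = rng.rand(128)          # e.g. the mean of the user's song embeddings

# Cosine similarity between the preference vector and every song at once.
scores = cosine_similarity(user_pref.reshape(1, -1), song_vecs)[0]

# argpartition places the topN largest scores in the last topN slots without
# a full sort, which is all a top-N recommender needs.
topN = 10
top_ix = np.argpartition(scores, -topN)[-topN:]
print(top_ix)
```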
project/recsys/helper.py ADDED
@@ -0,0 +1,92 @@
+ import os
+ import numpy as np
+ import math
+ from sklearn.metrics.pairwise import cosine_similarity
+ import warnings
+
+ class Helper():
+     def __init__(self, train, test, songs, ds):
+         self.ds = ds
+         self.train = train
+         self.test = test
+         self.songs = songs
+         self.m2v_songs = self.songs.m2v.tolist()
+         self.sm2v_songs = self.songs.sm2v.tolist()
+         self.songs_ix = {v: k for k, v in enumerate(songs.index, 0)}
+         self.ix_songs = {k: v for k, v in enumerate(songs.index, 0)}
+         self.ix_users = {v: k for k, v in enumerate(np.concatenate([train.index.values, test.index.values]).tolist(), 0)}
+         self.num_users = len(self.ix_users)
+         self.num_songs = len(songs.index)
+         self.ix_pref = {v: self.u_pref(k) for (k, v) in self.ix_users.items()}
+         self.ix_u_songs = {v: self.unique_songs(k) for (k, v) in self.ix_users.items()}
+
+     def user_sessions(self, user):
+         # Split each test session in half: the first half is the known context,
+         # the second half is the ground truth to predict.
+         history = self.test.loc[user, 'history']
+         return [(s[:len(s)//2], s[len(s)//2:]) for s in history]
+
+     def song_ix(self, song):
+         return self.songs_ix[song]
+
+     def ix_user(self, ix):
+         return self.ix_users[ix]
+
+     def unique_songs(self, user):
+         if user in self.train.index:
+             history = self.train[self.train.index == user]['history'].values[0]
+         if user in self.test.index:
+             history = self.test[self.test.index == user]['history'].values[0]
+         flat_history = [song for session in history for song in session]
+         unique_songs = list(set(flat_history))
+         return unique_songs
+
+     def u_pref(self, user):
+         if user in self.train.index:
+             history = self.train[self.train.index == user]['history'].values[0]
+         if user in self.test.index:
+             history = self.test[self.test.index == user]['history'].values[0]
+         # Only the first half of each session is treated as known listening.
+         history = [s[:len(s)//2] for s in history]
+         flat_history = [song for session in history for song in session]
+         flat_history = [self.songs.loc[song, 'm2v'] for song in flat_history]
+         mean = np.mean(flat_history, axis=0)
+         return mean
+
+     def c_pref(self, songs):
+         flat_vecs = self.songs.loc[songs, 'sm2v'].tolist()
+         return np.mean(np.array(flat_vecs), axis=0)
+
+     def get_n_largest(self, cos, n):
+         songs = self.songs.index.values
+         index = np.argpartition(cos, -n)[-n:]
+         return songs[index]
+
+     def uu_matrix(self):
+         if os.path.isfile('tmp/{}/matrix_users.npy'.format(self.ds)):
+             return np.load('tmp/{}/matrix_users.npy'.format(self.ds))
+
+         matrix_users = np.zeros((self.num_users, self.num_users))
+
+         # The stacked preference matrix does not depend on ix, so build it once.
+         u_array = np.array([self.ix_pref[i] for i in range(self.num_users)])
+         for ix in range(self.num_users):
+             y_array = np.zeros(self.num_users)
+             for j in range(self.num_users):
+                 y_array[j] = math.sqrt(len(self.ix_u_songs[ix]) + len(self.ix_u_songs[j]))
+             cos = cosine_similarity(self.ix_pref[ix].reshape(1, -1), u_array)
+             val = np.sum([cos, y_array], axis=0)
+             matrix_users[ix] = np.divide(np.ones(val.shape), val)
+         np.save('tmp/{}/matrix_users'.format(self.ds), matrix_users)
+         return matrix_users
+
+     def us_matrix(self):
+         if os.path.isfile('tmp/{}/matrix_user_songs.npy'.format(self.ds)):
+             return np.load('tmp/{}/matrix_user_songs.npy'.format(self.ds))
+
+         # Binary user x song matrix: 1 if the user has the song in their history.
+         matrix_u_songs = np.zeros((self.num_users, self.num_songs))
+         for u in list(self.ix_u_songs.keys()):
+             songs = self.ix_u_songs[u]
+             songs_ids = [self.songs_ix[s] for s in songs]
+             y_array = np.zeros(self.num_songs)
+             y_array[songs_ids] = 1
+             matrix_u_songs[u] = y_array
+         np.save('tmp/{}/matrix_user_songs'.format(self.ds), matrix_u_songs)
+         return matrix_u_songs
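To make the expected data layout concrete, here is a toy usage sketch of `Helper`. The DataFrame shapes are inferred from the code above, not documented: songs are indexed by id with one vector column per embedding model, and users are indexed by id with a 'history' column holding a list of sessions, each a list of song ids.

```
import numpy as np
import pandas as pd
from project.recsys.helper import Helper

songs = pd.DataFrame({
    'm2v':  [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
    'sm2v': [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
}, index=['s1', 's2', 's3'])

# One train user with two sessions, one test user with a single session.
train = pd.DataFrame({'history': [[['s1', 's2'], ['s3', 's1']]]}, index=['u1'])
test = pd.DataFrame({'history': [[['s2', 's3']]]}, index=['u2'])

h = Helper(train, test, songs, ds='toy')
print(h.user_sessions('u2'))   # [(['s2'], ['s3'])]: known half vs. ground truth
print(h.u_pref('u1'))          # mean m2v vector over the first half of each session
print(h.get_n_largest(np.array([0.1, 0.9, 0.5]), 2))  # the 2 best-scored song ids
```

Note that uu_matrix and us_matrix cache their results under tmp/<ds>/, so a tmp/toy directory would need to exist before calling them.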